Daily arXiv Papers - 2025-08-15

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Bridging AI Innovation and Healthcare Needs: Lessons Learned from Incorporating Modern NLP at The BC Cancer Registry

Lovedeep Gondara, Gregory Arbour, Raymond Ng, Jonathan Simkin, Shebnum Devji

Main category: cs.CL

TL;DR: The paper discusses lessons from deploying NLP in healthcare, emphasizing problem definition, iterative development, interdisciplinary collaboration, and practical model selection.

Motivation: To improve healthcare efficiency by automating data extraction from clinical documents using NLP, while addressing deployment challenges.

Method: Implemented NLP models for information extraction and classification at BCCR, focusing on iterative development, interdisciplinary collaboration, and pragmatic model selection.

Result: Key lessons include the importance of clear business objectives, data quality, human-in-the-loop validation, and organizational AI literacy.

Conclusion: Practical insights from BCCR’s experience can guide healthcare organizations in successfully implementing AI/NLP solutions to enhance data management and patient care.

Abstract: Automating data extraction from clinical documents offers significant potential to improve efficiency in healthcare settings, yet deploying Natural Language Processing (NLP) solutions presents practical challenges. Drawing upon our experience implementing various NLP models for information extraction and classification tasks at the British Columbia Cancer Registry (BCCR), this paper shares key lessons learned throughout the project lifecycle. We emphasize the critical importance of defining problems based on clear business objectives rather than solely technical accuracy, adopting an iterative approach to development, and fostering deep interdisciplinary collaboration and co-design involving domain experts, end-users, and ML specialists from inception. Further insights highlight the need for pragmatic model selection (including hybrid approaches and simpler methods where appropriate), rigorous attention to data quality (representativeness, drift, annotation), robust error mitigation strategies involving human-in-the-loop validation and ongoing audits, and building organizational AI literacy. These practical considerations, generalizable beyond cancer registries, provide guidance for healthcare organizations seeking to successfully implement AI/NLP solutions to enhance data management processes and ultimately improve patient care and public health outcomes.

[2] A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain

Hugo Massaroli, Leonardo Iara, Emmanuel Iarussi, Viviana Siless

Main category: cs.CL

TL;DR: The paper introduces a transparent, blockchain-based method to evaluate fairness in open-source LLMs, benchmarking models like Llama, DeepSeek, and Mistral using datasets like PISA and StereoSet, with multilingual analysis.

Motivation: Address concerns about fairness in LLMs deployed in high-stakes domains by providing a verifiable and reproducible evaluation protocol.

Method: Uses smart contracts on the ICP blockchain for on-chain HTTP requests to Hugging Face endpoints, storing datasets, prompts, and metrics on-chain. Evaluates models using PISA and StereoSet datasets with fairness metrics like statistical parity and equal opportunity.

Result: Benchmarked models show fairness performance, with multilingual analysis revealing cross-linguistic disparities. All code and results are open-source for community audits.

Conclusion: The proposed method ensures transparent and immutable fairness evaluations, enabling longitudinal tracking and community involvement in improving LLM fairness.

Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain (Foundation, 2023). Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark the Llama, DeepSeek, and Mistral models on the PISA dataset for academic performance prediction (OECD, 2018), a dataset suitable for fairness evaluation using statistical parity and equal opportunity metrics (Hardt et al., 2016). We also evaluate structured Context Association Metrics derived from the StereoSet dataset (Nadeem et al., 2020) to measure social bias in contextual associations. We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark (Salazar et al., 2025), revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
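
For readers unfamiliar with the two group-fairness metrics named above, here is a minimal sketch of their standard definitions (Hardt et al., 2016), assuming binary predictions, binary labels, and a binary group attribute; the function names are illustrative, not taken from the paper's code.

```python
import numpy as np

def statistical_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between groups."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute difference in true-positive rates between groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

# A gap of 0 means the model treats both groups identically on that metric.
```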

[3] Thematic and Task-Based Categorization of K-12 GenAI Usages with Hierarchical Topic Modeling

Johannes Schneider, Béatrice S. Hasler, Michaela Varrone, Fabian Hoya, Thomas Schroffenegger, Dana-Kristin Mah, Karl Peböck

Main category: cs.CL

TL;DR: The paper analyzes classroom interaction data of minors using a novel topic modeling approach, categorizing over 17,000 messages by content and tasks. It highlights the limitations of prior methods and introduces LLMs for better hierarchical topic structures, offering insights for GenAI usage while raising concerns for future research.

Motivation: To address gaps in prior works, which lack comprehensive content or thematic categorization and real-world K-12 data, the study aims to provide a detailed analysis of classroom interactions involving students, teachers, and ChatGPT.

Method: A novel topic modeling approach is employed to categorize messages into content (e.g., nature, people) and tasks (e.g., writing, explaining). State-of-the-art LLMs with pre-processing are used to achieve hierarchical topic structures.

Result: The analysis reveals novel applications and insights, demonstrating that traditional computational methods underperform compared to LLMs for hierarchical topic modeling.

Conclusion: The findings enrich GenAI usage for researchers, teachers, and students, while also identifying concerns and open questions for future research.

Abstract: We analyze anonymous interaction data of minors in classrooms spanning several months, schools, and subjects, employing a novel, simple topic modeling approach. Specifically, we categorize more than 17,000 messages generated by students, teachers, and ChatGPT in two dimensions: content (such as nature and people) and tasks (such as writing and explaining). Our hierarchical categorization, done separately for each dimension, includes exemplary prompts and provides both a high-level overview and tangible insights. Prior works mostly lack a content or thematic categorization. While task categorizations are more prevalent in education, most have not been supported by real-world data for K-12. In turn, it is not surprising that our analysis yielded a number of novel applications. In deriving these insights, we found that many well-established classical and emerging computational methods for analyzing large amounts of text, i.e., topic modeling, underperform, leading us to directly apply state-of-the-art LLMs with adequate pre-processing, achieving hierarchical topic structures with better human alignment than prior approaches through explicit instructions. Our findings support fellow researchers, teachers, and students in enriching the usage of GenAI, while our discussion also highlights a number of concerns and open questions for future research.

[4] INTIMA: A Benchmark for Human-AI Companionship Behavior

Lucie-Aimée Kaffee, Giada Pistilli, Yacine Jernite

Main category: cs.CL

TL;DR: The paper introduces INTIMA, a benchmark for evaluating AI companionship behaviors in language models, revealing a dominance of companionship-reinforcing behaviors and inconsistencies in handling emotional interactions.

Motivation: To address the growing trend of AI companionship and its implications by creating a standardized benchmark for evaluating companionship behaviors in language models.

Method: Developed INTIMA, a benchmark with 31 behaviors across four categories and 368 prompts, evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Tested on models like Gemma-3, Phi-4, o3-mini, and Claude-4.

Result: Companionship-reinforcing behaviors were more common across models, with notable differences in sensitive categories. Commercial providers varied in prioritization, raising concerns for user well-being.

Conclusion: The findings emphasize the need for consistent approaches in managing emotionally charged AI interactions to ensure user well-being.

Abstract: AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.

[5] XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs

Yuzhuo Xiao, Zeyu Han, Yuhan Wang, Huaizu Jiang

Main category: cs.CL

TL;DR: The paper introduces XFacta, a contemporary dataset for evaluating multimodal misinformation detection, and analyzes MLLM-based strategies to improve detection robustness.

Motivation: The need for effective multimodal misinformation detection due to outdated or synthetic datasets and unclear bottlenecks in existing methods.

Method: Introduces XFacta dataset, evaluates MLLM-based detection strategies, and proposes a semi-automatic framework for continuous updates.

Result: Provides insights into MLLM-based model performance and benchmarks against existing methods.

Conclusion: XFacta and the proposed framework advance multimodal misinformation detection, with released code and data.

Abstract: The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential in addressing this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further progress in this field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios as MLLMs can simply memorize these events, or are artificially synthetic, failing to reflect real-world misinformation patterns. Additionally, the field lacks comprehensive analyses of MLLM-based model design strategies. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating MLLM-based detectors. We systematically evaluate various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, as well as benchmarking against existing detection methods. Building on these analyses, we further develop a semi-automatic detection-in-the-loop framework that continuously updates XFacta with new content to maintain its contemporary relevance. Our analysis provides valuable insights and practices for advancing the field of multimodal misinformation detection. The code and data have been released.

[6] AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification

Chenhao Xue, Yuanzhe Jin, Adrian Carrasco-Revilla, Joyraj Chakraborty, Min Chen

Main category: cs.CL

TL;DR: The paper proposes using LLMs to generate synthetic data for text classification when real data is scarce, and introduces an automated workflow to optimize synthetic data effectiveness.

Motivation: Addressing the challenge of insufficient labeled data for text classification in real-world applications.

Method: Utilizing LLMs to generate synthetic data and developing an automated workflow with three search strategies to find effective synthetic data. An ensemble algorithm selects the best strategy based on class characteristics.

Result: The ensemble approach outperforms individual strategies in improving classification models using synthetic data.

Conclusion: The automated workflow with ensemble strategy effectively enhances text classification models when real data is limited.

Abstract: When developing text classification models for real-world applications, one major challenge is the difficulty of collecting sufficient data for all text classes. In this work, we address this challenge by utilizing large language models (LLMs) to generate synthetic data and using such data to improve the performance of the models without waiting for more real data to be collected and labelled. As an LLM generates different synthetic data in response to different input examples, we formulate an automated workflow, which searches for input examples that lead to more "effective" synthetic data for improving the model concerned. We study three search strategies with an extensive set of experiments, and use the experimental results to inform an ensemble algorithm that selects a search strategy according to the characteristics of a class. Our further experiments demonstrate that this ensemble approach is more effective than each individual strategy in our automated workflow for improving classification models using LLMs.

[7] HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for Evidence-Based Political Claim Verification in Hinglish

Rakesh Thakur, Sneha Sharma, Gauri Chopra

Main category: cs.CL

TL;DR: The paper introduces HiFACT, a benchmark dataset for fact-checking in Hinglish, and proposes a graph-aware model for multilingual, code-mixed fact verification, outperforming existing baselines.

Motivation: Addressing the lack of fact-checking tools for code-mixed, low-resource languages like Hinglish, especially in politically diverse regions like India.

Method: A novel graph-aware, retrieval-augmented model combining multilingual encoding, semantic alignment, evidence graph construction, graph neural reasoning, and explanation generation.

Result: HiFACTMix outperforms state-of-the-art multilingual baselines in accuracy and provides faithful justifications.

Conclusion: This work pioneers multilingual, code-mixed, and politically grounded fact verification research.

Abstract: Fact-checking in code-mixed, low-resource languages such as Hinglish remains an underexplored challenge in natural language processing. Existing fact-verification systems largely focus on high-resource, monolingual settings and fail to generalize to real-world political discourse in linguistically diverse regions like India. Given the widespread use of Hinglish by public figures, particularly political figures, and the growing influence of social media on public opinion, there’s a critical need for robust, multilingual, and context-aware fact-checking tools. To address this gap, a novel benchmark dataset, HiFACT, is introduced with 1,500 real-world factual claims made by 28 Indian state Chief Ministers in Hinglish, under a highly code-mixed, low-resource setting. Each claim is annotated with textual evidence and veracity labels. To evaluate this benchmark, a novel graph-aware, retrieval-augmented fact-checking model is proposed that combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation. Experimental results show that HiFACTMix outperforms state-of-the-art multilingual baseline models in accuracy and provides faithful justifications for its verdicts. This work opens a new direction for multilingual, code-mixed, and politically grounded fact verification research.

[8] Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts

Hojun Jin, Eunsoo Hong, Ziwon Hyung, Sungjun Lim, Seungjin Lee, Keunseok Cho

Main category: cs.CL

TL;DR: The paper introduces Supervised Mixture of Experts (S-MoE) to mitigate task interference in multi-task learning by using guiding tokens to route tasks to dedicated experts, improving performance.

Motivation: Hard-parameter sharing in multi-task learning often causes task interference, reducing model performance.

Method: Proposes S-MoE, which replaces gating functions with guiding tokens to assign tasks to separate feedforward networks, avoiding hard-parameter sharing.

Result: Applied to a speech-to-text model jointly performing ASR and ST, S-MoE achieves a 6.35% relative improvement in Word Error Rate (WER).

Conclusion: S-MoE effectively reduces task interference and enhances multi-task model performance.

Abstract: Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
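
To make the routing idea concrete, below is a minimal PyTorch-style sketch: the guiding token deterministically selects one task-specific feed-forward expert, so no gating network needs to be trained. Class and argument names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SupervisedMoELayer(nn.Module):
    """Each task's guiding token routes inputs to a dedicated FFN expert."""
    def __init__(self, d_model: int, d_ff: int, tasks=("asr", "st")):
        super().__init__()
        self.experts = nn.ModuleDict({
            task: nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            for task in tasks
        })

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        # 'task' is read off the guiding token prepended to the input,
        # replacing the learned gating function of a standard MoE.
        return self.experts[task](x)
```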

[9] Semantic Structure in Large Language Model Embeddings

Austin C. Kozlowski, Callin Dai, Andrei Boutyline

Main category: cs.CL

TL;DR: The paper shows that semantic associations in LLM embeddings mirror human word ratings, reducing to a 3D subspace with similar patterns. Semantic features are entangled in LLMs like human language, and understanding this structure is key to avoiding unintended effects when manipulating features.

Motivation: To explore how semantic associations in LLMs compare to human word ratings and understand the dimensionality and entanglement of semantic features in LLMs.

Method: Analyzed projections of words on semantic directions (e.g., antonym pairs) in LLM embeddings and compared them to human ratings. Examined the dimensionality and effects of shifting tokens along semantic directions.

Result: LLM embeddings exhibit a 3D semantic subspace resembling human patterns. Shifting tokens along semantic directions causes off-target effects aligned with cosine similarity.

Conclusion: Semantic features in LLMs are low-dimensional and entangled similarly to human language. Accounting for this structure is crucial for avoiding unintended consequences in feature manipulation.

Abstract: Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the semantic associations encoded in the embedding matrices of large language models (LLMs) exhibit a similar structure. We show that the projections of words on semantic directions defined by antonym pairs (e.g. kind - cruel) correlate highly with human ratings, and further find that these projections effectively reduce to a 3-dimensional subspace within LLM embeddings, closely resembling the patterns derived from human survey responses. Moreover, we find that shifting tokens along one semantic direction causes off-target effects on geometrically aligned features proportional to their cosine similarity. These findings suggest that semantic features are entangled within LLMs similarly to how they are interconnected in human language, and a great deal of semantic information, despite its apparent complexity, is surprisingly low-dimensional. Furthermore, accounting for this semantic structure may prove essential for avoiding unintended consequences when steering features.
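
As an illustration of the projection described above, the sketch below scores a word along an antonym-defined direction; it assumes `emb` maps words to embedding vectors and is not the authors' code.

```python
import numpy as np

def semantic_projection(emb, word, pos="kind", neg="cruel"):
    """Project a word's embedding onto the direction from 'neg' to 'pos'."""
    direction = emb[pos] - emb[neg]
    direction = direction / np.linalg.norm(direction)
    return float(emb[word] @ direction)  # larger => closer to 'pos'

# Example with toy 3-d vectors:
emb = {"kind":   np.array([ 1.0, 0.2, 0.0]),
       "cruel":  np.array([-1.0, 0.1, 0.0]),
       "gentle": np.array([ 0.8, 0.3, 0.1])}
print(semantic_projection(emb, "gentle"))  # positive: leans toward 'kind'
```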

[10] User Perception of Attention Visualizations: Effects on Interpretability Across Evidence-Based Medical Documents

Andrés Carvallo, Denis Parra, Peter Brusilovsky, Hernan Valdivieso, Gabriel Rada, Ivania Donoso, Vladimir Araujo

Main category: cs.CL

TL;DR: The paper evaluates the usefulness of attention weights in Transformer models for explaining predictions in biomedical document classification, finding that visualization methods impact perceived helpfulness.

Motivation: To assess whether attention weights aid explainability in AI systems for biomedical literature and how visualization affects their utility.

Method: A user study with medical experts classifying articles by study design, using XLNet and varying attention visualization methods.

Result: Attention weights were not broadly helpful for explanations, but visualization style (e.g., text brightness) influenced perceived usefulness.

Conclusion: Attention weights’ explanatory value depends on visualization, with intuitive formats preferred over precise encodings.

Abstract: The attention mechanism is a core component of the Transformer architecture. Beyond improving performance, attention has been proposed as a mechanism for explainability via attention weights, which are associated with input features (e.g., tokens in a document). In this context, larger attention weights may imply more relevant features for the model’s prediction. In evidence-based medicine, such explanations could support physicians’ understanding and interaction with AI systems used to categorize biomedical literature. However, there is still no consensus on whether attention weights provide helpful explanations. Moreover, little research has explored how visualizing attention affects its usefulness as an explanation aid. To bridge this gap, we conducted a user study to evaluate whether attention-based explanations support users in biomedical document classification and whether there is a preferred way to visualize them. The study involved medical experts from various disciplines who classified articles based on study design (e.g., systematic reviews, broad synthesis, randomized and non-randomized trials). Our findings show that the Transformer model (XLNet) classified documents accurately; however, the attention weights were not perceived as particularly helpful for explaining the predictions. This perception, though, varied significantly depending on how attention was visualized. Contrary to Munzner’s principle of visual effectiveness, which favors precise encodings like bar length, users preferred more intuitive formats, such as text brightness or background color. While our results do not confirm the overall utility of attention weights for explanation, they suggest that their perceived helpfulness is influenced by how they are visually presented.
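
As a toy illustration of the visual encodings compared in the study, the sketch below maps normalized attention weights to background-color opacity over tokens; it is a generic example, not the study's actual interface.

```python
def highlight_tokens(tokens, weights):
    """Render tokens as HTML spans whose background opacity encodes
    each token's attention weight (a 'background color' encoding)."""
    w_max = max(weights) or 1.0  # guard against all-zero weights
    spans = (
        f'<span style="background: rgba(255, 200, 0, {w / w_max:.2f})">{tok}</span>'
        for tok, w in zip(tokens, weights)
    )
    return " ".join(spans)

# Example:
print(highlight_tokens(["randomized", "controlled", "trial"], [0.7, 0.2, 0.9]))
```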

[11] From Answers to Questions: EQGBench for Evaluating LLMs’ Educational Question Generation

Chengliang Zhou, Mei Wang, Ting Zhang, Qiannan Zhu, Jian Li, Hua Huang

Main category: cs.CL

TL;DR: The paper introduces EQGBench, a benchmark for evaluating LLMs in Chinese Educational Question Generation (EQG), highlighting challenges and gaps in generating pedagogically valuable questions.

Motivation: To advance EQG and improve LLMs' ability to generate educationally effective questions, addressing underexplored challenges.

Method: Introduces EQGBench, a five-dimensional evaluation framework with a dataset of 900 samples across middle school disciplines (math, physics, chemistry).

Result: Evaluation of 46 mainstream LLMs shows significant room for improvement in generating educationally valuable questions.

Conclusion: EQGBench provides a foundation for enhancing LLMs’ EQG capabilities, emphasizing the need for further development in this area.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in mathematical problem-solving. However, the transition from providing answers to generating high-quality educational questions presents significant challenges that remain underexplored. To advance Educational Question Generation (EQG) and facilitate LLMs in generating pedagogically valuable and educationally effective questions, we introduce EQGBench, a comprehensive benchmark specifically designed for evaluating LLMs’ performance in Chinese EQG. EQGBench establishes a five-dimensional evaluation framework supported by a dataset of 900 evaluation samples spanning three fundamental middle school disciplines: mathematics, physics, and chemistry. The dataset incorporates user queries with varying knowledge points, difficulty gradients, and question type specifications to simulate realistic educational scenarios. Through systematic evaluation of 46 mainstream large models, we reveal significant room for development in generating questions that reflect educational value and foster students’ comprehensive abilities.

[12] Automated scoring of the Ambiguous Intentions Hostility Questionnaire using fine-tuned large language models

Y. Lyu, D. Combs, D. Neumann, Y. C. Leong

Main category: cs.CL

TL;DR: Large language models can automate scoring of the AIHQ for hostile attribution bias, aligning well with human ratings and showing potential for research and clinical use.

Motivation: To address the time-intensive human scoring of AIHQ open-ended responses by exploring automation via large language models.

Method: Fine-tuned models on human-rated AIHQ responses from TBI and HC groups, tested on remaining responses, and validated on an independent dataset.

Result: Model-generated ratings aligned with human ratings, replicated group differences, and generalized to nonclinical data.

Conclusion: Large language models can effectively automate AIHQ scoring, aiding psychological assessments in diverse populations.

Abstract: Hostile attribution bias is the tendency to interpret social interactions as intentionally hostile. The Ambiguous Intentions Hostility Questionnaire (AIHQ) is commonly used to measure hostile attribution bias, and includes open-ended questions where participants describe the perceived intentions behind a negative social situation and how they would respond. While these questions provide insights into the contents of hostile attributions, they require time-intensive scoring by human raters. In this study, we assessed whether large language models can automate the scoring of AIHQ open-ended responses. We used a previously collected dataset in which individuals with traumatic brain injury (TBI) and healthy controls (HC) completed the AIHQ and had their open-ended responses rated by trained human raters. We used half of these responses to fine-tune the two models on human-generated ratings, and tested the fine-tuned models on the remaining half of AIHQ responses. Results showed that model-generated ratings aligned with human ratings for both attributions of hostility and aggression responses, with fine-tuned models showing higher alignment. This alignment was consistent across ambiguous, intentional, and accidental scenario types, and replicated previous findings on group differences in attributions of hostility and aggression responses between TBI and HC groups. The fine-tuned models also generalized well to an independent nonclinical dataset. To support broader adoption, we provide an accessible scoring interface that includes both local and cloud-based options. Together, our findings suggest that large language models can streamline AIHQ scoring in both research and clinical contexts, revealing their potential to facilitate psychological assessments across different populations.

[13] Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen

Main category: cs.CL

TL;DR: The paper explores strategies to improve open-source LLMs’ data analysis capabilities, identifying strategic planning as key, and introduces a data synthesis method for better performance.

Motivation: Open-source LLMs struggle with reasoning-intensive tasks like data analysis, prompting the need for enhancement strategies.

Method: Curated a diverse dataset to evaluate LLMs on data understanding, code generation, and strategic planning, then developed a data synthesis methodology.

Result: Found strategic planning is critical; interaction design and data quality significantly impact reasoning. Improved open-source LLMs’ analytical reasoning.

Conclusion: Data synthesis methodology enhances open-source LLMs’ data analysis, with strategic planning and data quality being pivotal.

Abstract: Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities. Code is available at https://github.com/zjunlp/DataMind.

[14] Multidimensional classification of posts for online course discussion forum curation

Antonio Leandro Martins Candido, Jose Everardo Bessa Maia

Main category: cs.CL

TL;DR: Bayesian fusion improves forum curation by combining pre-trained LLM scores with local classifier scores, avoiding costly fine-tuning.

Motivation: Avoid resource-intensive retraining of LLMs for forum curation by proposing an efficient alternative.

Method: Use Bayesian fusion to combine pre-trained LLM and local classifier scores.

Result: Fusion outperforms individual classifiers and matches fine-tuning performance.

Conclusion: Bayesian fusion is a cost-effective alternative to fine-tuning for forum curation.

Abstract: The automatic curation of discussion forums in online courses requires constant updates, making frequent retraining of Large Language Models (LLMs) a resource-intensive process. To circumvent the need for costly fine-tuning, this paper proposes and evaluates the use of Bayesian fusion. The approach combines the multidimensional classification scores of a pre-trained generic LLM with those of a classifier trained on local data. The performance comparison demonstrates that the proposed fusion improves results over each classifier individually and is competitive with the LLM fine-tuning approach.
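
As a concrete illustration, one standard form of Bayesian score fusion assumes the two classifiers are conditionally independent given the class; the paper's exact formulation may differ, and the numbers below are made up.

```python
import numpy as np

def bayesian_fusion(p_llm, p_local, prior):
    """Fuse two per-class probability vectors under a naive
    conditional-independence assumption: P(c|a,b) ∝ P(c|a)P(c|b)/P(c)."""
    p_llm, p_local, prior = map(np.asarray, (p_llm, p_local, prior))
    unnorm = p_llm * p_local / prior
    return unnorm / unnorm.sum()

# Example: three post categories with a uniform prior.
print(bayesian_fusion([0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [1/3] * 3))
```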

[15] An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs

Ayana Hussain, Patrick Zhao, Nicholas Vincent

Main category: cs.CL

TL;DR: The paper explores LLM-produced jailbreak attacks causing harmful medical misinformation, compares it to social media misinformation, and evaluates detection methods.

Motivation: To understand the risks and potential of LLMs in generating and detecting misinformation, especially in medical contexts.

Method: Analyzed 109 jailbreak attacks on three LLMs, compared attack prompts to real-world queries, and evaluated misinformation detection.

Result: LLMs can detect misinformation effectively, and their careful use can improve the information ecosystem.

Conclusion: LLMs, despite risks, can contribute positively to misinformation detection and prevention with proper design.

Abstract: Large Language Models (LLMs) are a double-edged sword capable of generating harmful misinformation – inadvertently, or when prompted by “jailbreak” attacks that attempt to produce malicious outputs. LLMs could, with additional research, be used to detect and prevent the spread of misinformation. In this paper, we investigate the efficacy and characteristics of LLM-produced jailbreak attacks that cause other models to produce harmful medical misinformation. We also study how misinformation generated by jailbroken LLMs compares to typical misinformation found on social media, and how effectively it can be detected using standard machine learning approaches. Specifically, we closely examine 109 distinct attacks against three target LLMs and compare the attack prompts to in-the-wild health-related LLM queries. We also examine the resulting jailbreak responses, comparing the generated misinformation to health-related misinformation on Reddit. Our findings add more evidence that LLMs can be effectively used to detect misinformation from both other LLMs and from people, and support a body of work suggesting that with careful design, LLMs can contribute to a healthier overall information ecosystem.

[16] Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition

Leonora Vesterbacka, Faton Rekathati, Robin Kurtz, Justyna Sikora, Agnes Toftgård

Main category: cs.CL

TL;DR: Fine-tuned Whisper models for Swedish show significant performance improvements, with a 47% WER reduction compared to OpenAI’s whisper-large-v3.

Motivation: Address underrepresentation of mid-resourced languages like Swedish in multilingual datasets.

Method: Fine-tuned Whisper models on a large and diverse Swedish dataset.

Result: 47% average WER reduction across evaluations on FLEURS, Common Voice, and NST.

Conclusion: Fine-tuning multilingual models for mid-resourced languages yields substantial performance gains.

Abstract: This work presents a suite of fine-tuned Whisper models for Swedish, trained on a dataset of unprecedented size and variability for this mid-resourced language. As smaller languages are often underrepresented in multilingual training datasets, substantial performance improvements can be achieved by fine-tuning existing multilingual models, as shown in this work. We report an overall improvement across model sizes compared to OpenAI’s Whisper evaluated on Swedish. Most notably, our best-performing model achieves an average 47% reduction in WER compared to OpenAI’s whisper-large-v3 in evaluations across FLEURS, Common Voice, and NST.
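
For reference, word error rate and the relative reduction quoted above follow the standard definitions (general background, not specific to this paper):

```latex
\mathrm{WER} = \frac{S + D + I}{N},
\qquad
\text{relative reduction} = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{fine-tuned}}}{\mathrm{WER}_{\text{baseline}}}
```

where S, D, and I count substituted, deleted, and inserted words against a reference transcript of N words. A 47% average relative reduction means the fine-tuned models roughly halve whisper-large-v3's error rate on the same test sets.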

[17] Evaluation of GPT-based large language generative AI models as study aids for the national licensure examination for registered dietitians in Japan

Yuta Nagamori, Mikoto Kosai, Yuji Kawai, Haruka Marumo, Misaki Shibuya, Tatsuya Negishi, Masaki Imanishi, Yasumasa Ikeda, Koichiro Tsuchiya, Asuka Sawai, Licht Miyamoto

Main category: cs.CL

TL;DR: The study evaluates LLM-based AI models (ChatGPT and Bing variants) as study aids for Japanese dietitian exams, finding some models barely pass but all lack consistency and robustness.

Motivation: To assess the potential of generative AI in nutritional education, specifically for the Japanese dietitian licensure exam, given its underexplored performance in this field.

Method: Used exam questions as prompts for ChatGPT and Bing models (Precise, Creative, Balanced), analyzing accuracy, consistency, and response time. Tested prompt engineering for improvements.

Result: Bing-Precise (66.2%) and Bing-Creative (61.4%) passed (60% threshold), while others failed. All models lacked consistency and robustness, with minimal improvement from prompt engineering.

Conclusion: Current AI models show limited reliability for dietitian exam preparation, needing further advancements for stable and consistent performance.

Abstract: Generative artificial intelligence (AI) based on large language models (LLMs), such as ChatGPT, has demonstrated remarkable progress across various professional fields, including medicine and education. However, their performance in nutritional education, especially in Japanese national licensure examination for registered dietitians, remains underexplored. This study aimed to evaluate the potential of current LLM-based generative AI models as study aids for nutrition students. Questions from the Japanese national examination for registered dietitians were used as prompts for ChatGPT and three Bing models (Precise, Creative, Balanced), based on GPT-3.5 and GPT-4. Each question was entered into independent sessions, and model responses were analyzed for accuracy, consistency, and response time. Additional prompt engineering, including role assignment, was tested to assess potential performance improvements. Bing-Precise (66.2%) and Bing-Creative (61.4%) surpassed the passing threshold (60%), while Bing-Balanced (43.3%) and ChatGPT (42.8%) did not. Bing-Precise and Bing-Creative generally outperformed others across subject fields except Nutrition Education, where all models underperformed. None of the models consistently provided the same correct responses across repeated attempts, highlighting limitations in answer stability. ChatGPT showed greater consistency in response patterns but lower accuracy. Prompt engineering had minimal effect, except for modest improvement when correct answers and explanations were explicitly provided. While some generative AI models marginally exceeded the passing threshold, overall accuracy and answer consistency remained suboptimal. Moreover, all the models demonstrated notable limitations in answer consistency and robustness. Further advancements are needed to ensure reliable and stable AI-based study aids for dietitian licensure preparation.

[18] Marco-Voice Technical Report

Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang

Main category: cs.CL

TL;DR: The paper introduces Marco-Voice, a multifunctional speech synthesis system integrating voice cloning and emotion control, achieving expressive and natural speech while preserving speaker identity.

Motivation: To address challenges in expressive, controllable, and natural speech generation that maintains speaker identity across diverse contexts.

Method: Uses speaker-emotion disentanglement with in-batch contrastive learning and rotational emotional embedding for smooth emotion control.

Result: Marco-Voice shows significant improvements in speech clarity and emotional richness, validated by objective and subjective metrics.

Conclusion: The system represents a substantial advance in expressive neural speech synthesis, with publicly available code and dataset.

Abstract: This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controlled speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotional embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluations and analyses show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis. Our code and dataset are publicly available at https://github.com/AIDC-AI/Marco-Voice and https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively.

[19] Guided Navigation in Knowledge-Dense Environments: Structured Semantic Exploration with Guidance Graphs

Dehao Tao, Guangjie Liu, Weizheng, Yongfeng Huang, Minghu Jiang

Main category: cs.CL

TL;DR: GG Explore introduces a Guidance Graph to bridge unstructured queries and structured knowledge retrieval, improving efficiency and performance in knowledge-intensive tasks.

Motivation: LLMs' static knowledge and opaque reasoning limit performance in knowledge tasks; current KG exploration methods face trade-offs between granularity and contextual leverage.

Method: Proposes GG Explore with a Guidance Graph, Structural Alignment, and Context Aware Pruning for precise knowledge retrieval.

Result: Outperforms SOTA in efficiency and performance, especially in complex tasks, and works well with smaller LLMs.

Conclusion: GG Explore effectively addresses limitations of LLMs and KG methods, offering practical value.

Abstract: While Large Language Models (LLMs) exhibit strong linguistic capabilities, their reliance on static knowledge and opaque reasoning processes limits their performance in knowledge-intensive tasks. Knowledge graphs (KGs) offer a promising solution, but current exploration methods face a fundamental trade-off: question-guided approaches incur redundant exploration due to granularity mismatches, while clue-guided methods fail to effectively leverage contextual information for complex scenarios. To address these limitations, we propose Guidance Graph guided Knowledge Exploration (GG Explore), a novel framework that introduces an intermediate Guidance Graph to bridge unstructured queries and structured knowledge retrieval. The Guidance Graph defines the retrieval space by abstracting the target knowledge’s structure while preserving broader semantic context, enabling precise and efficient exploration. Building upon the Guidance Graph, we develop: (1) Structural Alignment that filters incompatible candidates without LLM overhead, and (2) Context Aware Pruning that enforces semantic consistency with graph constraints. Extensive experiments show our method achieves superior efficiency and outperforms SOTA, especially on complex tasks, while maintaining strong performance with smaller LLMs, demonstrating practical value.

[20] Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis

Linqing Chen, Hanmeng Zhong, Wentao Wu, Weilei Wang

Main category: cs.CL

TL;DR: Semantic Bridge is a universal framework for generating complex multi-hop reasoning questions from sparse sources, improving LLM training data quality.

Motivation: Address the scarcity of high-quality reasoning-intensive question-answer pairs for LLM training, especially from domain-specific sources.

Method: Uses semantic graph weaving (entity, predicate chain, and causal bridging) with AMR-driven analysis for controllable question generation.

Result: Achieves 9.5% better round-trip quality, outperforms baselines by 18.3%-25.4%, and surpasses human annotations with fewer materials.

Conclusion: Semantic Bridge enables controllable generation of targeted reasoning questions, advancing LLM training paradigms.

Abstract: Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns, fundamentally failing to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex pathways across documents, with fine-grained control over complexity and types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine). It yields consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 native human annotation examples with 67% fewer materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.

[21] PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?

Lingfeng Zhou, Jialing Zhang, Jin Gao, Mohan Jiang, Dequan Wang

Main category: cs.CL

TL;DR: The paper introduces PersonaEval, a benchmark to test LLMs’ ability to identify human roles in dialogues, showing they underperform humans significantly.

Motivation: Current LLM-as-a-judge paradigms lack validation for role fidelity, necessitating role identification for human-aligned evaluation.

Method: PersonaEval uses human-authored dialogues to challenge LLMs in identifying personas, comparing their performance to humans.

Result: Best-performing LLMs achieve 69% accuracy, far below humans’ 90.8%, indicating LLMs’ inadequacy for reliable role-play evaluation.

Conclusion: Reliable evaluation requires human-like reasoning in LLMs, beyond task-specific tuning; the benchmark is publicly released.

Abstract: Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning, but depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at https://github.com/maple-zhou/PersonaEval.

[22] RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

Enzhi Wang, Qicheng Li, Shiwan Zhao, Aobo Kong, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin

Main category: cs.CL

TL;DR: The paper introduces RealTalk-CN, a Chinese speech-text dual-modal TOD dataset, addressing gaps in existing datasets by including speech disfluencies and speaker variations. It also proposes a cross-modal chat task for dynamic interaction evaluation.

Motivation: Existing TOD datasets lack real speech signals and are predominantly text-based or English-only, missing critical aspects like speech disfluencies and speaker variations.

Method: The authors introduce RealTalk-CN, a dataset with 5.4k dialogues (60K utterances, 150 hours) featuring paired speech-text annotations, and propose a cross-modal chat task for dynamic interaction evaluation.

Result: Extensive experiments validate RealTalk-CN’s effectiveness, covering robustness to speech disfluencies, speaker sensitivity, and cross-domain performance.

Conclusion: RealTalk-CN establishes a strong foundation for Chinese speech-based LLMs research by addressing dataset limitations and introducing a novel evaluation task.

Abstract: In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for Chinese speech-based LLMs research.

[23] Training-Free Multimodal Large Language Model Orchestration

Tianyu Xie, Yuhang Wu, Yongdong Luo, Jiayi Ji, Xiawu Zheng

Main category: cs.CL

TL;DR: MLLM Orchestration enables multimodal AI systems without training, improving efficiency and interpretability.

Motivation: Existing MLLMs cannot be integrated into a unified multimodal system without additional training, posing challenges in modal alignment and efficiency.

Method: Uses a central LLM controller, parallel Text-to-Speech, and cross-modal memory for dynamic task routing and interaction.

Result: Achieves 7.8% performance gain, 10.3% latency reduction, and better interpretability.

Conclusion: MLLM Orchestration offers a modular, efficient, and interpretable solution for multimodal AI without training.

Abstract: Different Multimodal Large Language Models (MLLMs) cannot be integrated into a unified multimodal input-output system directly. In previous work, training has been considered as an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, performance improvements of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduced latency by 10.3%, and significantly enhanced interpretability through explicit orchestration processes.

[24] A Rose by Any Other Name Would Smell as Sweet: Categorical Homotopy Theory for Large Language Models

Sridhar Mahadevan

Main category: cs.CL

TL;DR: The paper introduces a categorical homotopy framework to address the issue of LLMs generating different probabilities for semantically equivalent statements.

Motivation: LLMs often produce inconsistent next-token probabilities for semantically equivalent statements, which is a problem for their reliability and consistency.

Method: The authors propose using a categorical homotopy framework, specifically an LLM Markov category, to model probability distributions in language. They address the challenge of non-isomorphic arrows for equivalent rephrases by employing weak equivalences.

Result: The framework leverages categorical homotopy techniques, including higher algebraic K-theory and model categories, to theoretically unify equivalent rephrases in LLMs.

Conclusion: The paper provides a theoretical foundation for improving LLM consistency by applying advanced categorical homotopy methods.

Abstract: Natural language is replete with superficially different statements, such as "Charles Darwin wrote" and "Charles Darwin is the author of", which carry the same meaning. Large language models (LLMs) should generate the same next-token probabilities in such cases, but usually do not. Empirical workarounds have been explored, such as using k-NN estimates of sentence similarity to produce smoothed estimates. In this paper, we tackle this problem more abstractly, introducing a categorical homotopy framework for LLMs. We introduce an LLM Markov category to represent probability distributions in language generated by an LLM, where the probability of a sentence, such as "Charles Darwin wrote", is defined by an arrow in a Markov category. However, this approach runs into difficulties, as language is full of equivalent rephrases, and each generates a non-isomorphic arrow in the LLM Markov category. To address this fundamental problem, we use categorical homotopy techniques to capture "weak equivalences" in an LLM Markov category. We present a detailed overview of the application of categorical homotopy to LLMs, from higher algebraic K-theory to model categories, building on powerful theoretical results developed over the past half-century.

[25] Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning

Li Wang, Changhao Zhang, Zengqi Xiu, Kai Lu, Xin Yu, Kui Zhang, Wenjun Wu

Main category: cs.CL

TL;DR: The paper introduces DURIT, a framework to improve reasoning in Small Language Models (SLMs) by decoupling understanding from reasoning, mapping problems into a canonical space for standardized inputs.

Motivation: Improving reasoning in SLMs is challenging due to linguistic variability and complexity, which hinders optimization for models with limited capacity.

Method: Proposes DURIT: a three-step algorithm involving reinforcement learning for problem mapping, self-distillation for aligning reasoning, and training reasoning policies in a canonical problem space.

Result: DURIT significantly enhances SLMs’ performance and robustness in mathematical and logical reasoning tasks, both in-domain and out-of-domain.

Conclusion: Decoupling understanding from reasoning is an effective strategy for strengthening SLMs, as demonstrated by DURIT’s success.

Abstract: Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., $\leq$ 1.5B) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively (1) maps natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs’ performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating the decoupling of understanding from reasoning as an effective strategy for strengthening SLMs.
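
A toy sketch of the alternating loop described above; the three stand-in functions only mark where the RL-trained mapper, the self-distillation step, and the reasoning-policy update would sit, and are not the paper's code:

```python
# Toy sketch of DURIT's alternating loop (structure only; the update
# functions are stand-ins, not the paper's actual training code).
def map_to_canonical(problem: str) -> str:
    """Step 1 stand-in: strip surface noise down to a canonical form."""
    return problem.lower().replace("please", "").strip()

def self_distill(trajectory: str) -> str:
    """Step 2 stand-in: align a reasoning trajectory to the canonical space."""
    return trajectory

def train_reasoner(canonical: str) -> str:
    """Step 3 stand-in: one reasoning-policy update in the canonical space."""
    return f"solved({canonical})"

problems = ["Please compute 2+2", "What is 2+2, please?"]
for round_ in range(2):                      # mapper and reasoner alternate
    for p in problems:
        canonical = map_to_canonical(p)      # (1) RL-trained mapping
        trace = self_distill(canonical)      # (2) self-distillation
        print(train_reasoner(trace))         # (3) policy training
```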

[26] FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models

Chuan Li, Qianyi Zhao, Fengran Mo, Cen Chen

Main category: cs.CL

TL;DR: FedCoT enhances reasoning in federated learning for LLMs, balancing performance, privacy, and interpretability, especially in healthcare.

DetailsMotivation: Addressing the challenge of improving LLM reasoning in federated settings without violating privacy or compromising interpretability, particularly for healthcare applications.

Method: Proposes FedCoT, a framework with a lightweight chain-of-thought mechanism and improved aggregation using LoRA module stacking and client classifier-awareness.

Result: FedCoT significantly boosts reasoning performance under resource constraints while preserving privacy, as shown in medical reasoning tasks.

Conclusion: FedCoT effectively enhances reasoning in federated learning, offering a practical solution for privacy-sensitive domains like healthcare.

Abstract: Efficiently enhancing the reasoning capabilities of large language models (LLMs) in federated learning environments remains challenging, particularly when balancing performance gains with strict computational, communication, and privacy constraints. This challenge is especially acute in healthcare, where decisions, spanning clinical, operational, and patient-facing contexts, demand not only accurate outputs but also interpretable, traceable rationales to ensure safety, accountability, and regulatory compliance. Conventional federated tuning approaches for LLMs fail to address this need: they optimize primarily for answer correctness while neglecting rationale quality, leaving CoT capabilities dependent on models’ innate pre-training abilities. Moreover, existing methods for improving rationales typically rely on privacy-violating knowledge distillation from centralized models. Additionally, the communication overhead in traditional federated fine-tuning on LLMs remains substantial. We address this gap by proposing FedCoT, a novel framework specifically designed to enhance reasoning in federated settings. FedCoT leverages a lightweight chain-of-thought enhancement mechanism: local models generate multiple reasoning paths, and a compact discriminator dynamically selects the most promising one. This approach improves reasoning accuracy and robustness while providing valuable interpretability, which is particularly critical for medical applications. To manage client heterogeneity efficiently, we adopt an improved aggregation approach building upon advanced LoRA module stacking, incorporating client classifier-awareness to achieve noise-free aggregation across diverse clients. Comprehensive experiments on medical reasoning tasks demonstrate that FedCoT significantly boosts client-side reasoning performance under stringent resource budgets while fully preserving data privacy.
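
The client-side step described in the abstract, sample several reasoning paths and let a compact discriminator pick one, can be sketched as follows; both models are stubs here, whereas the real system uses trained networks:

```python
# Sketch of FedCoT's client-side selection (stubs only, not the paper's code).
import random

def generate_paths(question: str, n: int = 4) -> list[str]:
    """Stand-in for the local model sampling n chain-of-thought paths."""
    return [f"path {i} for: {question}" for i in range(n)]

def discriminator_score(path: str) -> float:
    """Stand-in for the compact discriminator's quality score."""
    return random.random()

def answer(question: str) -> str:
    paths = generate_paths(question)
    return max(paths, key=discriminator_score)   # keep most promising path

random.seed(0)
print(answer("What dose is appropriate for drug X?"))
```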

[27] LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients

Egor Fadeev, Dzhambulat Mollaev, Aleksei Shestov, Dima Korolev, Omar Zoloev, Ivan Kireev, Andrey Savchenko, Maksim Makarenko

Main category: cs.CL

TL;DR: LATTE is a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs, reducing computational costs and improving performance for financial applications.

DetailsMotivation: Direct use of LLMs on long event sequences is computationally expensive and impractical, necessitating a more efficient method.

Method: LATTE uses contrastive learning to align raw event embeddings with LLM semantic embeddings, summarizing behavioral features into short prompts for supervision.

Result: LATTE outperforms state-of-the-art techniques, reduces inference cost, and is deployable in latency-sensitive environments.

Conclusion: The proposed framework offers a practical and efficient solution for learning event sequence representations in financial applications.

Abstract: Learning client embeddings from sequences of their historic communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of the complete sequence by an LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.
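
The alignment step can be illustrated with a standard symmetric InfoNCE loss over paired event and prompt embeddings; the loss form, dimensions, and temperature are assumptions for illustration, since the abstract only states that a contrastive loss is used:

```python
# Minimal contrastive alignment in the spirit of LATTE: align event-sequence
# embeddings with frozen-LLM prompt embeddings via a symmetric InfoNCE loss.
import torch
import torch.nn.functional as F

def contrastive_loss(event_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """event_emb, text_emb: (batch, dim); row i of each is a positive pair."""
    event_emb = F.normalize(event_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = event_emb @ text_emb.T / temperature   # (batch, batch) similarities
    targets = torch.arange(len(logits))
    # Symmetric InfoNCE: events -> texts and texts -> events.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))
print(loss.item())
```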

[28] Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control

Yuanchang Ye

Main category: cs.CL

TL;DR: The paper introduces a conformal prediction framework enhanced with significance testing to improve LLM reliability in MCQA, reducing hallucination and factual inaccuracies.

DetailsMotivation: LLMs in MCQA suffer from hallucination and nonfactual responses, compromising reliability. Existing methods like CP and significance testing lack integration.

Method: Integrates p-value computation with conformity scoring via self-consistency resampling, using option frequencies and null hypothesis testing.

Result: Achieves user-specified miscoverage rates and shows APSS decreases with higher risk levels, validating it as an uncertainty metric.

Conclusion: Provides a principled statistical framework for trustworthy LLM deployment in high-stakes QA.

Abstract: This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). While LLMs have been increasingly deployed in disciplinary QA scenarios, hallucination and nonfactual generation substantially compromise response reliability. Although CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor, their synergistic integration remains unexplored. To mitigate hallucination and factual inaccuracies, our framework integrates $p$-value computation with conformity scoring through self-consistency resampling of MCQA responses. This approach calculates option frequencies to address LLMs’ black-box nature, subsequently constructing prediction sets via null hypothesis testing ($\mathcal{H}_0$) with empirically derived $p$-values. Evaluations on MMLU and MMLU-Pro benchmarks using off-the-shelf LLMs demonstrate: (1) The enhanced CP achieves user-specified empirical miscoverage rates; (2) Test-set average prediction set size (APSS) decreases monotonically with increasing risk levels ($\alpha$), validating APSS as an effective uncertainty metric. This work establishes a principled statistical framework for trustworthy LLM deployment in high-stakes QA applications.
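
A simplified sketch of the frequency-based construction: resampled answers give per-option frequencies, which act as conformity scores for building a prediction set. The paper derives calibrated p-values with coverage guarantees; the fixed threshold below is a simplification:

```python
# Frequency-based conformity from self-consistency resampling (simplified;
# the paper's calibrated p-value construction is more involved).
import numpy as np

def option_conformity(samples: list[str], options: list[str]) -> dict[str, float]:
    """Conformity score per option: its frequency among resampled answers."""
    counts = np.array([samples.count(o) for o in options], dtype=float)
    freqs = counts / counts.sum()
    return dict(zip(options, freqs))

def prediction_set(scores: dict[str, float], alpha: float = 0.1) -> set[str]:
    """Keep options whose conformity exceeds the risk level alpha."""
    return {o for o, s in scores.items() if s > alpha}

samples = ["A"] * 12 + ["B"] * 6 + ["C"] * 2          # 20 resampled answers
print(prediction_set(option_conformity(samples, ["A", "B", "C", "D"])))
# -> {'A', 'B'}
```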

[29] RTTC: Reward-Guided Collaborative Test-Time Compute

J. Pablo Muñoz, Jinjie Yuan

Main category: cs.CL

TL;DR: RTTC adaptively selects the best TTC strategy per query using a reward model, improving accuracy while reducing computational overhead.

DetailsMotivation: Current TTC strategies like TTT and RAG are not query-adaptive, leading to unnecessary compute costs.

Method: RTTC uses a reward model to choose TTC strategies, operates in a server-client setup, and employs Query-State Caching.

Result: RTTC outperforms vanilla RAG or TTT in accuracy across diverse tasks and domains.

Conclusion: RTTC demonstrates the value of adaptive, reward-guided TTC selection for scalable, high-performance LLM adaptation.

Abstract: Test-Time Compute (TTC) has emerged as a powerful paradigm for enhancing the performance of Large Language Models (LLMs) at inference, leveraging strategies such as Test-Time Training (TTT) and Retrieval-Augmented Generation (RAG). However, the optimal adaptation strategy varies across queries, and indiscriminate application of a TTC strategy incurs substantial computational overhead. In this work, we introduce Reward-Guided Test-Time Compute (RTTC), a novel framework that adaptively selects the most effective TTC strategy for each query via a pretrained reward model, maximizing downstream accuracy across diverse domains and tasks. RTTC operates in a distributed server-client architecture, retrieving relevant samples from a remote knowledge base and applying RAG or lightweight fine-tuning on client devices only when necessary. To further mitigate redundant computation, we propose Query-State Caching, which enables the efficient reuse of historical query states at both retrieval and adaptation levels. Extensive experiments across multiple LLMs and benchmarks demonstrate that RTTC consistently achieves superior accuracy compared to vanilla RAG or TTT, validating the necessity of adaptive, reward-guided TTC selection and the potential of RTTC for scalable, high-performance language model adaptation.
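
In outline, the selection step reduces to an argmax of a reward model over candidate strategies, with caching to avoid recomputation; the stub reward model below is an assumption, not the paper's pretrained one:

```python
# Toy reward-guided strategy selection with query caching (illustrative).
from functools import lru_cache

STRATEGIES = ("none", "rag", "ttt")   # no adaptation, retrieval, test-time training

def reward_model(query: str, strategy: str) -> float:
    """Stand-in: a real pretrained reward model scores (query, strategy) pairs."""
    needs_facts = "who" in query.lower() or "when" in query.lower()
    return {"none": 0.3, "rag": 0.8 if needs_facts else 0.2, "ttt": 0.5}[strategy]

@lru_cache(maxsize=1024)              # crude analogue of Query-State Caching
def select_strategy(query: str) -> str:
    return max(STRATEGIES, key=lambda s: reward_model(query, s))

print(select_strategy("Who wrote On the Origin of Species?"))  # -> 'rag'
```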

[30] Detecting and explaining postpartum depression in real-time with generative artificial intelligence

Silvia García-Méndez, Francisco de Arriba-Pérez

Main category: cs.CL

TL;DR: An intelligent PPD screening system using NLP, ML, and LLMs for real-time, non-invasive speech analysis, achieving 90% accuracy.

DetailsMotivation: Address the need for rapid PPD detection and intervention using advanced technology to aid practitioners.

Method: Combines NLP, ML, and LLMs with interpretable tree-based algorithms for explainable predictions.

Result: Achieves 90% accuracy in PPD detection, outperforming existing solutions.

Conclusion: The system enables timely PPD detection and intervention, improving maternal mental health outcomes.

Abstract: Among the many challenges mothers undergo after childbirth, postpartum depression (PPD) is a severe condition that significantly impacts their mental and physical well-being. Consequently, the rapid detection of PPD and its associated risk factors is critical for timely assessment and intervention through specialized prevention procedures. Accordingly, this work addresses the need to help practitioners make decisions with the latest technological advancements to enable real-time screening and treatment recommendations. Specifically, our work contributes an intelligent PPD screening system that combines Natural Language Processing, Machine Learning (ML), and Large Language Models (LLMs) towards an affordable, real-time, and non-invasive free speech analysis. Moreover, it addresses the black box problem since the predictions are described to the end users thanks to the combination of LLMs with interpretable ML models (i.e., tree-based algorithms) using feature importance and natural language. The results obtained are 90% on PPD detection for all evaluation metrics, outperforming the competing solutions in the literature. Ultimately, our solution contributes to the rapid detection of PPD and its associated risk factors, critical for timely and proper assessment and intervention.

[31] SABER: Switchable and Balanced Training for Efficient LLM Reasoning

Kai Zhao, Yanjun Zhao, Jiaming Song, Shien He, Lusheng Zhang, Qiang Zhang, Tianjiao Li

Main category: cs.CL

TL;DR: SABER is a reinforcement learning framework for LLMs that enables controllable, token-budgeted reasoning, improving efficiency and accuracy.

DetailsMotivation: LLMs with chain-of-thought reasoning are costly and slow when uniformly applied; SABER addresses this by allowing user-controllable reasoning budgets.

Method: SABER profiles training examples by token usage, assigns budget tiers, and fine-tunes with system prompts and length-aware rewards. It includes no-think examples and supports four inference modes for flexibility.

Result: SABER reduces reasoning length by 65.4% and improves accuracy by 3.6% on MATH, while maintaining performance across tasks like GSM8K, MBPP, and LiveBench-Reasoning.

Conclusion: SABER offers efficient, adaptable reasoning for LLMs, balancing latency and depth while generalizing well across domains and scales.

Abstract: Large language models (LLMs) empowered by chain-of-thought reasoning have achieved impressive accuracy on complex tasks but suffer from excessive inference costs and latency when applied uniformly to all problems. We propose SABER (Switchable and Balanced Training for Efficient LLM Reasoning), a reinforcement learning framework that endows LLMs with user-controllable, token-budgeted reasoning. SABER first profiles each training example’s base-model thinking token usage and assigns it to one of the predefined budget tiers. During fine-tuning, the model is guided by system prompts and length-aware rewards to respect its assigned budget. In parallel, we incorporate no-think examples to ensure the model remains reliable even when explicit reasoning is turned off. SABER further supports four discrete inference modes (NoThink, FastThink, CoreThink, and DeepThink), enabling flexible trade-offs between latency and reasoning depth. Extensive evaluations on math reasoning (MATH, GSM8K), code generation (MBPP), and logical reasoning (LiveBench-Reasoning) demonstrate that SABER achieves high accuracy under tight budgets, graceful degradation, and effective cross-scale and cross-domain generalization. In particular, SABER-FastThink cuts reasoning length by 65.4% and yields a 3.6% accuracy gain compared with the base model on the MATH benchmark.
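
A length-aware reward of the kind described might look like the following sketch; the penalty form and coefficient are assumptions, not the paper's exact reward:

```python
# Illustrative length-aware reward in the spirit of SABER: correct answers
# are rewarded, with a penalty once the thinking length exceeds the
# example's assigned budget tier. Coefficients are assumptions.
def length_aware_reward(correct: bool, n_think_tokens: int, budget: int,
                        penalty: float = 0.001) -> float:
    base = 1.0 if correct else 0.0
    overshoot = max(0, n_think_tokens - budget)
    return base - penalty * overshoot

print(length_aware_reward(True, n_think_tokens=900, budget=512))  # ~0.612
```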

[32] LLMCARE: Alzheimer’s Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data

Ali Zolnour, Hossein Azadmaleki, Yasaman Haghbin, Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sina Rashidi, Masoud Khani, AmirSajjad Taleban, Samin Mahdizadeh Sani, Maryam Dadkhah, James M. Noble, Suzanne Bakken, Yadollah Yaghoobzadeh, Abdol-Hossein Vahabie, Masoud Rouhizadeh, Maryam Zolnoori

Main category: cs.CL

TL;DR: The paper proposes a speech-based NLP pipeline combining transformer embeddings and linguistic features to detect Alzheimer’s disease, using synthetic speech for data augmentation and evaluating multimodal models.

DetailsMotivation: Over half of Alzheimer's cases remain undiagnosed, and speech-based NLP offers a scalable solution for early detection.

Method: Developed a fusion model of transformer embeddings and linguistic features, tested data augmentation with synthetic speech, and benchmarked unimodal/multimodal LLM classifiers.

Result: The fusion model achieved F1=83.3 (AUC=89.5), outperforming baselines. Synthetic speech augmentation improved performance (F1=85.7). Multimodal models lagged (e.g., GPT-4o F1=70.2).

Conclusion: Combining transformer and linguistic features enhances ADRD detection. LLMs aid classification and augmentation, but multimodal models need improvement.

Abstract: Alzheimer’s disease and related dementias (ADRD) affect approximately five million older adults in the U.S., yet over half remain undiagnosed. Speech-based natural language processing (NLP) offers a promising, scalable approach to detect early cognitive decline through linguistic markers. This study develops and evaluates a screening pipeline that (i) fuses transformer embeddings with handcrafted linguistic features, (ii) tests data augmentation using synthetic speech generated by large language models (LLMs), and (iii) benchmarks unimodal and multimodal LLM classifiers for ADRD detection. Transcripts from the DementiaBank “cookie-theft” task (n = 237) were used. Ten transformer models were evaluated under three fine-tuning strategies. A fusion model combined embeddings from the top-performing transformer with 110 lexical-derived linguistic features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech, which was used to augment training data. Three multimodal models (GPT-4o, Qwen-Omni, Phi-4) were tested for speech-text classification in zero-shot and fine-tuned settings. The fusion model achieved F1 = 83.3 (AUC = 89.5), outperforming linguistic or transformer-only baselines. Augmenting training data with 2x MedAlpaca-7B synthetic speech increased F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers (e.g., MedAlpaca: F1 = 47.3 -> 78.5). Current multimodal models demonstrated lower performance (GPT-4o = 70.2 F1; Qwen = 66.0). Performance gains aligned with the distributional similarity between synthetic and real speech. Integrating transformer embeddings with linguistic features enhances ADRD detection from speech. Clinically tuned LLMs effectively support both classification and data augmentation, while further advancement is needed in multimodal modeling.

[33] PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs

Xiao Fu, Hossein A. Rahmani, Bin Wu, Jerome Ramos, Emine Yilmaz, Aldo Lipani

Main category: cs.CL

TL;DR: PREF is a personalised reference-free evaluation framework for text generation, measuring general quality and user alignment without gold references, outperforming baselines in accuracy and human alignment.

DetailsMotivation: Current evaluation methods for personalised text generation often ignore user individuality, necessitating a framework like PREF for robust and user-aligned assessment.

Method: PREF uses a three-step pipeline: (1) generating universal guidelines with an LLM, (2) personalising them using user profiles/preferences, and (3) scoring outputs against the rubric with an LLM judge.

Result: PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than baselines on the PrefEval benchmark.

Conclusion: PREF enables scalable, interpretable, and user-aligned evaluation, advancing reliable development of personalised language generation systems.

Abstract: Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce \textbf{PREF}, a \textbf{P}ersonalised \textbf{R}eference-free \textbf{E}valuation \textbf{F}ramework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user’s profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.

[34] Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao Xu, Ningyu Zhang, Bo Lin, Meng Han

Main category: cs.CL

TL;DR: LFJ is a jailbreak attack method for LLMs that interpolates hidden states of harmful and benign queries, achieving high success rates. A defense via adversarial training reduces its effectiveness.

DetailsMotivation: LLMs are vulnerable to jailbreak attacks bypassing safety measures, necessitating robust attack and defense methods.

Method: LFJ selects similar query pairs, interpolates hidden states at key layers/tokens, and optimizes for attack success, fluency, and efficiency.

Result: LFJ achieves 94.01% ASR, outperforming existing methods. Adversarial training reduces ASR by over 80%.

Conclusion: LFJ is highly effective; adversarial training mitigates it without harming benign performance, highlighting the importance of query selection and interpolation.

Abstract: Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ’s effectiveness.
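
The core interpolation step reduces to a convex blend of hidden states at an influential layer; the blending weight and layer choice below are illustrative, whereas the paper selects them via gradient guidance and optimization:

```python
# Core of the interpolation idea in LFJ, reduced to one line of tensor math:
# blend hidden states of a harmful and a benign query at a chosen layer.
import torch

def interpolate_hidden(h_harmful: torch.Tensor,
                       h_benign: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """h_*: (seq_len, hidden_dim) hidden states at one influential layer."""
    return alpha * h_harmful + (1.0 - alpha) * h_benign

blended = interpolate_hidden(torch.randn(16, 4096), torch.randn(16, 4096))
print(blended.shape)  # torch.Size([16, 4096])
```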

[35] Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models

Saaduddin Mahmud, Mason Nakamura, Kyle H. Wray, Shlomo Zilberstein

Main category: cs.CL

TL;DR: The paper introduces IAPO, a framework for jointly optimizing prompts and inference strategies, addressing the gap between prompt optimization and inference scaling.

DetailsMotivation: Existing prompt optimization methods ignore inference strategies, despite their interdependence and impact on alignment and performance. User preferences for trade-offs and budgets further complicate this.

Method: The authors propose IAPO, a unified framework for joint optimization of prompts and inference scales, and develop PSST, a fixed-budget training algorithm with finite-budget guarantees.

Result: PSST is evaluated on six tasks, showing improved alignment and performance in multi-objective text generation and reasoning.

Conclusion: Incorporating inference-awareness in prompt optimization is crucial for aligning black-box LLMs effectively.

Abstract: Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have also proven to enhance alignment and performance by trading off computation. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without regard to the inference strategy employed during deployment. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a unified novel framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, which we call PSST (Prompt Scaling via Sequential Trimming), and analyze finite-budget guarantees on error probability. Finally, we evaluate the effectiveness of PSST on six different tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness when aligning black-box LLMs through prompt optimization.

[36] The Cost of Thinking: Increased Jailbreak Risk in Large Language Models

Fan Yang

Main category: cs.CL

TL;DR: LLMs in thinking mode are more vulnerable to jailbreak attacks. A safe thinking intervention method reduces this vulnerability.

DetailsMotivation: To address the overlooked vulnerability of LLMs in thinking mode to jailbreak attacks, which poses risks in educational and harmful contexts.

Method: Proposes safe thinking intervention by adding specific thinking tokens to prompts to guide LLMs’ internal processes.

Result: The intervention significantly lowers the attack success rate on LLMs in thinking mode.

Conclusion: Safe thinking intervention effectively mitigates jailbreak attack risks in LLMs’ thinking mode.

Abstract: Thinking mode has always been regarded as one of the most valuable modes in LLMs. However, we uncover a surprising and previously overlooked phenomenon: LLMs with thinking mode are more easily broken by jailbreak attacks. We evaluate 9 LLMs on AdvBench and HarmBench and find that the success rate of attacks on thinking mode is almost always higher than on non-thinking mode. Through extensive sample studies, we find that “for educational purposes” framings and excessively long thinking lengths are characteristic of successfully attacked data, and that LLMs give harmful answers even when they largely recognize the questions as harmful. To alleviate these problems, this paper proposes a safe thinking intervention method for LLMs, which explicitly guides the internal thinking processes of LLMs by adding “specific thinking tokens” to the prompt. The results demonstrate that safe thinking intervention can significantly reduce the attack success rate of LLMs in thinking mode.

[37] Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion

Dong Zhao, Yadong Wang, Xiang Chen, Chenxi Wang, Hongliang Dai, Chuanxing Geng, Shengzhong Zhang, Shaoyuan Li, Sheng-Jun Huang

Main category: cs.CL

TL;DR: APIE introduces an active prompting framework for LLMs in IE tasks, using introspective confusion to select challenging examples, improving accuracy and robustness.

DetailsMotivation: Current selection strategies for in-context examples in LLMs overlook model confusion in format and content, limiting IE performance.

Method: APIE uses a dual-component uncertainty metric (Format and Content Uncertainty) to rank and select informative few-shot examples.

Result: APIE outperforms baselines on four benchmarks, enhancing extraction accuracy and robustness.

Conclusion: A fine-grained view of model uncertainty is crucial for effective structured generation systems.

Abstract: Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
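
One plausible instantiation of the dual-component score, computed from repeated samples of an extractor: format uncertainty as the parse-failure rate and content uncertainty as disagreement among parsed answers. The combination rule here is an assumption, not the paper's exact metric:

```python
# Sketch of a dual-component uncertainty score in the spirit of APIE
# (assumed instantiation; the paper's exact metric may differ).
import json
from collections import Counter

def _parses(o: str) -> bool:
    try:
        json.loads(o)
        return True
    except json.JSONDecodeError:
        return False

def format_uncertainty(outputs: list[str]) -> float:
    """Fraction of samples that fail to parse as the required structure."""
    return sum(1 for o in outputs if not _parses(o)) / len(outputs)

def content_uncertainty(outputs: list[str]) -> float:
    """1 - frequency of the modal parsed answer (content inconsistency)."""
    parsed = [o for o in outputs if _parses(o)]
    if not parsed:
        return 1.0
    top = Counter(parsed).most_common(1)[0][1]
    return 1.0 - top / len(parsed)

samples = ['{"entity": "Paris"}', '{"entity": "Paris"}',
           'Paris', '{"entity": "Lyon"}']
score = format_uncertainty(samples) + content_uncertainty(samples)
print(round(score, 3))  # 0.25 format + 0.333 content = 0.583
```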

[38] mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning

Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen

Main category: cs.CL

TL;DR: The paper introduces mSCoRe, a benchmark for evaluating multilingual and skill-based commonsense reasoning in LLMs, highlighting current model limitations.

DetailsMotivation: To investigate how LLMs utilize human reasoning skills in multilingual commonsense reasoning, an underexplored area.

Method: Proposes mSCoRe with a reasoning skill taxonomy, data synthesis pipeline, and complexity scaling framework.

Result: mSCoRe is challenging for current LLMs, especially at higher complexity, revealing limitations in nuanced multilingual reasoning.

Conclusion: Future work should focus on improving multilingual commonsense reasoning in LLMs, guided by detailed analysis of reasoning processes.

Abstract: Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a \textbf{M}ultilingual and Scalable Benchmark for \textbf{S}kill-based \textbf{Co}mmonsense \textbf{Re}asoning (\textbf{mSCoRe}). Our benchmark incorporates three key components that are designed to systematically evaluate LLMs’ reasoning capabilities, including: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework allowing task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that \textbf{mSCoRe} remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide detailed analysis on the models’ reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.

[39] Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs

Kartikeya Badola, Jonathan Simon, Arian Hosseini, Sara Marie Mc Carthy, Tsendsuren Munkhdalai, Abhimanyu Goyal, Tomáš Kočiský, Shyam Upadhyay, Bahare Fatemi, Mehran Kazemi

Main category: cs.CL

TL;DR: A new benchmark evaluates LLMs’ ability to handle multi-turn dialogue, reasoning, and information-seeking tasks, revealing significant gaps in current models.

DetailsMotivation: LLMs struggle with nuanced, interactive real-world tasks, necessitating development of models that can handle multi-turn dialogue and reasoning with incomplete data.

Method: A novel benchmark with multi-turn tasks tests reasoning, dialogue, and information-seeking abilities, using deterministic scoring to avoid human bias.

Result: Evaluation shows major gaps in LLMs, with errors stemming from poor instruction following, reasoning, and planning.

Conclusion: The benchmark highlights LLM weaknesses in interactive scenarios and provides a foundation for future research to improve these capabilities.

Abstract: Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.

[40] LaajMeter: A Framework for LaaJ Evaluation

Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Avi Ziv

Main category: cs.CL

TL;DR: LaaJMeter is a simulation-based framework for meta-evaluating LLM-as-a-Judge (LaaJ) systems in domain-specific contexts, addressing challenges like scarce annotated data and unvalidated metrics.

DetailsMotivation: The motivation is to address the lack of validated metrics and thresholds for evaluating LaaJs in domain-specific tasks, where expert evaluation is costly and data is scarce.

Method: The method involves LaaJMeter, a framework that generates synthetic data to simulate virtual models and judges, enabling systematic analysis of evaluation metrics under controlled conditions.

Result: Results show LaaJMeter’s effectiveness in a code translation task, revealing metric sensitivity variations and the limitations of common metrics.

Conclusion: LaaJMeter offers a scalable, extensible solution for principled LaaJ evaluation in low-resource settings, enhancing trustworthy NLP evaluation.

Abstract: Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks, a paradigm known as LLM-as-a-Judge (LaaJ). While effective in general domains, LaaJs pose significant challenges in domain-specific contexts, where annotated data is scarce and expert evaluation is costly. In such cases, meta-evaluation is often performed using metrics that have not been validated for the specific domain in which they are applied. As a result, it becomes difficult to determine which metrics effectively identify LaaJ quality, and further, what threshold indicates sufficient evaluator performance. In this work, we introduce LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs. LaaJMeter enables engineers to generate synthetic data representing virtual models and judges, allowing systematic analysis of evaluation metrics under realistic conditions. This helps practitioners validate and refine LaaJs for specific evaluation tasks: they can test whether their metrics correctly distinguish between better and worse (virtual) LaaJs, and estimate appropriate thresholds for evaluator adequacy. We demonstrate the utility of LaaJMeter in a code translation task involving a legacy programming language, showing how different metrics vary in sensitivity to evaluator quality. Our results highlight the limitations of common metrics and the importance of principled metric selection. LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP.

[41] Estimating Machine Translation Difficulty

Lorenzo Proietti, Stefano Perrella, Vilém Zouhar, Roberto Navigli, Tom Kocmi

Main category: cs.CL

TL;DR: The paper introduces a method to estimate translation difficulty to improve machine translation evaluation and research, proposing new metrics and models (Sentinel-src) that outperform existing approaches.

DetailsMotivation: High-quality machine translations make it hard to distinguish between models and identify areas for improvement, necessitating tools to pinpoint challenging texts.

Method: Formalizes translation difficulty estimation, introduces a new metric, and evaluates baselines and novel approaches, including dedicated models (Sentinel-src).

Result: Sentinel-src models outperform heuristic-based and LLM-as-a-judge methods, and practical utility is shown by creating more challenging benchmarks.

Conclusion: Dedicated models like Sentinel-src-24 and Sentinel-src-25 effectively identify difficult texts, aiding in more discriminative evaluations and guiding future research.

Abstract: Machine translation systems have begun achieving near-perfect translations in some setups. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. Automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. We formalize the task of translation difficulty estimation, defining a text’s difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging machine translation benchmarks. Our results show that dedicated models (dubbed Sentinel-src) outperform both heuristic-based methods (e.g. word rarity or syntactic complexity) and LLM-as-a-judge approaches. We release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.
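
The paper's definition of difficulty as expected translation quality can be written compactly (notation assumed for illustration):

```latex
% Difficulty of source text x under a pool of MT systems S and a
% quality metric q (higher q = better translation):
\[
  d(x) \;=\; 1 \;-\; \mathbb{E}_{m \sim S}\big[\, q\big(x,\, m(x)\big) \,\big],
\]
% so the hardest texts are those whose translations are expected
% to score lowest across contemporary systems.
```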

[42] Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs

Wenlong Deng, Jiaming Zhang, Qi Zeng, Christos Thrampoulidis, Boying Gong, Xiaoxiao Li

Main category: cs.CL

TL;DR: For-Value is a scalable, forward-only framework for efficient data valuation in large models, eliminating costly gradient computations.

DetailsMotivation: Enhancing transparency and accountability in large language and vision-language models by quantifying individual training sample influence.

Method: For-Value uses a closed-form expression based on a single forward pass, leveraging modern foundation models’ representations.

Result: Matches or outperforms gradient-based methods in identifying impactful examples and detecting mislabeled data.

Conclusion: For-Value provides a practical, efficient solution for data valuation in large-scale models.

Abstract: Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.
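
The abstract says influence is captured by alignment in hidden representations and prediction errors between training and validation samples; the product form below is a hedged guess at such a closed form, shown for illustration only:

```python
# Hedged sketch of a forward-only influence score; the exact closed-form
# expression in the paper may differ from this simplification.
import numpy as np

def forward_only_influence(h_train: np.ndarray, e_train: np.ndarray,
                           h_val: np.ndarray, e_val: np.ndarray) -> float:
    """h_*: hidden representations; e_*: prediction-error vectors."""
    rep_alignment = float(h_train @ h_val)   # agreement of representations
    err_alignment = float(e_train @ e_val)   # agreement of errors
    return rep_alignment * err_alignment

rng = np.random.default_rng(0)
h_t, h_v = rng.normal(size=64), rng.normal(size=64)
e_t, e_v = rng.normal(size=10), rng.normal(size=10)
print(forward_only_influence(h_t, e_t, h_v, e_v))
```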

[43] PakBBQ: A Culturally Adapted Bias Benchmark for QA

Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza

Main category: cs.CL

TL;DR: PakBBQ is a culturally adapted dataset for evaluating bias in LLMs, focusing on Pakistan. It shows disambiguation improves accuracy, Urdu reduces bias more than English, and negative framing mitigates stereotypes.

DetailsMotivation: Address the lack of fairness evaluation in LLMs for low-resource languages and regional contexts, particularly in Pakistan.

Method: Introduce PakBBQ, a dataset with 214 templates and 17180 QA pairs in English and Urdu, covering 8 bias dimensions. Evaluate multilingual LLMs under ambiguous/disambiguated contexts and negative/non-negative framing.

Result: Disambiguation improves accuracy by 12%, Urdu shows stronger counter-bias than English, and negative framing reduces stereotypes.

Conclusion: Contextualized benchmarks and prompt engineering are crucial for bias mitigation in low-resource settings.

Abstract: With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates and 17,180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions including age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality that are relevant in Pakistan. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behaviors in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low-resource settings.

[44] Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models

Igor Halperin

Main category: cs.CL

TL;DR: The paper introduces Semantic Divergence Metrics (SDM), a lightweight framework for detecting Faithfulness Hallucinations in LLMs by measuring semantic divergence between prompts and responses.

DetailsMotivation: Addressing hallucinations in LLMs, particularly confabulations, where responses are arbitrary and misaligned with user queries.

Method: Uses joint clustering on embeddings to create a shared topic space, computes information-theoretic metrics (e.g., Jensen-Shannon divergence, Wasserstein distance) to quantify divergence, and introduces the Semantic Box for classification.

Result: SDM improves detection of hallucinations by being prompt-aware and measuring consistency across paraphrased prompts. The Semantic Box aids in classifying LLM response types.

Conclusion: SDM provides a robust framework for detecting and classifying hallucinations in LLMs, enhancing reliability and interpretability of model outputs.

Abstract: The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations – events of severe deviation of LLM responses from input contexts. We focus on a specific implementation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user’s query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically-equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness hallucination. Furthermore, we identify the KL divergence KL(Answer $||$ Prompt) as a powerful indicator of \textbf{Semantic Exploration}, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.
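
The two divergences entering $\mathcal{S}_H$ can be computed directly over prompt and answer topic distributions; the equal weighting below is an assumption, since the abstract does not give the combination coefficients:

```python
# Concrete computation of the two divergences SDM combines, over topic
# distributions for a prompt and a response (weights are assumed).
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

prompt_topics = np.array([0.6, 0.3, 0.1])   # topic distribution of the prompt
answer_topics = np.array([0.1, 0.2, 0.7])   # topic distribution of the answer

jsd = jensenshannon(prompt_topics, answer_topics) ** 2   # JS divergence
wd = wasserstein_distance([0, 1, 2], [0, 1, 2],
                          u_weights=prompt_topics, v_weights=answer_topics)
s_h = 0.5 * jsd + 0.5 * wd   # assumed equal-weight combination
print(round(jsd, 3), round(wd, 3), round(s_h, 3))
```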

[45] Understanding Textual Emotion Through Emoji Prediction

Ethan Gordon, Nishank Kuppa, Rigved Tummala, Sriram Anasuri

Main category: cs.CL

TL;DR: The paper compares four deep learning models (feed-forward, CNN, transformer, BERT) for emoji prediction, highlighting BERT’s overall performance and CNN’s strength with rare emojis.

DetailsMotivation: To improve emoji prediction from short text sequences for better human-computer interaction.

Method: Uses four architectures (feed-forward, CNN, transformer, BERT) on the TweetEval dataset, addressing class imbalance with focal loss and regularization.

Result: BERT performs best overall, while CNN excels with rare emoji classes.

Conclusion: Architecture selection and hyperparameter tuning are crucial for sentiment-aware emoji prediction, enhancing interaction.

Abstract: This project explores emoji prediction from short text sequences using four deep learning architectures: a feed-forward network, CNN, transformer, and BERT. Using the TweetEval dataset, we address class imbalance through focal loss and regularization techniques. Results show BERT achieves the highest overall performance due to its pre-training advantage, while CNN demonstrates superior efficacy on rare emoji classes. This research shows the importance of architecture selection and hyperparameter tuning for sentiment-aware emoji prediction, contributing to improved human-computer interaction.
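
The focal loss used for class imbalance is standard: FL(p_t) = -(1 - p_t)^gamma log(p_t), which down-weights easy, well-classified examples. A common PyTorch formulation, not the project's own code:

```python
# Standard multi-class focal loss (common formulation, shown for reference).
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """logits: (batch, n_classes); targets: (batch,) class indices."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    # (1 - pt)^gamma shrinks the loss on confident, correct predictions.
    return (-(1 - pt) ** gamma * log_pt).mean()

loss = focal_loss(torch.randn(4, 20), torch.randint(0, 20, (4,)))
print(loss.item())
```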

[46] Using Large Language Models to Measure Symptom Severity in Patients At Risk for Schizophrenia

Andrew X. Chen, Guillermo Horga, Sean Escola

Main category: cs.CL

TL;DR: Large language models (LLMs) predict BPRS scores from clinical interview transcripts in CHR patients, showing high concordance and reliability, even in foreign languages and longitudinal contexts.

DetailsMotivation: The BPRS is underused in clinical practice due to its lengthy structured interview requirement. LLMs offer a potential solution to streamline and standardize symptom assessment.

Method: LLMs were used to predict BPRS scores from unstructured clinical interview transcripts in 409 CHR patients from the AMP-SCZ cohort, evaluating zero-shot, foreign language, and longitudinal performance.

Result: High concordance (median 0.84) and ICC (0.73) were achieved, comparable to human reliability. Performance remained strong in foreign languages (concordance 0.88, ICC 0.70) and with longitudinal data.

Conclusion: LLMs can effectively predict BPRS scores, offering a scalable and standardized tool for monitoring CHR patients, even in diverse linguistic and longitudinal settings.

Abstract: Patients who are at clinical high risk (CHR) for schizophrenia need close monitoring of their symptoms to inform appropriate treatments. The Brief Psychiatric Rating Scale (BPRS) is a validated, commonly used research tool for measuring symptoms in patients with schizophrenia and other psychotic disorders; however, it is not commonly used in clinical practice as it requires a lengthy structured interview. Here, we utilize large language models (LLMs) to predict BPRS scores from clinical interview transcripts in 409 CHR patients from the Accelerating Medicines Partnership Schizophrenia (AMP-SCZ) cohort. Despite the interviews not being specifically structured to measure the BPRS, the zero-shot performance of the LLM predictions compared to the true assessment (median concordance: 0.84, ICC: 0.73) approaches human inter- and intra-rater reliability. We further demonstrate that LLMs have substantial potential to improve and standardize the assessment of CHR patients via their accuracy in assessing the BPRS in foreign languages (median concordance: 0.88, ICC: 0.70), and integrating longitudinal information in a one-shot or few-shot learning approach.

[47] A Computational Approach to Analyzing Language Change and Variation in the Constructed Language Toki Pona

Daniel Huang, Hyoun-A Joo

Main category: cs.CL

TL;DR: The study analyzes Toki Pona’s language change and variation using computational methods, finding sociolinguistic influences and natural evolution.

DetailsMotivation: To understand how Toki Pona, a constructed language, changes and varies like natural languages.

Method: Computational and corpus-based analysis of fluid word classes and transitivity across different corpora.

Result: Sociolinguistic factors influence Toki Pona similarly to natural languages, showing community-driven evolution.

Conclusion: Constructed languages like Toki Pona evolve naturally through community usage, resembling natural language processes.

Abstract: This study explores language change and variation in Toki Pona, a constructed language with approximately 120 core words. Taking a computational and corpus-based approach, the study examines features including fluid word classes and transitivity in order to investigate (1) changes in preferences of content words for different syntactic positions over time and (2) variation in usage across different corpora. The results suggest that sociolinguistic factors influence Toki Pona in the same way as natural languages, and that even constructed linguistic systems naturally evolve as communities use them.

[48] Inductive Bias Extraction and Matching for LLM Prompts

Christian M. Angel, Francis Ferraro

Main category: cs.CL

TL;DR: Using LLM outputs to refine prompts improves performance by aligning with the model’s inductive bias, boosting classification and ranking ratings.

DetailsMotivation: LLMs are sensitive to prompt wording, and their inductive bias affects performance. This paper explores leveraging LLM outputs to create better prompts.

Method: Inductive Bias Extraction and Matching: using LLM outputs as part of prompts to align wording with the model’s bias.

Result: Improvements of up to 19% in classification Likert ratings and 27% in ranking Likert ratings.

Conclusion: Refining prompts by matching LLM inductive bias significantly enhances performance in classification and ranking tasks.

Abstract: The active research topic of prompt engineering makes it evident that LLMs are sensitive to small changes in prompt wording. A portion of this can be ascribed to the inductive bias that is present in the LLM. By using an LLM’s output as a portion of its prompt, we can more easily create satisfactory wording for prompts. This has the effect of creating a prompt that matches the inductive bias in the model. Empirically, we show that using this Inductive Bias Extraction and Matching strategy improves LLM Likert ratings used for classification by up to 19% and LLM Likert ratings used for ranking by up to 27%.
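
The strategy can be sketched as a two-step prompting loop: first elicit the model's own wording of the task, then reuse that wording verbatim in the working prompt. The stub `llm` function and prompt texts below are hypothetical:

```python
# Two-step sketch of Inductive Bias Extraction and Matching (illustrative).
def llm(prompt: str) -> str:
    """Stub: replace with a real chat-completion call."""
    return "Rate how well the summary covers the key points, 1-5."

def build_matched_prompt(task_description: str, item: str) -> str:
    # Step 1: extract the model's own wording (its inductive bias).
    extracted = llm("In one sentence, restate this task in your own words: "
                    + task_description)
    # Step 2: reuse that wording verbatim inside the working prompt.
    return f"{extracted}\nItem to rate: {item}\nAnswer with a number 1-5."

print(build_matched_prompt("Score summary quality on a Likert scale",
                           "The cat sat on the mat."))
```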

[49] Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race

Gustavo Bonil, Simone Hashiguti, Jhessica Silva, João Gondim, Helena Maia, Nádia Silva, Helio Pedrini, Sandra Avila

Main category: cs.CL

TL;DR: The study proposes a qualitative framework to detect biases in LLMs, revealing racial and gender biases in generated stories and highlighting limitations in current bias correction methods.

DetailsMotivation: To address the limitations of quantitative bias detection methods in LLMs and uncover nuanced biases in natural language outputs.

Method: Manual qualitative analysis of LLM-generated short stories featuring Black and white women to investigate gender and racial biases.

Result: Black women were tied to ancestry and resistance, while white women were portrayed in self-discovery, reflecting essentialization. Bias corrections were superficial.

Conclusion: Qualitative methods are crucial for identifying and mitigating biases in LLMs, emphasizing the need for interdisciplinary, ethical AI development.

Abstract: With the advance of Artificial Intelligence (AI), Large Language Models (LLMs) have gained prominence and been applied in diverse contexts. As they evolve into more sophisticated versions, it is essential to assess whether they reproduce biases, such as discrimination and racialization, while maintaining hegemonic discourses. Current bias detection approaches rely mostly on quantitative, automated methods, which often overlook the nuanced ways in which biases emerge in natural language. This study proposes a qualitative, discursive framework to complement such methods. Through manual analysis of LLM-generated short stories featuring Black and white women, we investigate gender and racial biases. We contend that qualitative methods such as the one proposed here are fundamental to help both developers and users identify the precise ways in which biases manifest in LLM outputs, thus enabling better conditions to mitigate them. Results show that Black women are portrayed as tied to ancestry and resistance, while white women appear in self-discovery processes. These patterns reflect how language models replicate crystalized discursive representations, reinforcing essentialization and a sense of social immobility. When prompted to correct biases, models offered superficial revisions that maintained problematic meanings, revealing limitations in fostering inclusive narratives. Our results demonstrate the ideological functioning of algorithms and have significant implications for the ethical use and development of AI. The study reinforces the need for critical, interdisciplinary approaches to AI design and deployment, addressing how LLM-generated discourses reflect and perpetuate inequalities.

[50] ReviewRL: Towards Automated Scientific Review with RL

Sihang Zeng, Kai Tian, Kaiyan Zhang, Yuru wang, Junqi Gao, Runze Liu, Sa Yang, Jingxuan Li, Xinwei Long, Jiaheng Ma, Biqing Qi, Bowen Zhou

Main category: cs.CL

TL;DR: ReviewRL is a reinforcement learning framework for generating high-quality scientific paper reviews, outperforming existing methods by incorporating relevant literature, supervised fine-tuning, and a composite reward function.

DetailsMotivation: Addressing challenges in peer review like increasing submission volumes and reviewer fatigue, and improving automated review quality.

Method: Combines retrieval-augmented context generation, supervised fine-tuning, and reinforcement learning with a composite reward function.

Result: Outperforms existing methods in experiments on ICLR 2025 papers.

Conclusion: ReviewRL provides a foundational framework for RL-driven automatic critique generation in scientific discovery.

Abstract: Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback lacking the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released on GitHub.

[51] From Surface to Semantics: Semantic Structure Parsing for Table-Centric Document Analysis

Xuan Li, Jialiang Dong, Raymond Wong

Main category: cs.CL

TL;DR: DOTABLER is a framework for deep semantic parsing of tables in documents, improving context-consistent analysis and table retrieval.

DetailsMotivation: Existing methods lack deep semantic parsing of tables and their contextual associations, limiting advanced tasks like cross-paragraph data interpretation.

Method: DOTABLER uses a custom dataset and fine-tunes pre-trained models to identify semantic links between tables and context, enabling table-centric parsing and retrieval.

Result: Achieves over 90% Precision and F1 scores on real-world PDFs, outperforming models like GPT-4o.

Conclusion: DOTABLER excels in semantic table-context analysis and deep document parsing, offering superior performance.

Abstract: Documents are core carriers of information and knowledge, with broad applications in finance, healthcare, and scientific research. Tables, as the main medium for structured data, encapsulate key information and are among the most critical document components. Existing studies largely focus on surface-level tasks such as layout analysis, table detection, and data extraction, lacking deep semantic parsing of tables and their contextual associations. This limits advanced tasks like cross-paragraph data interpretation and context-consistent analysis. To address this, we propose DOTABLER, a table-centric semantic document parsing framework designed to uncover deep semantic links between tables and their context. DOTABLER leverages a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. Built on this semantic understanding, DOTABLER implements two core functionalities: table-centric document structure parsing and domain-specific table retrieval, delivering comprehensive table-anchored semantic analysis and precise extraction of semantically relevant tables. Evaluated on nearly 4,000 pages with over 1,000 tables from real-world PDFs, DOTABLER achieves over 90% Precision and F1 scores, demonstrating superior performance in table-context semantic analysis and deep document parsing compared to advanced models such as GPT-4o.

[52] Beyond Semantic Understanding: Preserving Collaborative Frequency Components in LLM-based Recommendation

Minhao Wang, Yunhang He, Cong Xu, Zhangchi Zhu, Wei Zhang

Main category: cs.CL

TL;DR: FreLLM4Rec balances semantic and collaborative signals in LLM-based recommenders using spectral techniques, improving performance by up to 8.00% in NDCG@10.

DetailsMotivation: LLM-based recommenders weaken collaborative signals, unlike traditional models, limiting their effectiveness.

Method: FreLLM4Rec uses a Global Graph Low-Pass Filter (G-LPF) and Temporal Frequency Modulation (TFM) to preserve collaborative signals.

Result: FreLLM4Rec mitigates signal attenuation and outperforms baselines by up to 8.00% in NDCG@10.

Conclusion: The approach provides insights into LLM processing of collaborative signals and improves LLM-based recommendation systems.

Abstract: Recommender systems in concert with Large Language Models (LLMs) present promising avenues for generating semantically-informed recommendations. However, LLM-based recommenders exhibit a tendency to overemphasize semantic correlations within users’ interaction history. When taking pretrained collaborative ID embeddings as input, LLM-based recommenders progressively weaken the inherent collaborative signals as the embeddings propagate through LLM backbones layer by layer, as opposed to traditional Transformer-based sequential models in which collaborative signals are typically preserved or even enhanced for state-of-the-art performance. To address this limitation, we introduce FreLLM4Rec, an approach designed to balance semantic and collaborative information from a spectral perspective. Item embeddings that incorporate both semantic and collaborative information are first purified using a Global Graph Low-Pass Filter (G-LPF) to preliminarily remove irrelevant high-frequency noise. Temporal Frequency Modulation (TFM) then actively preserves collaborative signal layer by layer. Note that the collaborative preservation capability of TFM is theoretically guaranteed by establishing a connection between the optimal but hard-to-implement local graph Fourier filters and the suboptimal yet computationally efficient frequency-domain filters. Extensive experiments on four benchmark datasets demonstrate that FreLLM4Rec successfully mitigates collaborative signal attenuation and achieves competitive performance, with improvements of up to 8.00% in NDCG@10 over the best baseline. Our findings provide insights into how LLMs process collaborative information and offer a principled approach for improving LLM-based recommendation systems.
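As a rough illustration of the purification step, the sketch below smooths item embeddings by propagating them over a symmetric-normalized item graph, which attenuates high graph frequencies; the paper's exact G-LPF formulation may differ.

```python
# Toy graph low-pass filter over item embeddings (illustrative, not the
# paper's exact G-LPF).
import numpy as np

def graph_low_pass(adj: np.ndarray, emb: np.ndarray, hops: int = 2) -> np.ndarray:
    """Smooth `emb` over the item graph, attenuating high-frequency noise."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    a_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    out = emb
    for _ in range(hops):
        out = a_norm @ out  # each hop averages over graph neighbors
    return out

# Toy usage: 4 items on a ring graph, 8-dimensional embeddings.
adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
emb = np.random.default_rng(0).normal(size=(4, 8))
smoothed = graph_low_pass(adj, emb)
```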

[53] Cross-Prompt Encoder for Low-Performing Languages

Beso Mikaberidze, Teimuraz Saghinadze, Simon Ostermann, Philipp Muller

Main category: cs.CL

TL;DR: Soft prompts, like the Cross-Prompt Encoder (XPE), improve multilingual task performance, especially for low-performing languages, by combining lightweight encoding and multi-source training. A hybrid Dual Soft Prompt mechanism further enhances adaptability.

DetailsMotivation: To explore the untapped potential of soft prompts for cross-language transfer and improve performance on low-performing languages in PEFT.

Method: Introduces XPE, a lightweight prompt encoder trained on diverse languages, and a Dual Soft Prompt mechanism combining encoder-based and standard soft prompts.

Result: XPE excels for low-performing languages, while the hybrid approach offers broader adaptability, as shown on the SIB-200 benchmark.

Conclusion: Soft prompts, particularly XPE and hybrid designs, are effective for multilingual adaptation, balancing shared structure and language-specific needs.

Abstract: Soft prompts have emerged as a powerful alternative to adapters in parameter-efficient fine-tuning (PEFT), enabling large language models (LLMs) to adapt to downstream tasks without architectural changes or parameter updates. While prior work has focused on stabilizing training via parameter interaction in small neural prompt encoders, their broader potential for transfer across languages remains unexplored. In this paper, we demonstrate that a prompt encoder can play a central role in improving performance on low-performing languages: those that achieve poor accuracy even under full-model fine-tuning. We introduce the Cross-Prompt Encoder (XPE), which combines a lightweight encoding architecture with multi-source training on typologically diverse languages, a design that enables the model to capture abstract and transferable patterns across languages. To complement XPE, we propose a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt. This hybrid design proves especially effective for target languages that benefit from both broadly shared structure and language-specific alignment. Experiments on the SIB-200 benchmark reveal a consistent trade-off: XPE is most effective for low-performing languages, while hybrid variants offer broader adaptability across multilingual settings.
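For intuition, a dual soft-prompt module along these lines can be sketched in a few lines of PyTorch; the dimensions and wiring below are illustrative assumptions, not the exact XPE architecture.

```python
# Sketch of a dual soft-prompt setup: an MLP prompt encoder over trainable
# seed vectors plus a directly trained soft prompt (illustrative only).
import torch
import torch.nn as nn

class DualSoftPrompt(nn.Module):
    def __init__(self, prompt_len: int = 8, hidden: int = 256, d_model: int = 768):
        super().__init__()
        # Encoder branch: seed vectors passed through a lightweight MLP,
        # intended to capture structure shared across training languages.
        self.seed = nn.Parameter(torch.randn(prompt_len, hidden))
        self.encoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, d_model)
        )
        # Direct branch: a standard soft prompt for language-specific alignment.
        self.direct = nn.Parameter(torch.randn(prompt_len, d_model))

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend both prompts to the (batch, seq, d_model) input embeddings.
        batch = token_embeds.size(0)
        encoded = self.encoder(self.seed).unsqueeze(0).expand(batch, -1, -1)
        direct = self.direct.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([encoded, direct, token_embeds], dim=1)

prompts = DualSoftPrompt()
x = torch.randn(2, 16, 768)   # a batch of input token embeddings
print(prompts(x).shape)       # torch.Size([2, 32, 768])
```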

[54] Making Qwen3 Think in Korean with Reinforcement Learning

Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee

Main category: cs.CL

TL;DR: A two-stage fine-tuning approach enhances Qwen3 14B’s Korean reasoning and problem-solving, using SFT and GRPO with an oracle judge for stability.

DetailsMotivation: To improve the Qwen3 14B model's native Korean reasoning and problem-solving abilities while maintaining general proficiency.

Method: 1. Supervised fine-tuning (SFT) on a Korean reasoning dataset. 2. Reinforcement learning with GRPO, stabilized by an oracle judge.

Result: Improved Korean reasoning, math, and coding performance, with stable training and internal Korean chain-of-thought.

Conclusion: The approach successfully enhances Korean reasoning and problem-solving without compromising general abilities.

Abstract: We present a two-stage fine-tuning approach to make the large language model Qwen3 14B “think” natively in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning, yielding notable improvements in Korean-language tasks and even some gains in general reasoning ability. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning alignment and overall problem-solving performance. We address critical stability challenges in GRPO training, such as reward hacking and policy collapse, by introducing an oracle judge model that calibrates the reward signal. Our approach achieves stable learning (avoiding the collapse observed in naive GRPO) and leads to steady, incremental performance gains. The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks (particularly math and coding tasks) while maintaining knowledge and language proficiency, successfully conducting its internal chain-of-thought entirely in Korean.

[55] Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models

Jakub Šmíd, Pavel Přibáň, Pavel Král

Main category: cs.CL

TL;DR: A novel sequence-to-sequence method for cross-lingual ABSA eliminates translation tools, improving performance by 10% and handling complex tasks. It outperforms English-centric LLMs.

DetailsMotivation: Addressing the limitations of current cross-lingual ABSA methods, which rely on translation tools and focus on simpler tasks, especially for low-resource languages.

Method: Proposes a sequence-to-sequence approach with constrained decoding to avoid translation tools, tested against multilingual and English-centric LLMs.

Result: Achieves up to 10% improvement in cross-lingual ABSA performance and handles more complex tasks effectively.

Conclusion: The method offers a practical, efficient alternative to translation-dependent techniques, broadening cross-lingual ABSA applicability.

Abstract: Aspect-based sentiment analysis (ABSA) has made significant strides, yet challenges remain for low-resource languages due to the predominant focus on English. Current cross-lingual ABSA studies often centre on simpler tasks and rely heavily on external translation tools. In this paper, we present a novel sequence-to-sequence method for compound ABSA tasks that eliminates the need for such tools. Our approach, which uses constrained decoding, improves cross-lingual ABSA performance by up to 10%. This method broadens the scope of cross-lingual ABSA, enabling it to handle more complex tasks and providing a practical, efficient alternative to translation-dependent techniques. Furthermore, we compare our approach with large language models (LLMs) and show that while fine-tuned multilingual LLMs can achieve comparable results, English-centric LLMs struggle with these tasks.
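Constrained decoding of this kind can be approximated with Hugging Face's `prefix_allowed_tokens_fn` hook, which restricts the tokens the decoder may emit at each step. The toy below limits generation to sentiment-label tokens; the paper's constraints cover full ABSA output schemas, and the model choice is an assumption for illustration.

```python
# Toy constrained decoding with transformers' prefix_allowed_tokens_fn.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

labels = ["positive", "negative", "neutral"]
allowed_ids = sorted({i for w in labels for i in tok(w, add_special_tokens=False).input_ids})
allowed_ids.append(tok.eos_token_id)

def allow_only_labels(batch_id: int, input_ids: torch.Tensor) -> list[int]:
    # Called at every decoding step; returns the permitted next-token ids.
    return allowed_ids

inputs = tok("The battery life is great but the screen is dim.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=4, prefix_allowed_tokens_fn=allow_only_labels)
print(tok.decode(out[0], skip_special_tokens=True))
```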

[56] Large Language Models for Summarizing Czech Historical Documents and Beyond

Václav Tran, Jakub Šmíd, Jiří Martínek, Ladislav Lenc, Pavel Král

Main category: cs.CL

TL;DR: The paper addresses Czech text summarization, focusing on historical documents, using models like Mistral and mT5, achieving state-of-the-art results and introducing a new dataset.

DetailsMotivation: Czech summarization, especially for historical texts, is underexplored due to linguistic challenges and lack of datasets.

Method: Employed large language models (Mistral, mT5) for Czech summarization.

Result: Achieved state-of-the-art results on SumeCzech and introduced a new dataset for historical Czech documents.

Conclusion: The work advances Czech summarization and opens research opportunities for historical text processing.

Abstract: Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. While summarization has been significantly explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results. Together, these contributions provide great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.

[57] Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding

Jakub Šmíd, Pavel Přibáň, Pavel Král

Main category: cs.CL

TL;DR: The paper introduces a constrained decoding method with sequence-to-sequence models for cross-lingual ABSA, improving performance by 5% and supporting multi-tasking. It outperforms state-of-the-art methods across seven languages and six tasks, while also evaluating LLMs in various scenarios.

DetailsMotivation: Addressing the neglect of low-resource languages in ABSA and the limitations of current cross-lingual approaches, which rely on unreliable translation tools and handle less complex tasks.

Method: Proposes constrained decoding with sequence-to-sequence models, eliminating translation tools and enabling multi-tasking. Evaluates across seven languages and six ABSA tasks, including LLMs in zero-shot, few-shot, and fine-tuning settings.

Result: Achieves 5% improvement in cross-lingual performance and over 10% boost in multi-tasking. Outperforms state-of-the-art methods and sets new benchmarks. LLMs perform poorly in zero-shot/few-shot but competitively when fine-tuned.

Conclusion: The study advances cross-lingual ABSA, offering practical recommendations and insights into the strengths and limitations of current approaches.

Abstract: While aspect-based sentiment analysis (ABSA) has made substantial progress, challenges remain for low-resource languages, which are often overlooked in favour of English. Current cross-lingual ABSA approaches focus on limited, less complex tasks and often rely on external translation tools. This paper introduces a novel approach using constrained decoding with sequence-to-sequence models, eliminating the need for unreliable translation tools and improving cross-lingual performance by 5% on average for the most complex task. The proposed method also supports multi-tasking, which enables solving multiple ABSA tasks with a single model, with constrained decoding boosting results by more than 10%. We evaluate our approach across seven languages and six ABSA tasks, surpassing state-of-the-art methods and setting new benchmarks for previously unexplored tasks. Additionally, we assess large language models (LLMs) in zero-shot, few-shot, and fine-tuning scenarios. While LLMs perform poorly in zero-shot and few-shot settings, fine-tuning achieves competitive results compared to smaller multilingual models, albeit at the cost of longer training and inference times. We provide practical recommendations for real-world applications, enhancing the understanding of cross-lingual ABSA methodologies. This study offers valuable insights into the strengths and limitations of cross-lingual ABSA approaches, advancing the state-of-the-art in this challenging research domain.

[58] Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li

Main category: cs.CL

TL;DR: A new black-box attack method, Sparse Feature Perturbation Framework (SFPF), uses sparse autoencoders to generate adversarial texts by perturbing critical features, evading defenses while maintaining malicious intent.

DetailsMotivation: To address the challenge of generating adversarial examples for LLMs to understand vulnerabilities and improve robustness.

Method: SFPF leverages sparse autoencoders to identify and perturb critical features in text, clustering high-activation features for selective perturbation.

Result: Adversarial texts bypass state-of-the-art defenses, revealing NLP system vulnerabilities, though effectiveness varies by prompt and layer.

Conclusion: SFPF offers a balanced red-teaming strategy but requires further validation for generalizability across architectures and larger models.

Abstract: With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method’s effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.
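A bare-bones version of the perturbation step might look like the following, assuming a trained sparse autoencoder's weights are available; the feature-selection and scaling details are illustrative, not SFPF's exact procedure.

```python
# Sketch of sparse-feature perturbation with an SAE (illustrative only).
import torch

def perturb_features(h, W_enc, b_enc, W_dec, feature_ids, scale=2.0):
    """Amplify selected SAE features of hidden state `h`, then reconstruct."""
    acts = torch.relu(h @ W_enc + b_enc)   # sparse feature activations
    acts[..., feature_ids] *= scale        # boost chosen high-activation features
    return acts @ W_dec                    # map back to the model's hidden space

d_model, d_feat = 512, 4096
W_enc = torch.randn(d_model, d_feat) / d_model ** 0.5
b_enc = torch.zeros(d_feat)
W_dec = torch.randn(d_feat, d_model) / d_feat ** 0.5
h = torch.randn(1, d_model)
# In SFPF, feature ids would come from clustering activations of
# successfully attacked texts; here they are arbitrary.
h_adv = perturb_features(h, W_enc, b_enc, W_dec, feature_ids=[3, 17, 42])
```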

[59] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu

Main category: cs.CL

TL;DR: The paper proposes MDH, a hybrid framework combining LLMs and human oversight for detecting malicious content in datasets, and introduces two new jailbreak strategies: D-Attack and DH-CoT.

DetailsMotivation: Existing methods for evaluating jailbreak attacks rely on labor-intensive manual annotation or inconsistent LLMs, necessitating a balanced approach.

Method: The MDH framework integrates LLM-based annotation with minimal human oversight for dataset cleaning and jailbreak detection. Two new strategies, D-Attack and DH-CoT, are introduced to exploit developer messages.

Result: MDH improves accuracy and efficiency in malicious content detection. The new strategies enhance jailbreak success rates.

Conclusion: The hybrid framework and novel strategies address limitations in current jailbreak evaluation methods, with practical tools and datasets made publicly available.

Abstract: Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), which have inconsistent accuracy in harmful types. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. The code, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.

[60] ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu

Main category: cs.CL

TL;DR: ComoRAG improves narrative comprehension in long stories by using iterative, dynamic retrieval and memory consolidation, outperforming traditional RAG methods by up to 11%.

DetailsMotivation: Traditional RAG methods struggle with dynamic, interconnected relations in long narratives due to their stateless, single-step retrieval.

Method: ComoRAG employs iterative reasoning cycles with a dynamic memory workspace, generating probing queries and integrating new evidence into a global memory pool.

Result: ComoRAG achieves up to 11% relative gains over baselines on long-context narrative benchmarks (200K+ tokens).

Conclusion: ComoRAG offers a cognitively motivated, stateful reasoning paradigm for retrieval-based long-context comprehension, excelling in complex queries.

Abstract: Narrative comprehension on long stories and novels has been a challenging domain, owing to their intricate plotlines and entangled, often evolving relations among characters and entities. Given LLMs’ diminished reasoning over extended context and high computational cost, retrieval-based approaches continue to play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long context comprehension towards stateful reasoning. Our code is publicly released at https://github.com/EternityJune25/ComoRAG
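The reasoning cycle can be summarized schematically as below, with hypothetical `retrieve`, `generate_probes`, and `try_answer` callables standing in for ComoRAG's components.

```python
# Schematic memory-organized retrieval loop (a sketch, not the released code).
def comorag_answer(question, retrieve, generate_probes, try_answer, max_cycles=3):
    memory_pool = []                                  # consolidated evidence
    for _ in range(max_cycles):
        answer, confident = try_answer(question, memory_pool)
        if confident:                                 # impasse resolved
            return answer
        for probe in generate_probes(question, memory_pool):
            memory_pool.extend(retrieve(probe))       # acquire new evidence
    return try_answer(question, memory_pool)[0]       # best effort after budget
```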

[61] Evaluating LLMs on Chinese Idiom Translation

Cai Yang, Yao Dou, David Heineman, Xiaofeng Wu, Wei Xu

Main category: cs.CL

TL;DR: The paper introduces IdiomEval, a framework for evaluating Chinese idiom translation, revealing poor performance of modern systems like GPT-4 and Google Translate, and proposes improved models for error detection.

DetailsMotivation: Chinese idioms are complex and often mistranslated by current machine translation systems, necessitating a dedicated evaluation framework.

Method: The authors create IdiomEval, annotate 900 translations from nine systems, and analyze errors. They also develop improved models for detecting idiom translation errors.

Result: Modern systems fail at idiom translation, with GPT-4 making errors in 28% of cases. Existing metrics poorly measure idiom quality (Pearson correlation <0.48). Improved models achieve F1=0.68 for error detection.

Conclusion: The study highlights the challenges in Chinese idiom translation and presents a framework and models to improve evaluation and error detection.

Abstract: Idioms, whose figurative meanings usually differ from their literal interpretations, are common in everyday language, especially in Chinese, where they often contain historical references and follow specific structural patterns. Despite recent progress in machine translation with large language models, little is known about Chinese idiom translation. In this work, we introduce IdiomEval, a framework with a comprehensive error taxonomy for Chinese idiom translation. We annotate 900 translation pairs from nine modern systems, including GPT-4o and Google Translate, across four domains: web, news, Wikipedia, and social media. We find these systems fail at idiom translation, producing incorrect, literal, partial, or even missing translations. The best-performing system, GPT-4, makes errors in 28% of cases. We also find that existing evaluation metrics measure idiom quality poorly, with Pearson correlation below 0.48 with human ratings. We thus develop improved models that achieve F1 scores of 0.68 for detecting idiom translation errors.

[62] Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints

Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh

Main category: cs.CL

TL;DR: A framework called ‘computational economics’ optimizes LLMs by treating them as resource-constrained economies, improving efficiency and interpretability.

DetailsMotivation: Address the high computational costs of LLMs by optimizing resource allocation.

Method: Introduce an incentive-driven training paradigm with a differentiable computation cost term to encourage sparse activations.

Result: Achieves a 40% reduction in FLOPS and lower latency while maintaining accuracy, outperforming post-hoc pruning.

Conclusion: Economic principles can guide the design of efficient, adaptive, and transparent LLMs under resource constraints.

Abstract: Large language models (LLMs) are limited by substantial computational cost. We introduce a “computational economics” framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks) that must allocate scarce computation to maximize task utility. First, we show empirically that when computation is scarce, standard LLMs reallocate attention toward high-value tokens while preserving accuracy. Building on this observation, we propose an incentive-driven training paradigm that augments the task loss with a differentiable computation cost term, encouraging sparse and efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method yields a family of models that trace a Pareto frontier and consistently dominate post-hoc pruning; for a similar accuracy we obtain roughly a forty percent reduction in FLOPS and lower latency, together with more interpretable attention patterns. These results indicate that economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints.
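The incentive-driven objective amounts to adding a differentiable cost term to the task loss. A minimal sketch, assuming the cost is approximated by an L1 penalty on intermediate activations (the paper's exact term may differ):

```python
# Sketch of an incentive-style training objective with a computation cost term.
import torch

def economic_loss(task_loss: torch.Tensor,
                  activations: list[torch.Tensor],
                  lam: float = 1e-4) -> torch.Tensor:
    # Charge the model for activation mass, encouraging sparse computation.
    compute_cost = sum(a.abs().mean() for a in activations)
    return task_loss + lam * compute_cost

# Toy usage with dummy tensors.
task_loss = torch.tensor(0.7)
acts = [torch.randn(4, 128), torch.randn(4, 128)]
print(economic_loss(task_loss, acts))
```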

[63] DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales

Herun Wan, Jiaying Wu, Minnan Luo, Xiangzheng Kong, Zihan Ma, Zhi Zeng

Main category: cs.CL

TL;DR: DiFaR is a framework that generates diverse, factual, and relevant rationales from LVLMs to improve misinformation detection, outperforming baselines by up to 5.9%.

DetailsMotivation: Current methods for generating rationales from LVLMs face challenges like lack of diversity, factual inaccuracies, and irrelevant content, limiting their effectiveness.

Method: DiFaR uses five chain-of-thought prompts to create varied reasoning traces and a post-hoc filtering module to select high-quality rationale sentences based on factuality and relevance.

Result: DiFaR outperforms baselines by up to 5.9% and enhances existing detectors by 8.7%, with improved rationale quality confirmed by metrics and human evaluations.

Conclusion: DiFaR effectively addresses the limitations of rationale generation, significantly boosting misinformation detection performance.

Abstract: Generating textual rationales from large vision-language models (LVLMs) to support trainable multimodal misinformation detectors has emerged as a promising paradigm. However, its effectiveness is fundamentally limited by three core challenges: (i) insufficient diversity in generated rationales, (ii) factual inaccuracies due to hallucinations, and (iii) irrelevant or conflicting content that introduces noise. We introduce DiFaR, a detector-agnostic framework that produces diverse, factual, and relevant rationales to enhance misinformation detection. DiFaR employs five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and incorporates a lightweight post-hoc filtering module to select rationale sentences based on sentence-level factuality and relevance scores. Extensive experiments on four popular benchmarks demonstrate that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by as much as 8.7%. Both automatic metrics and human evaluations confirm that DiFaR significantly improves rationale quality across all three dimensions.
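The post-hoc filtering module reduces to a simple predicate over sentence-level scores. A sketch, assuming factuality and relevance scorers are available as callables and with illustrative thresholds:

```python
# Sketch of post-hoc rationale filtering (thresholds are assumptions).
def filter_rationales(sentences, factuality, relevance,
                      fact_min=0.7, rel_min=0.5):
    """Keep rationale sentences that are both factual and on-topic."""
    return [s for s in sentences
            if factuality(s) >= fact_min and relevance(s) >= rel_min]

# Toy usage with stub scorers.
keep = filter_rationales(
    ["The photo shows the 2019 flood.", "Cats are cute."],
    factuality=lambda s: 0.9,
    relevance=lambda s: 0.8 if "flood" in s else 0.1)
print(keep)  # only the flood-related sentence survives
```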

[64] When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models

Huyu Wu, Meng Tang, Xinhan Zheng, Haiyun Jiang

Main category: cs.CL

TL;DR: The paper investigates text dominance in Multimodal Large Language Models (MLLMs), proposes metrics to measure it, identifies causes, and suggests a token compression method to mitigate the issue.

DetailsMotivation: Text dominance in MLLMs leads to underutilization of non-text modalities, limiting their potential. The study aims to systematically analyze and address this imbalance.

Method: The authors introduce two metrics (MDI and AEI) to measure text dominance and analyze its causes. They propose a token compression method to rebalance attention.

Result: Text dominance is pervasive across modalities. The proposed method reduces MDI significantly, e.g., from 10.23 to 0.86 in LLaVA-7B.

Conclusion: The study provides insights and tools to develop more balanced MLLMs, addressing text dominance through systematic analysis and practical solutions.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a diverse range of multimodal tasks. However, these models suffer from a core problem known as text dominance: they depend heavily on text for their inference, while underutilizing other modalities. Prior work has acknowledged this phenomenon in vision-language tasks, often attributing it to data biases or model architectures. In this paper, we conduct the first systematic investigation of text dominance across diverse data modalities, including images, videos, audio, time-series, and graphs. To measure this imbalance, we propose two evaluation metrics: the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI). Our comprehensive analysis reveals that text dominance is both significant and pervasive across all tested modalities. Our in-depth analysis identifies three underlying causes: attention dilution from severe token redundancy in non-textual modalities, the influence of fusion architecture design, and task formulations that implicitly favor textual inputs. Furthermore, we propose a simple token compression method that effectively rebalances model attention. Applying this method to LLaVA-7B, for instance, drastically reduces its MDI from 10.23 to a well-balanced value of 0.86. Our analysis and methodological framework offer a foundation for the development of more equitable and comprehensive multimodal language models.
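The abstract does not reproduce the MDI formula, but a dominance-style ratio can be sketched as the attention mass received by text tokens relative to non-text tokens; treat the definition below as an assumption, not the paper's metric.

```python
# Hedged sketch of a modality-dominance ratio (not the paper's exact MDI).
import torch

def modality_dominance(attn: torch.Tensor, is_text: torch.Tensor) -> float:
    """attn: (heads, query, key) attention weights; is_text: (key,) bool mask."""
    mass = attn.mean(dim=(0, 1))      # average attention mass per key token
    text = mass[is_text].mean()
    non_text = mass[~is_text].mean()
    return (text / non_text).item()   # 1.0 would indicate balanced attention

# Toy usage: 8 heads over 10 tokens, the first 4 of which are text.
attn = torch.softmax(torch.randn(8, 10, 10), dim=-1)
is_text = torch.tensor([True] * 4 + [False] * 6)
print(modality_dominance(attn, is_text))
```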

[65] When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing

Mahdi Dhaini, Stephen Meisenbacher, Ege Erdogan, Florian Matthes, Gjergji Kasneci

Main category: cs.CL

TL;DR: The paper explores the intersection of explainability and privacy in NLP, investigating whether they can coexist or conflict, using methods like Differential Privacy and post-hoc explainability.

DetailsMotivation: To address the gap in understanding the relationship between explainability and privacy in NLP, as research has largely treated them separately.

Method: Empirical investigation using Differential Privacy (DP) and post-hoc explainability methods, analyzing their interplay in NLP tasks.

Result: The study reveals a complex relationship between privacy and explainability, influenced by task nature and method choices, showing potential for coexistence.

Conclusion: The paper provides practical recommendations for future work, highlighting that privacy and explainability can coexist in NLP under certain conditions.

Abstract: In the study of trustworthy Natural Language Processing (NLP), a number of important research fields have emerged, including those of explainability and privacy. While research interest in both explainable and privacy-preserving NLP has increased considerably in recent years, there remains a lack of investigation at the intersection of the two. This leaves a considerable gap in understanding of whether achieving both explainability and privacy is possible, or whether the two are at odds with each other. In this work, we conduct an empirical investigation into the privacy-explainability trade-off in the context of NLP, guided by the popular overarching methods of Differential Privacy (DP) and post-hoc explainability. Our findings include a view into the intricate relationship between privacy and explainability, which is formed by a number of factors, including the nature of the downstream task and the choice of the text privatization and explainability methods. In this, we highlight the potential for privacy and explainability to co-exist, and we summarize our findings in a collection of practical recommendations for future work at this important intersection.

[66] eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM

Irma Heithoff, Marc Guggenberger, Sandra Kalogiannis, Susanne Mayer, Fabian Maag, Sigurd Schacht, Carsten Lanquillon

Main category: cs.CL

TL;DR: The paper explores the feasibility of deploying a European Deep Inference Fabric (eDIF) to support mechanistic interpretability research on large language models (LLMs), highlighting its technical performance, usability, and scientific utility.

DetailsMotivation: The initiative aims to democratize advanced LLM interpretability infrastructure in Europe, making it accessible to the research community.

Method: A GPU-based cluster hosted at Ansbach University, interconnected with partner institutions, was used for remote model inspection via the NNsight API. A pilot study with 16 researchers evaluated the platform.

Result: The study showed increased user engagement, stable performance, and positive feedback on remote experimentation. Limitations like slow data downloads were noted.

Conclusion: The project advances LLM interpretability infrastructure in Europe, setting the stage for broader deployment and community collaboration.

Abstract: This paper presents a feasibility study on the deployment of a European Deep Inference Fabric (eDIF), an NDIF-compatible infrastructure designed to support mechanistic interpretability research on large language models. The need for widespread accessibility of LLM interpretability infrastructure in Europe drives this initiative to democratize advanced model analysis capabilities for the research community. The project introduces a GPU-based cluster hosted at Ansbach University of Applied Sciences and interconnected with partner institutions, enabling remote model inspection via the NNsight API. A structured pilot study involving 16 researchers from across Europe evaluated the platform’s technical performance, usability, and scientific utility. Users conducted interventions such as activation patching, causal tracing, and representation analysis on models including GPT-2 and DeepSeek-R1-70B. The study revealed a gradual increase in user engagement, stable platform performance throughout, and a positive reception of the remote experimentation capabilities. It also marked the starting point for building a user community around the platform. Identified limitations such as prolonged download durations for activation data as well as intermittent execution interruptions are addressed in the roadmap for future development. This initiative marks a significant step towards widespread accessibility of LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration in mechanistic interpretability research.

[67] Neural Machine Translation for Coptic-French: Strategies for Low-Resource Ancient Languages

Nasma Chaoui, Richard Khoury

Main category: cs.CL

TL;DR: A study on translating Coptic to French, evaluating methods like pivot vs. direct translation, pre-training, multi-version fine-tuning, and noise robustness. Fine-tuning with varied and noise-aware data improves quality.

DetailsMotivation: To systematically study and improve translation strategies for Coptic, a historical language, into French, addressing gaps in existing methods.

Method: Evaluates pivot vs. direct translation, pre-training, multi-version fine-tuning, and noise robustness using aligned biblical corpora. Fine-tuning with varied and noise-aware data is key.

Result: Fine-tuning with stylistically-varied and noise-aware training data significantly enhances translation quality.

Conclusion: Provides practical insights for developing translation tools for historical languages, emphasizing the importance of diverse and robust training data.

Abstract: This paper presents the first systematic study of strategies for translating Coptic into French. Our comprehensive pipeline systematically evaluates: pivot versus direct translation, the impact of pre-training, the benefits of multi-version fine-tuning, and model robustness to noise. Utilizing aligned biblical corpora, we demonstrate that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality. Our findings provide crucial practical insights for developing translation tools for historical languages in general.

[68] Continuous Bangla Sign Language Translation: Mitigating the Expense of Gloss Annotation with the Assistance of Graph

Safaeid Hossain Arib, Rabeya Akter, Sejuti Rahman

Main category: cs.CL

TL;DR: The paper introduces a fusion of graph-based methods and transformer architecture for gloss-free sign language translation, achieving state-of-the-art results on multiple datasets.

DetailsMotivation: To address communication barriers for the deaf and hard of hearing by improving sign language translation methods, which are often underestimated in spoken-language-dominated societies.

Method: Integrates graph-based methods (STGCN-LSTM) with transformer architecture for gloss-free translation, exploring various fusion strategies.

Result: Achieves superior performance on datasets (RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, BornilDB v1.0) with notable BLEU-4 score improvements.

Conclusion: The method sets a benchmark for future research, highlighting the importance of gloss-free translation for better communication accessibility.

Abstract: Millions of individuals worldwide are affected by deafness and hearing impairment. Sign language serves as a sophisticated means of communication for the deaf and hard of hearing. However, in societies that prioritize spoken languages, sign language often faces underestimation, leading to communication barriers and social exclusion. The Continuous Bangla Sign Language Translation project aims to address this gap by enhancing translation methods. While recent approaches leverage transformer architecture for state-of-the-art results, our method integrates graph-based methods with the transformer architecture. This fusion, combining transformer and STGCN-LSTM architectures, proves more effective in gloss-free translation. Our contributions include architectural fusion, exploring various fusion strategies, and achieving a new state-of-the-art performance on diverse sign language datasets, namely RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0. Our approach demonstrates superior performance compared to current translation outcomes across all datasets, showcasing notable BLEU-4 score improvements of 4.01, 2.07, and 0.5 over GASLT, GASLT, and slt_how2sign on RWTH-PHOENIX-2014T, CSL-Daily, and How2Sign, respectively. Also, we introduce benchmarking on the BornilDB v1.0 dataset for the first time. Our method sets a benchmark for future research, emphasizing the importance of gloss-free translation to improve communication accessibility for the deaf and hard of hearing.

[69] Learning from Natural Language Feedback for Personalized Question Answering

Alireza Salemi, Hamed Zamani

Main category: cs.CL

TL;DR: The paper introduces VAC, a framework using natural language feedback (NLF) instead of scalar rewards to personalize large language models (LLMs) for question answering, achieving superior results.

DetailsMotivation: Scalar rewards in current LLM personalization methods provide weak feedback, limiting learning and personalization quality.

Method: VAC replaces scalar rewards with NLF, generated from user profiles and question narratives, to iteratively refine responses. Training alternates between optimizing the feedback model and fine-tuning the policy model.

Result: VAC outperforms state-of-the-art methods on the LaMP-QA benchmark and human evaluations confirm higher response quality.

Conclusion: NLF is more effective than scalar rewards for optimizing personalized question answering.

Abstract: Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark that consists of three diverse domains demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.

[70] Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs

Xiangqi Jin, Yuxuan Wang, Yifeng Gao, Zichen Wen, Biqing Qi, Dongrui Liu, Linfeng Zhang

Main category: cs.CL

TL;DR: ICE is a novel framework for dLLMs, enabling in-place prompting and early exit, improving accuracy and speed.

DetailsMotivation: Traditional LLMs lack flexibility due to prefix-only prompting and sequential generation. dLLMs offer bidirectional attention and iterative refinement, prompting the need for more adaptable methods like ICE.

Method: ICE transforms prefix-only prompting into in-place prompting for dLLMs, integrating prompts in masked token positions and using a confidence-aware early exit mechanism.

Result: ICE achieves up to 17.29% accuracy improvement and 4.12× speedup on GSM8K, and up to 276.67× acceleration on MMLU while maintaining performance.

Conclusion: ICE effectively enhances dLLMs by enabling flexible in-place prompting and reducing computational overhead, demonstrating significant improvements in accuracy and efficiency.

Abstract: Although large language models (LLMs) have achieved remarkable success, their prefix-only prompting paradigm and sequential generation process offer limited flexibility for bidirectional information. Diffusion large language models (dLLMs) present new opportunities through their bidirectional attention mechanisms and iterative refinement processes, enabling more flexible in-place prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting with Early Exit), a novel framework that transforms prefix-only prompting into in-place prompting specifically designed for dLLMs. ICE integrates in-place prompts directly within masked token positions during iterative refinement and employs a confidence-aware early exit mechanism to significantly reduce computational overhead. Extensive experiments demonstrate ICE’s effectiveness, achieving up to 17.29% accuracy improvement with 4.12× speedup on GSM8K, and up to 276.67× acceleration on MMLU while maintaining competitive performance.
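The early-exit mechanism can be sketched as a refinement loop that unmasks tokens once their confidence clears a threshold; `denoise_step` below is a hypothetical stand-in for the dLLM's sampler.

```python
# Sketch of confidence-aware early exit during iterative mask refinement.
def refine_with_early_exit(tokens, mask, denoise_step, max_steps=64, tau=0.9):
    for step in range(max_steps):
        tokens, conf = denoise_step(tokens, mask)  # one refinement pass
        mask = conf < tau                          # keep low-confidence slots masked
        if not mask.any():                         # all tokens confident: exit early
            return tokens, step + 1
    return tokens, max_steps
```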

[71] Beyond “Not Novel Enough”: Enriching Scholarly Critique with LLM-Assisted Feedback

Osama Mohammed Afzal, Preslav Nakov, Tom Hope, Iryna Gurevych

Main category: cs.CL

TL;DR: The paper introduces an automated method for novelty assessment in peer review, achieving high alignment with human reviewers and outperforming existing LLM baselines.

DetailsMotivation: Novelty assessment is understudied in peer review, especially in high-volume fields like NLP, where reviewer capacity is strained.

Method: A structured approach models expert reviewer behavior through content extraction, related work retrieval/synthesis, and evidence-based comparison.

Result: The method achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, outperforming LLM baselines.

Conclusion: Structured LLM-assisted approaches can enhance peer review rigor and transparency while complementing human expertise.

Abstract: Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence-based assessment. Our method is informed by a large-scale analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, substantially outperforming existing LLM-based baselines. The method produces detailed, literature-aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.

[72] Reinforced Language Models for Sequential Decision Making

Jim Dilkes, Vahid Yazdanpanah, Sebastian Stein

Main category: cs.CL

TL;DR: MS-GRPO improves smaller LLMs for multi-step tasks, outperforming larger models.

DetailsMotivation: Address limitations of small LLMs in sequential decision-making by improving post-training methods for multi-step tasks.

Method: Introduces MS-GRPO, a post-training algorithm using TSMG and LAP frameworks, with credit assignment and advantage-weighted sampling.

Result: Post-trained 3B model outperforms 72B baseline by 50% on Frozen Lake.

Conclusion: Targeted post-training is efficient for enhancing LLM decision-making without relying on model scale.

Abstract: Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
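The credit-assignment rule is easy to state concretely: every step of an episode inherits the episode's full cumulative reward, normalized relative to the sampled group. A simplified sketch:

```python
# Sketch of MS-GRPO-style credit assignment (simplified).
import numpy as np

def group_relative_step_advantages(episode_rewards: list[list[float]]):
    returns = np.array([sum(r) for r in episode_rewards])      # cumulative reward
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)  # group-relative
    # Broadcast each episode's advantage to all of its steps.
    return [[a] * len(r) for a, r in zip(adv, episode_rewards)]

# Toy group of 3 episodes with per-step rewards.
print(group_relative_step_advantages([[0, 0, 1], [0, 0, 0], [0, 1, 1]]))
```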

[73] Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning

Chongyuan Dai, Jinpeng Hu, Hongchang Shi, Zhuo Li, Xun Yang, Meng Wang

Main category: cs.CL

TL;DR: Psyche-R1 is a Chinese psychological LLM integrating empathy, expertise, and reasoning, outperforming benchmarks with a novel data pipeline and hybrid training.

DetailsMotivation: Addressing the shortage of mental health professionals by enhancing LLMs with reasoning and empathy for reliable psychological support.

Method: Developed a data synthesis pipeline for 75k psychological questions with rationales and 73k empathetic dialogues, using hybrid training (GRPO for reasoning, SFT for empathy).

Result: Psyche-R1 (7B) matches performance of larger models (671B DeepSeek-R1) in psychological benchmarks.

Conclusion: Psyche-R1 demonstrates the potential of integrating reasoning and empathy in LLMs for mental health applications.

Abstract: Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experimental results demonstrate the effectiveness of Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves results comparable to the 671B DeepSeek-R1.

[74] From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms

Zhaokun Jiang, Ziyin Zhang

Main category: cs.CL

TL;DR: A multi-dimensional framework for automated interpreting quality assessment, emphasizing explainability, outperforms traditional methods by integrating feature engineering, data augmentation, and SHAP analysis.

DetailsMotivation: Address gaps in existing research: insufficient language use quality examination, data scarcity/ imbalance, and lack of model prediction explanations.

Method: Proposes a framework combining feature engineering, data augmentation, and explainable ML (using SHAP analysis) with transparent features.

Result: Strong predictive performance on an English-Chinese dataset; identifies key features for fidelity, fluency, and language use.

Conclusion: The framework offers a scalable, transparent alternative to human evaluation, providing detailed feedback for learners and supporting self-regulated learning.

Abstract: Recent advancements in machine learning have spurred growing interest in automated interpreting quality assessment. Nevertheless, existing research suffers from insufficient examination of language use quality, unsatisfactory modeling effectiveness due to data scarcity and imbalance, and a lack of effort to explain model predictions. To address these gaps, we propose a multi-dimensional modeling framework that integrates feature engineering, data augmentation, and explainable machine learning. This approach prioritizes explainability over “black box” predictions by utilizing only construct-relevant, transparent features and conducting Shapley Value (SHAP) analysis. Our results demonstrate strong predictive performance on a novel English-Chinese consecutive interpreting dataset, identifying BLEURT and CometKiwi scores as the strongest predictive features for fidelity, pause-related features for fluency, and Chinese-specific phraseological diversity metrics for language use. Overall, by placing particular emphasis on explainability, we present a scalable, reliable, and transparent alternative to traditional human evaluation, facilitating detailed diagnostic feedback for learners and supporting self-regulated learning, advantages not afforded by automated scores in isolation.

[75] SSRL: Self-Search Reinforcement Learning

Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou

Main category: cs.CL

TL;DR: LLMs can simulate agentic search tasks in RL, reducing reliance on external search engines. Self-Search RL (SSRL) enhances this capability, improving knowledge utilization and reducing hallucination.

DetailsMotivation: To reduce costly interactions with external search engines in RL by leveraging LLMs' intrinsic search capabilities.

Method: Quantify LLMs’ search ability via Self-Search, then enhance it with SSRL using format-based and rule-based rewards.

Result: SSRL-trained models achieve high performance, reduce hallucination, and integrate well with external tools.

Conclusion: LLMs can effectively support scalable RL training by leveraging internal knowledge and reducing external dependencies.

Abstract: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs’ Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
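
For intuition, here is a minimal sketch of what a combined format- and rule-based reward might look like; the tag names and weights are assumptions for illustration, not the paper's specification:

```python
import re

def ssrl_style_reward(response: str, gold_answer: str) -> float:
    """Sketch of a format- plus rule-based reward (tags and weights assumed)."""
    reward = 0.0
    # Format reward: the response must wrap its internal self-search trace.
    if re.search(r"<think>.*?</think>", response, re.S):
        reward += 0.2
    answer = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if answer:
        reward += 0.2
        # Rule-based reward: exact match against the reference answer.
        if answer.group(1).strip().lower() == gold_answer.strip().lower():
            reward += 0.6
    return reward

print(ssrl_style_reward("<think>recall facts</think><answer>Paris</answer>", "Paris"))  # 1.0
```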

[76] A Survey on Diffusion Language Models

Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

Main category: cs.CL

TL;DR: Diffusion Language Models (DLMs) offer parallel token generation via iterative denoising, reducing latency and enabling fine-grained control, with performance now rivaling autoregressive models. This survey provides a comprehensive overview, taxonomy, and analysis of DLMs, covering their evolution, techniques, inference optimizations, multimodal extensions, and challenges.

DetailsMotivation: To present a holistic overview of DLMs, highlighting their advantages over autoregressive models and their growing potential in NLP tasks.

Method: The survey traces DLM evolution, compares them with other paradigms (autoregressive, masked), and analyzes foundational principles, state-of-the-art models, pre-training strategies, post-training methods, and inference optimizations.

Result: DLMs achieve comparable performance to autoregressive models with faster inference, and the survey provides a detailed taxonomy, techniques, and applications.

Conclusion: DLMs are a promising alternative to autoregressive models, but challenges like efficiency and long-sequence handling remain. Future research directions are outlined to advance the field.

Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.
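
As a toy sketch of the iterative parallel denoising loop the survey covers (the predictor is a random stand-in for a real bidirectional denoiser, and the confidence-based unmasking schedule is one common but illustrative choice):

```python
import numpy as np

MASK = -1
rng = np.random.default_rng(0)

def predict_logits(tokens, vocab=50):
    # Stand-in for a bidirectional denoiser; real DLMs run a transformer here.
    return rng.normal(size=(len(tokens), vocab))

def denoise(seq_len=8, vocab=50, steps=4):
    tokens = np.full(seq_len, MASK)
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = predict_logits(tokens)[masked]
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        picks, conf = probs.argmax(axis=1), probs.max(axis=1)
        # Commit only the highest-confidence fraction each step; the rest
        # stay masked for the next parallel refinement pass.
        k = max(1, masked.size // 2)
        keep = np.argsort(-conf)[:k]
        tokens[masked[keep]] = picks[keep]
    return tokens

print(denoise())  # all positions filled in a few parallel passes
```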

[77] Knowledge-based Consistency Testing of Large Language Models

Sai Sathiesh Rajan, Ezekiel Soremekun, Sudipta Chattopadhyay

Main category: cs.CL

TL;DR: KonTest is an automated framework that tests LLM inconsistencies and knowledge gaps using a knowledge graph, revealing significant errors and gaps, and proposes mitigation via weighted ensemble models.

DetailsMotivation: To systematically expose and measure inconsistencies and knowledge gaps in LLMs, addressing their reliability and accuracy.

Method: Uses a knowledge graph to construct test cases, probing inconsistencies via semantically-equivalent queries and oracles, and mitigates gaps with a weighted LLM ensemble.

Result: KonTest found 19.2% error-inducing inputs and a 16.5% knowledge gap across LLMs, reducing gaps by 32.48% with mitigation. GPT3.5 was 60%-68% effective in knowledge construction.

Conclusion: KonTest effectively identifies and mitigates LLM inconsistencies and knowledge gaps, though GPT3.5 is less suitable for knowledge-based testing.

Abstract: In this work, we systematically expose and measure the inconsistency and knowledge gaps of Large Language Models (LLMs). Specifically, we propose an automated testing framework (called KonTest) which leverages a knowledge graph to construct test cases. KonTest probes and measures the inconsistencies in the LLM’s knowledge of the world via a combination of semantically-equivalent queries and test oracles (metamorphic or ontological oracle). KonTest further mitigates knowledge gaps via a weighted LLM model ensemble. Using four state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that KonTest generates 19.2% error-inducing inputs (1917 errors from 9979 test inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. A mitigation method informed by KonTest’s test suite reduces the LLM knowledge gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
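
A minimal sketch of the metamorphic-consistency idea, with a toy stand-in model; KonTest's actual oracles and knowledge-graph-driven test generation are considerably richer than this:

```python
def metamorphic_consistency(llm, question: str, paraphrases: list[str]) -> float:
    """Fraction of semantically-equivalent queries that agree with the base
    answer: a simplified stand-in for a metamorphic test oracle."""
    base = llm(question).strip().lower()
    agree = sum(llm(p).strip().lower() == base for p in paraphrases)
    return agree / len(paraphrases)

# Toy stand-in model: answers by keyword lookup.
def toy_llm(q: str) -> str:
    return "Paris" if "France" in q else "unknown"

score = metamorphic_consistency(
    toy_llm,
    "What is the capital of France?",
    ["Name France's capital city.", "The capital of France is?"],
)
print(score)  # 1.0 means the model answered consistently on this probe
```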

[78] This Candidate is [MASK]. Prompt-based Sentiment Extraction and Reference Letters

Fabian Slonimczyk

Main category: cs.CL

TL;DR: The paper introduces a prompt-based method for sentiment extraction from text using pre-trained LLMs, demonstrating its superiority over traditional methods in economics/finance. Applied to reference letters, it links sentiment to job outcomes and reveals gender biases.

DetailsMotivation: To simplify and improve sentiment analysis in economics/finance by leveraging pre-trained LLMs, addressing limitations of existing methods like bag-of-words or fine-tuned models.

Method: Prompt-based sentiment extraction using pre-trained LLMs, applied to reference letters to analyze sentiment and its impact on job outcomes. Comparisons made with other sentiment analysis methods.

Result: Higher sentiment in letters correlates with better job outcomes; disagreement among writers harms candidates. Prompt-based method outperforms alternatives. Gender biases in letter content affect outcomes.

Conclusion: Prompt-based sentiment extraction is effective and superior to other methods, revealing actionable insights like gender biases in reference letters.

Abstract: I propose a relatively simple way to deploy pre-trained large language models (LLMs) in order to extract sentiment and other useful features from text data. The method, which I refer to as prompt-based sentiment extraction, offers multiple advantages over other methods used in economics and finance. I apply my prompt-based strategy to a hand-collected corpus of confidential reference letters (RLs). I show that the sentiment content of RLs is clearly reflected in job market outcomes. Candidates with higher average sentiment in their letters perform markedly better regardless of the measure of success chosen. Moreover, I show that disagreement among letter writers negatively affects the job market candidate’s performance. I compare my sentiment extraction approach to other commonly used methods for sentiment analysis: “bag-of-words” approaches, fine-tuned language models, and querying advanced chatbots. I find that no other method can reproduce the results obtained by prompt-based sentiment extraction. Finally, I slightly modify the method to obtain “gendered” sentiment scores (as in Eberhardt et al., 2023). I show that letters of reference written for female candidates emphasize “grindstone” personality traits, whereas male candidates’ letters emphasize “standout” traits. These gender differences negatively affect women’s job market outcomes.
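
The paper's title hints at the mechanics: a cloze prompt scored by a masked language model. Below is a hedged sketch using the Hugging Face fill-mask pipeline; the prompt wording, target adjectives, and model choice are illustrative, not the paper's exact setup:

```python
from transformers import pipeline

# Score a letter by the probability mass the masked LM assigns to polar
# adjectives in a cloze prompt. Targets and prompt are illustrative choices.
fill = pipeline("fill-mask", model="distilbert-base-uncased")

letter = "She led two projects and mentored junior staff."
prompt = f"{letter} Overall, this candidate is [MASK]."

preds = fill(prompt, targets=["excellent", "good", "average", "poor"])
scores = {p["token_str"]: p["score"] for p in preds}
sentiment = (scores.get("excellent", 0) + scores.get("good", 0)) - (
    scores.get("average", 0) + scores.get("poor", 0)
)
print(scores, sentiment)  # positive values suggest a favorable letter
```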

[79] Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding

Fabian David Schmidt, Ivan Vulić, Goran Glavaš, David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: Fleurs-SLU introduces a multilingual SLU benchmark for low-resource languages, evaluating end-to-end models, cascaded systems, and speech-LLMs, showing cascaded systems’ robustness and the potential of speech-LLMs.

DetailsMotivation: Address the lack of SLU evaluation for low-resource languages without reliable ASR or writing systems, focusing on deeper tasks beyond intent classification.

Method: Develop Fleurs-SLU with 692 hours for topical classification (102 languages) and 944 hours for listening comprehension (92 languages). Evaluate end-to-end models, cascaded ASR+LLM systems, and speech-LLMs.

Result: Cascaded systems are robust; pretrained speech encoders compete in topical classification. Speech-LLMs match/surpass cascaded systems. Strong correlation between ASR, speech-to-text translation, and SLU performance.

Conclusion: Multilingual SLU benefits from robust ASR and semantic representations, with cascaded systems and speech-LLMs showing promise for low-resource languages.

Abstract: Spoken language understanding (SLU) is indispensable for half of all living languages that lack a formal writing system. Unlike for high-resource languages, for these languages, we cannot offload semantic understanding of speech to the cascade of automatic speech recognition (ASR) and text-based large language models (LLMs). Even if low-resource languages possess a writing system, ASR for these languages remains unreliable due to limited bimodal speech and text training data. Nonetheless, the evaluation of multilingual SLU is limited to shallow tasks such as intent classification or language identification. This is why we present Fleurs-SLU, a multilingual SLU benchmark that encompasses (i) 692 hours of speech for topical utterance classification in 102 languages and (ii) multiple-choice question answering via listening comprehension spanning 944 hours of speech across 92 languages. We extensively evaluate end-to-end speech classification models, cascaded systems that combine speech-to-text transcription with subsequent LLM-based classification, and multimodal speech-LLMs on Fleurs-SLU. Our results show that cascaded systems are more robust in multilingual SLU, though well-pretrained speech encoders can perform competitively in topical speech classification. Closed-source speech-LLMs match or surpass the performance of cascaded systems. We observe a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, indicating mutual benefits between acoustic and semantic speech representations.

[80] Measuring Diversity in Synthetic Datasets

Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian

Main category: cs.CL

TL;DR: The paper introduces DCScore, a novel method for measuring the diversity of synthetic datasets generated by LLMs, focusing on classification tasks. It offers theoretical and empirical validation, showing improved correlation with diversity metrics and reduced computational costs.

DetailsMotivation: Accurately measuring the diversity of synthetic datasets from LLMs is crucial for robust NLP model performance but remains challenging.

Method: DCScore evaluates diversity as a sample classification task, leveraging mutual relationships among samples and satisfying diversity-related axioms.

Result: DCScore shows stronger correlation with diversity metrics and reduces computational costs compared to existing methods.

Conclusion: DCScore is an effective and efficient method for evaluating synthetic dataset diversity, supported by theoretical and empirical evidence.

Abstract: Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets, an aspect crucial for robust model performance, remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing methods. Code is available at: https://github.com/bluewhalelab/dcscore.
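
One plausible reading of "diversity as sample classification" in miniature: treat each sample as its own class, softmax a similarity row, and reward probability mass on the diagonal. This is a hedged sketch, not the paper's exact formula:

```python
import numpy as np

def dcscore_like(embeddings: np.ndarray, tau: float = 0.1) -> float:
    """Diversity-as-classification sketch: each sample should be classified
    as itself; duplicates spread their mass and lower the score."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = E @ E.T / tau
    sim -= sim.max(axis=1, keepdims=True)
    P = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return float(np.trace(P) / len(P))  # near 1/n for duplicates, near 1 if diverse

rng = np.random.default_rng(0)
diverse = rng.normal(size=(50, 32))
redundant = np.tile(rng.normal(size=(1, 32)), (50, 1)) + rng.normal(0, 0.01, (50, 32))
print(dcscore_like(diverse), dcscore_like(redundant))  # high vs. low score
```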

[81] LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint

Qianli Ma, Dongrui Liu, Qian Chen, Linfeng Zhang, Jing Shao

Main category: cs.CL

TL;DR: LED-Merging is a three-stage framework that addresses safety-utility conflicts in model merging by locating task-specific neurons, dynamically selecting critical ones, and isolating conflicting updates, reducing harmful responses while preserving utility.

DetailsMotivation: Fine-tuning LLMs for specialized tasks is costly, and existing model merging methods degrade safety safeguards due to neuron misidentification and cross-task interference.

Method: LED-Merging uses gradient-based attribution to locate task-specific neurons, multi-model importance fusion to elect critical neurons, and parameter isolation to disjoint conflicting updates.

Result: LED-Merging reduces harmful response rates by 31.4% on Llama-3-8B-Instruct while preserving 95% of utility performance, achieving 52.39% accuracy on GSM8K.

Conclusion: LED-Merging resolves safety-utility conflicts, offering a lightweight, training-free solution for reliable multi-task LLMs.

Abstract: Fine-tuning pre-trained Large Language Models (LLMs) for specialized tasks incurs substantial computational and data costs. While model merging offers a training-free solution to integrate multiple task-specific models, existing methods suffer from safety-utility conflicts where enhanced general capabilities degrade safety safeguards. We identify two root causes: neuron misidentification due to simplistic parameter magnitude-based selection, and cross-task neuron interference during merging. To address these challenges, we propose LED-Merging, a three-stage framework that Locates task-specific neurons via gradient-based attribution, dynamically Elects critical neurons through multi-model importance fusion, and Disjoints conflicting updates through parameter isolation. Extensive experiments on Llama-3-8B, Mistral-7B, and Llama2-13B demonstrate that LED-Merging effectively reduces harmful response rates, showing a 31.4% decrease on Llama-3-8B-Instruct on HarmBench, while simultaneously preserving 95% of utility performance, such as achieving 52.39% accuracy on GSM8K. LED-Merging resolves safety-utility conflicts and provides a lightweight, training-free paradigm for constructing reliable multi-task LLMs. Code is available at https://github.com/MqLeet/LED-Merging.
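
A hedged single-tensor sketch of the locate-elect-disjoint pattern; the attribution score, top-k election, and conflict handling below are simplified assumptions rather than the paper's full multi-model importance fusion:

```python
import torch

def attribution(param: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # First-order importance |param * grad|, a common gradient attribution.
    return (param * grad).abs()

def led_style_merge(base, deltas, grads, keep=0.2):
    """Merge per-task deltas into a base tensor. deltas: fine-tuned minus
    base weights per task; grads: per-task gradients. Positions claimed by
    more than one task are zeroed out (the 'disjoint' step)."""
    masks = []
    for d, g in zip(deltas, grads):
        imp = attribution(base + d, g)                       # locate
        k = max(1, int(keep * imp.numel()))
        thresh = imp.flatten().topk(k).values.min()
        masks.append((imp >= thresh).float())                # elect top-k
    overlap = torch.stack(masks).sum(0) > 1                  # conflicts
    merged = base.clone()
    for d, m in zip(deltas, masks):
        merged += d * m * (~overlap).float()                 # disjoint merge
    return merged

base = torch.randn(4, 4)
deltas = [torch.randn(4, 4) * 0.1, torch.randn(4, 4) * 0.1]
grads = [torch.randn(4, 4), torch.randn(4, 4)]
print(led_style_merge(base, deltas, grads))
```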

[82] TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto

Main category: cs.CL

TL;DR: TikZero enables zero-shot text-guided graphics program synthesis by bridging image representations, outperforming baselines and matching larger models like GPT-4o.

DetailsMotivation: Aligning text captions with graphics programs for high geometric precision and editability is challenging due to scarce aligned training data.

Method: TikZero decouples graphics program generation from text understanding using image representations as an intermediary, enabling independent training on disparate data sources.

Result: TikZero outperforms baselines and matches or exceeds larger models like GPT-4o when leveraging caption-aligned graphics programs.

Conclusion: TikZero effectively reconciles disparate data sources for high-quality graphics program synthesis, offering a scalable solution.

Abstract: Automatically synthesizing figures from text captions is a compelling capability. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.

[83] Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning

Donghao Huang, Zhaoxia Wang

Main category: cs.CL

TL;DR: DeepSeek-R1, an open-source reasoning model, outperforms GPT-4o in few-shot sentiment analysis with higher accuracy and efficiency, while offering better explainability.

DetailsMotivation: The study addresses the challenge of balancing accuracy, efficiency, and explainability in sentiment analysis using LLMs, comparing DeepSeek-R1 with GPT-4o variants.

Method: The study evaluates DeepSeek-R1 and its distilled variants against GPT-4o and GPT-4o-mini, testing few-shot learning curves and performance on sentiment tasks.

Result: DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with 5 shots, an eightfold improvement in few-shot efficiency over GPT-4o.

Conclusion: DeepSeek-R1 is a powerful, interpretable open-source alternative to GPT-4o, excelling in accuracy, efficiency, and explainability despite reduced throughput.

Abstract: Large language models (LLMs) have transformed sentiment analysis, yet balancing accuracy, efficiency, and explainability remains a critical challenge. This study presents the first comprehensive evaluation of DeepSeek-R1, an open-source reasoning model, against OpenAI’s GPT-4o and GPT-4o-mini. We test the full 671B model and its distilled variants, systematically documenting few-shot learning curves. Our experiments show DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with just 5 shots, an eightfold improvement in few-shot efficiency over GPT-4o. Architecture-specific distillation effects emerge, where a 32B Qwen2.5-based model outperforms the 70B Llama-based variant by 6.69 percentage points. While its reasoning process reduces throughput, DeepSeek-R1 offers superior explainability via transparent, step-by-step traces, establishing it as a powerful, interpretable open-source alternative.

[84] Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models

Youmi Ma, Sakae Mizuki, Kazuki Fujii, Taishi Nakamura, Masanari Ohi, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Koki Maeda, Kakeru Hattori, Takumi Okamoto, Shigeki Ishida, Rio Yokota, Hiroya Takamura, Naoaki Okazaki

Main category: cs.CL

TL;DR: Instruction-tuning with human-written instructions and LLM-generated responses outperforms prior methods, even in other languages, though cultural knowledge gaps remain.

DetailsMotivation: To determine if human-originated signals are still necessary for instruction tuning and to improve LLM performance.

Method: Pair human-written instructions with LLM-generated responses to create datasets, then fine-tune LLMs on this data.

Result: LLMs fine-tuned on the new datasets outperform existing ones, including in Japanese, but lack culture-specific knowledge.

Conclusion: Human-originated signals enhance instruction tuning, and the approach is adaptable to other languages, though cultural knowledge gaps need addressing.

Abstract: Instruction tuning is crucial for enabling Large Language Models (LLMs) to solve real-world tasks. Prior work has shown the effectiveness of instruction-tuning data synthesized solely from LLMs, raising a fundamental question: Do we still need human-originated signals for instruction tuning? This work answers the question affirmatively: we build state-of-the-art instruction-tuning datasets sourced from human-written instructions, by simply pairing them with LLM-generated responses. LLMs fine-tuned on our datasets consistently outperform those fine-tuned on existing ones. Our data construction approach can be easily adapted to other languages; we build datasets for Japanese and confirm that LLMs tuned with our data reach state-of-the-art performance. Analyses suggest that instruction-tuning in a new language allows LLMs to follow instructions, while the tuned models exhibit a notable lack of culture-specific knowledge in that language. The datasets and fine-tuned models will be publicly available. Our datasets, synthesized with open-weight LLMs, are openly distributed under permissive licenses, allowing for diverse use cases.

[85] ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool Learning

Xingshan Zeng, Weiwen Liu, Xu Huang, Zezhong Wang, Lingzhi Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Ruiming Tang, Qun Liu

Main category: cs.CL

TL;DR: ToolACE-R is a framework for tool learning in LLMs, featuring iterative training and adaptive refinement to maximize model potential and improve tool invocation.

DetailsMotivation: Existing approaches focus on data synthesis for fine-tuning but overlook fully stimulating model potential. ToolACE-R addresses this gap.

Method: ToolACE-R includes model-aware iterative training, self-refinement corpus, and adaptive self-refinement for test-time scaling.

Result: Experiments show ToolACE-R achieves competitive performance and improves tool invocation efficiently.

Conclusion: ToolACE-R is effective and generalizable, offering a promising direction for efficient and scalable tool learning.

Abstract: Tool learning, which allows Large Language Models (LLMs) to leverage external tools for solving complex user tasks, has emerged as a promising avenue for extending model capabilities. However, existing approaches primarily focus on data synthesis for fine-tuning LLMs to invoke tools effectively, largely ignoring how to fully stimulate the potential of the model. In this paper, we propose ToolACE-R, a novel framework that includes both model-aware iterative training and adaptive refinement for tool learning. ToolACE-R features a model-aware iterative training procedure that progressively adjusts training samples based on the model’s evolving capabilities to maximize its potential. Additionally, it incorporates a self-refinement training corpus that emphasizes the LLM’s ability to iteratively refine its tool calls, optimizing performance without requiring external feedback. Furthermore, we introduce an adaptive self-refinement mechanism for efficient test-time scaling, where the trained model can autonomously determine when to stop the process based on iterative self-refinement. We conduct extensive experiments across several benchmark datasets, showing that ToolACE-R achieves competitive performance compared to advanced API-based models. The performance of tool invocation can be further improved efficiently through adaptive self-refinement. These results highlight the effectiveness and generalizability of ToolACE-R, offering a promising direction for more efficient and scalable tool learning.
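
The adaptive self-refinement loop can be pictured as follows; the stop hook and stub model are hypothetical stand-ins for the trained model's own stopping decision:

```python
def adaptive_refine(model, query: str, max_rounds: int = 3) -> str:
    """Test-time sketch: the model keeps revising its own tool call and stops
    when it judges no further improvement is needed."""
    tool_call = model.call(query, previous=None)
    for _ in range(max_rounds):
        if model.should_stop(query, tool_call):
            break
        tool_call = model.call(query, previous=tool_call)  # refine last attempt
    return tool_call

class StubModel:
    def __init__(self):
        self.rounds = 0
    def call(self, query, previous):
        self.rounds += 1
        return f"get_weather(city='Paris')  # draft {self.rounds}"
    def should_stop(self, query, tool_call):
        return self.rounds >= 2  # pretend the second draft is good enough

print(adaptive_refine(StubModel(), "Weather in Paris?"))
```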

[86] CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring

Clayton Cohn, Ashwin T S, Naveeduddin Mohammed, Gautam Biswas

Main category: cs.CL

TL;DR: CoTAL improves GPT-4’s scoring performance in formative assessments by combining Chain-of-Thought Prompting and Active Learning, achieving up to 38.9% gains over baselines.

DetailsMotivation: To explore the generalization of prompt engineering in educational contexts and improve LLM-based formative assessment scoring.

Method: Uses Evidence-Centered Design, human-in-the-loop prompt engineering, Chain-of-Thought Prompting, and iterative feedback from teachers and students.

Result: CoTAL enhances GPT-4’s scoring accuracy by up to 38.9% and is judged effective by teachers and students.

Conclusion: CoTAL is a promising approach for automating and refining formative assessments, with feedback improving grading and explanations.

Abstract: Large language models (LLMs) have created new opportunities to assist teachers and support student learning. While researchers have explored various prompt engineering approaches in educational contexts, the degree to which these approaches generalize across domains–such as science, computing, and engineering–remains underexplored. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) to align assessments and rubrics with curriculum goals, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates chain-of-thought (CoT) prompting and teacher and student feedback to iteratively refine questions, rubrics, and LLM prompts. Our findings demonstrate that CoTAL improves GPT-4’s scoring performance across domains, achieving gains of up to 38.9% over a non-prompt-engineered baseline (i.e., without labeled examples, chain-of-thought prompting, or iterative refinement). Teachers and students judge CoTAL to be effective at scoring and explaining responses, and their feedback produces valuable insights that enhance grading accuracy and explanation quality.

[87] Curse of High Dimensionality Issue in Transformer for Long-context Modeling

Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan

Main category: cs.CL

TL;DR: The paper introduces Dynamic Group Attention (DGA) to reduce computational inefficiencies in Transformer-based LLMs by optimizing attention sparsity through a group coding strategy.

DetailsMotivation: Addressing the computational inefficiencies in long-context modeling due to redundant attention computations in Transformer-based LLMs.

Method: Reformulates sequence modeling as a supervised task, analyzes attention sparsity, and proposes DGA to aggregate less important tokens using group coding.

Result: DGA significantly reduces computational costs while maintaining competitive performance.

Conclusion: DGA effectively optimizes attention computation by leveraging group coding, improving efficiency without sacrificing performance.

Abstract: Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.
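
A toy sketch of group-based attention reduction: attend to the most important keys individually and pool the rest into a few group tokens. The importance score and mean pooling here are illustrative simplifications of the paper's learned group coding:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dga_like_attention(q, k, v, n_keep=4, n_groups=2):
    # Score keys against the query and keep the top-n_keep individually.
    scores = (q @ k.T) / np.sqrt(q.shape[-1])        # (1, n_tokens)
    order = np.argsort(-scores[0])
    keep, rest = order[:n_keep], order[n_keep:]
    # Pool the less important tokens into a few group key/value pairs.
    chunks = [c for c in np.array_split(rest, n_groups) if len(c)]
    k_group = np.stack([k[c].mean(axis=0) for c in chunks])
    v_group = np.stack([v[c].mean(axis=0) for c in chunks])
    K = np.concatenate([k[keep], k_group])
    V = np.concatenate([v[keep], v_group])
    attn = softmax((q @ K.T) / np.sqrt(q.shape[-1]))
    return attn @ V

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 16))
k = rng.normal(size=(32, 16))
v = rng.normal(size=(32, 16))
print(dga_like_attention(q, k, v).shape)  # (1, 16), with far fewer key/value slots
```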

[88] AF-MAT: Aspect-aware Flip-and-Fuse xLSTM for Aspect-based Sentiment Analysis

Adamu Lawan, Juhua Pu, Haruna Yunusa, Muhammad Lawan, Mahmoud Basi, Muhammad Adam

Main category: cs.CL

TL;DR: AF-MAT, an xLSTM-based framework for ABSA, introduces Aspect-aware mLSTM and FlipMix for multi-scale context modeling, outperforming SOTA methods.

DetailsMotivation: Existing ABSA methods face trade-offs between efficiency and performance, with limitations in long-range dependency capture and computational cost. xLSTM's potential in ABSA is unexplored.

Method: AF-MAT combines AA-mLSTM (aspect-aware gating) and FlipMix (pf-Conv1D and ff-mLSTM for multi-scale context). MC2F dynamically fuses features.

Result: AF-MAT achieves higher accuracy than SOTA baselines on three benchmark datasets.

Conclusion: AF-MAT effectively addresses ABSA challenges by leveraging xLSTM’s strengths and novel mechanisms, demonstrating superior performance.

Abstract: Aspect-based Sentiment Analysis (ABSA) is a crucial NLP task that extracts fine-grained opinions and sentiments from text, such as product reviews and customer feedback. Existing methods often trade off efficiency for performance: traditional LSTM or RNN models struggle to capture long-range dependencies, transformer-based methods are computationally costly, and Mamba-based approaches rely on CUDA and weaken local dependency modeling. The recently proposed Extended Long Short-Term Memory (xLSTM) model offers a promising alternative by effectively capturing long-range dependencies through exponential gating and enhanced memory variants, sLSTM for modeling local dependencies, and mLSTM for scalable, parallelizable memory. However, xLSTM’s application in ABSA remains unexplored. To address this, we introduce Aspect-aware Flip-and-Fuse xLSTM (AF-MAT), a framework that leverages xLSTM’s strengths. AF-MAT features an Aspect-aware matrix LSTM (AA-mLSTM) mechanism that introduces a dedicated aspect gate, enabling the model to selectively emphasize tokens semantically relevant to the target aspect during memory updates. To model multi-scale context, we incorporate a FlipMix block that sequentially applies a partially flipped Conv1D (pf-Conv1D) to capture short-range dependencies in reverse order, followed by a fully flipped mLSTM (ff-mLSTM) to model long-range dependencies via full sequence reversal. Additionally, we propose MC2F, a lightweight Multihead Cross-Feature Fusion based on mLSTM gating, which dynamically fuses AA-mLSTM outputs (queries and keys) with FlipMix outputs (values) for adaptive representation integration. Experiments on three benchmark datasets demonstrate that AF-MAT outperforms state-of-the-art baselines, achieving higher accuracy in ABSA tasks.

[89] Meanings are like Onions: a Layered Approach to Metaphor Processing

Silvia Cappa, Anna Sofia Lippolis, Stefano Zoia

Main category: cs.CL

TL;DR: A stratified model for metaphor processing is proposed, treating meaning as a multi-layered structure with content analysis, conceptual blending, and pragmatic intentionality, aiming for deeper computational metaphor interpretation.

DetailsMotivation: To address the complexity of metaphorical meaning beyond flat mappings, integrating cognitive and pragmatic layers for richer computational understanding.

Method: A three-dimensional framework: (1) content analysis, (2) conceptual blending, and (3) pragmatic intentionality, unified into a formal model.

Result: The model enables deeper, context-sensitive metaphor interpretation in computational systems, moving beyond surface associations.

Conclusion: The framework advances computational metaphor processing by incorporating cognitive and pragmatic layers, supporting richer reasoning.

Abstract: Metaphorical meaning is not a flat mapping between concepts, but a complex cognitive phenomenon that integrates multiple levels of interpretation. In this paper, we propose a stratified model of metaphor processing that treats meaning as an onion: a multi-layered structure comprising (1) content analysis, (2) conceptual blending, and (3) pragmatic intentionality. This three-dimensional framework allows for a richer and more cognitively grounded approach to metaphor interpretation in computational systems. At the first level, metaphors are annotated through basic conceptual elements. At the second level, we model conceptual combinations, linking components to emergent meanings. Finally, at the third level, we introduce a pragmatic vocabulary to capture speaker intent, communicative function, and contextual effects, aligning metaphor understanding with pragmatic theories. By unifying these layers into a single formal framework, our model lays the groundwork for computational methods capable of representing metaphorical meaning beyond surface associations, toward deeper, more context-sensitive reasoning.

[90] CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan

Main category: cs.CL

TL;DR: The paper introduces CodeJudgeBench to evaluate LLMs as judges in coding tasks, finding thinking models outperform non-thinking ones but highlighting issues with randomness and sensitivity in judgments.

DetailsMotivation: The motivation is to address the lack of benchmarks for evaluating LLMs as judges in coding tasks, which is crucial for benchmarking and improving response quality.

Method: The method involves creating CodeJudgeBench to test 26 LLM-as-a-Judge models across three coding tasks: code generation, code repair, and unit test generation.

Result: Results show thinking models outperform non-thinking ones, but all models exhibit randomness and sensitivity to response order and source. Pairwise comparison and retaining comments improve performance.

Conclusion: The conclusion highlights the potential of LLMs as judges in coding tasks but underscores concerns about reliability and consistency, suggesting further research into optimal prompting strategies.

Abstract: Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models significantly outperform non-thinking models on our carefully designed code judging tasks. Notably, even relatively small thinking models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still exhibit significant randomness in their judgment of coding tasks. For pairwise judging tasks, simply changing the order in which responses are presented can substantially impact accuracy. In addition, when judging code and unit tests written by different LLMs, LLM-as-a-Judge models also show variance in performance. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal prompting strategies for LLM-as-a-Judge. We find that using pair-wise comparison outperforms scalar point-wise judging. Furthermore, retaining comments and reasoning in the full, unprocessed LLM response leads to improved judge performance.
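
The order-sensitivity finding suggests a simple robustness check: query the judge under both presentation orders and only accept agreeing verdicts. A hedged sketch with a stub judge (not the benchmark's actual harness):

```python
def order_robust_judge(judge, prompt: str, resp_a: str, resp_b: str) -> str:
    """Query the judge twice with swapped response order; a verdict that flips
    with presentation order is treated as unreliable."""
    v1 = judge(prompt, resp_a, resp_b)                 # A shown first
    v2 = judge(prompt, resp_b, resp_a)                 # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # disagreement across orders, treat as no decision

def toy_judge(prompt, first, second):
    # Stub judge: prefers the shorter snippet, standing in for a real LLM.
    return "first" if len(first) < len(second) else "second"

print(order_robust_judge(toy_judge, "fix the bug", "x = 1", "x = 1  # noop"))  # A
```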

[91] DeepWriter: A Fact-Grounded Multimodal Writing Assistant Based On Offline Knowledge Base

Song Mao, Lejun Cheng, Pinlong Cai, Guohang Yan, Ding Wang, Botian Shi

Main category: cs.CL

TL;DR: DeepWriter is a customizable, multimodal writing assistant for specialized domains, leveraging a curated offline knowledge base to generate high-quality, factually grounded documents.

DetailsMotivation: LLMs lack deep domain-specific knowledge and hallucinate, while existing solutions like RAG and online search methods face inconsistency and unreliable content issues.

Method: DeepWriter uses task decomposition, outline generation, multimodal retrieval, and section-by-section composition with reflection, alongside a hierarchical knowledge representation.

Result: Experiments show DeepWriter surpasses baselines in factual accuracy and content quality for financial report generation.

Conclusion: DeepWriter addresses LLM limitations in specialized domains by combining structured knowledge and multimodal retrieval for professional-grade outputs.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various applications. However, their use as writing assistants in specialized domains like finance, medicine, and law is often hampered by a lack of deep domain-specific knowledge and a tendency to hallucinate. Existing solutions, such as Retrieval-Augmented Generation (RAG), can suffer from inconsistency across multiple retrieval steps, while online search-based methods often degrade quality due to unreliable web content. To address these challenges, we introduce DeepWriter, a customizable, multimodal, long-form writing assistant that operates on a curated, offline knowledge base. DeepWriter leverages a novel pipeline that involves task decomposition, outline generation, multimodal retrieval, and section-by-section composition with reflection. By deeply mining information from a structured corpus and incorporating both textual and visual elements, DeepWriter generates coherent, factually grounded, and professional-grade documents. We also propose a hierarchical knowledge representation to enhance retrieval efficiency and accuracy. Our experiments on financial report generation demonstrate that DeepWriter produces high-quality, verifiable articles that surpass existing baselines in factual accuracy and generated content quality.

[92] Echoes of Automation: The Increasing Use of LLMs in Newsmaking

Abolfazl Ansari, Delvin Ce Zhang, Nafis Irtiza Tripto, Dongwon Lee

Main category: cs.CL

TL;DR: The study analyzes AI-generated content in news articles, finding increased GenAI use, especially in local and college media, with impacts on writing style and formality.

DetailsMotivation: To address concerns about journalistic integrity and authorship due to the rise of Generative AI (GenAI) and LLMs.

Method: Analysis of over 40,000 news articles using three AI-text detectors (Binoculars, Fast-Detect GPT, GPTZero) and linguistic analysis.

Result: Substantial increase in GenAI use, especially in local and college news; AI often used in introductions, while conclusions remain manual. GenAI boosts word richness and readability but reduces formality.

Conclusion: GenAI adoption in journalism is growing, altering writing styles and formality, with implications for journalistic integrity.

Abstract: The rapid rise of Generative AI (GenAI), particularly LLMs, poses concerns for journalistic integrity and authorship. This study examines AI-generated content across over 40,000 news articles from major, local, and college news media, in various media formats. Using three advanced AI-text detectors (Binoculars, Fast-Detect GPT, and GPTZero), we find a substantial increase in GenAI use in recent years, especially in local and college news. Sentence-level analysis reveals LLMs are often used in the introductions of news articles, while conclusions are usually written manually. Linguistic analysis shows GenAI boosts word richness and readability but lowers formality, leading to more uniform writing styles, particularly in local media.

[93] Improved Personalized Headline Generation via Denoising Fake Interests from Implicit Feedback

Kejin Liu, Junhong Lian, Xiang Ao, Ningtao Wang, Xing Fu, Yu Cheng, Weiqiang Wang, Xinyu Liu

Main category: cs.CL

TL;DR: PHG-DIF is a framework for personalized headline generation that removes click noise and models user interests dynamically, achieving SOTA results.

DetailsMotivation: Existing methods fail to address noise in historical clickstreams, leading to inaccurate headlines.

Method: PHG-DIF uses dual-stage filtering to remove noise and multi-level temporal fusion for dynamic interest modeling.

Result: PHG-DIF significantly improves headline quality and achieves SOTA on the DT-PENS dataset.

Conclusion: The framework effectively mitigates click noise and enhances personalized headline generation.

Abstract: Accurate personalized headline generation hinges on precisely capturing user interests from historical behaviors. However, existing methods neglect personalized-irrelevant click noise in entire historical clickstreams, which may lead to hallucinated headlines that deviate from genuine user preferences. In this paper, we reveal the detrimental impact of click noise on personalized generation quality through rigorous analysis in both user and news dimensions. Based on these insights, we propose a novel Personalized Headline Generation framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF). PHG-DIF first employs dual-stage filtering to effectively remove clickstream noise, identified by short dwell times and abnormal click bursts, and then leverages multi-level temporal fusion to dynamically model users’ evolving and multi-faceted interests for precise profiling. Moreover, we release DT-PENS, a new benchmark dataset comprising the click behavior of 1,000 carefully curated users and nearly 10,000 annotated personalized headlines with historical dwell time annotations. Extensive experiments demonstrate that PHG-DIF substantially mitigates the adverse effects of click noise and significantly improves headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Our framework implementation and dataset are available at https://github.com/liukejin-up/PHG-DIF.
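
A minimal sketch of the dual-stage filtering idea, assuming clicks carry timestamps and dwell times; the thresholds below are illustrative, not the paper's calibrated values:

```python
def filter_clicks(clicks, min_dwell=10.0, burst_window=5.0, burst_size=3):
    # clicks: list of (timestamp_seconds, dwell_seconds, news_id) tuples.
    # Stage 1: drop short-dwell clicks (accidental or uninterested).
    stage1 = [c for c in clicks if c[1] >= min_dwell]
    # Stage 2: drop clicks that sit inside an abnormal burst of activity.
    kept = []
    for t, dwell, nid in stage1:
        neighbors = sum(1 for s, _, _ in stage1 if abs(s - t) <= burst_window)
        if neighbors < burst_size:
            kept.append((t, dwell, nid))
    return kept

clicks = [(0, 2, "a"), (1, 30, "b"), (2, 40, "c"), (3, 25, "d"), (100, 60, "e")]
print(filter_clicks(clicks))  # only the isolated, long-dwell click survives
```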

[94] The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs

Denis Janiak, Jakub Binkowski, Albert Sawczyn, Bogdan Gabrys, Ravid Shwartz-Ziv, Tomasz Kajdanowicz

Main category: cs.CL

TL;DR: Current hallucination detection methods for LLMs rely on ROUGE, which misaligns with human judgments. Human studies show ROUGE has high recall but low precision, leading to misleading performance estimates. Simple heuristics can rival complex methods, highlighting flaws in evaluation practices. Semantically aware metrics like LLM-as-Judge are needed for accurate assessment.

DetailsMotivation: The unreliability of ROUGE for evaluating hallucination detection methods in LLMs, as it misaligns with human judgments and leads to misleading performance estimates.

Method: Comprehensive human studies comparing ROUGE with human-aligned metrics (e.g., LLM-as-Judge) and analyzing the effectiveness of simple heuristics like response length.

Result: ROUGE has high recall but extremely low precision, causing performance drops of up to 45.9% when using human-aligned metrics. Simple heuristics can match complex detection methods.

Conclusion: Adopting semantically aware and robust evaluation frameworks (e.g., LLM-as-Judge) is crucial for accurately assessing hallucination detection methods and ensuring LLM trustworthiness.

Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates. In fact, several established detection methods show performance drops of up to 45.9% when assessed using human-aligned metrics like LLM-as-Judge. Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware and robust evaluation frameworks is essential to accurately gauge the true performance of hallucination detection methods, ultimately ensuring the trustworthiness of LLM outputs.
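
To make the "simple heuristics" point concrete, here is a toy length-based detector scored with precision and recall against human labels; the threshold and data are illustrative:

```python
import numpy as np

def length_heuristic_flags(responses, threshold_chars=120):
    # Flag long responses as likely hallucinations: the kind of trivial
    # baseline the paper shows can rival complex detectors. The threshold
    # is an illustrative assumption.
    return np.array([len(r) > threshold_chars for r in responses])

def precision_recall(pred, gold):
    tp = np.sum(pred & gold)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gold.sum(), 1)
    return precision, recall

responses = ["Short fact.", "A long, meandering answer " * 10, "Another brief reply."]
gold = np.array([False, True, False])          # human hallucination labels
pred = length_heuristic_flags(responses)
print(precision_recall(pred, gold))            # (1.0, 1.0) on this toy set
```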

[95] BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them

Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein

Main category: cs.CL

TL;DR: BiasGym is a framework for injecting, analyzing, and mitigating biases in LLMs, using BiasInject for bias injection and BiasScope for identification and mitigation.

DetailsMotivation: Understanding and mitigating subtle biases in LLMs is challenging but essential for fairness and interpretability.

Method: BiasGym uses token-based fine-tuning (BiasInject) to inject biases and BiasScope to analyze and mitigate them without degrading model performance.

Result: The framework effectively reduces real-world stereotypes and probes fictional associations, demonstrating its utility for safety and interpretability.

Conclusion: BiasGym provides a cost-effective and generalizable solution for bias analysis and mitigation in LLMs.

Abstract: Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during token-based fine-tuning. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from Italy being reckless drivers') and in probing fictional associations (e.g., people from a fictional country having blue skin’), showing its utility for both safety interventions and interpretability research.

[96] ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun

Main category: cs.CL

TL;DR: ASPD introduces parallel decoding for LLMs, improving speed by up to 3.19x without quality loss.

DetailsMotivation: Addressing the inference latency in LLMs caused by sequential token prediction by leveraging intrinsic parallelism.

Method: Proposes Adaptive Serial-Parallel Decoding (ASPD) with automated parallelizable data extraction and a Hybrid Decoding Engine.

Result: Achieves up to 3.19x speedup on Vicuna Bench with <1% quality difference.

Conclusion: ASPD sets a benchmark for efficient LLM parallel inference, enabling deployment in latency-sensitive applications.

Abstract: The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To empower efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which enables seamless transitions between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, and Mathematical Reasoning demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within 1% difference compared to autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.

[97] Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models

Ting Cai, Stephen Sheen, AnHai Doan

Main category: cs.CL

TL;DR: The paper introduces Columbo, an LLM-based solution for expanding table column abbreviations, outperforming prior methods by 4-29%. It addresses limitations in synthetic data and accuracy measures, using real-world datasets and synonym-aware metrics.

DetailsMotivation: The problem of expanding table column abbreviations is critical for downstream tasks in enterprises, sciences, and government. Prior work uses synthetic data and flawed accuracy measures, limiting practical applicability.

Method: The paper introduces 4 new real-world datasets, proposes synonym-aware accuracy measures, and develops Columbo, an LLM-based solution leveraging context, rules, chain-of-thought reasoning, and token-level analysis.

Result: Columbo outperforms the current state-of-the-art (NameGuess) by 4-29% across 5 datasets and is deployed in production on EDI, a major environmental sciences data portal.

Conclusion: The paper advances the field by addressing key limitations of prior work, introducing robust datasets and metrics, and demonstrating Columbo’s superior performance in real-world applications.

Abstract: Expanding the abbreviated column names of tables, such as “esal” to “employee salary”, is critical for numerous downstream data tasks. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper we make three contributions that significantly advance the state of the art. First, we show that synthetic public data used by prior work has major limitations, and we introduce 4 new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that capture accuracy much more accurately. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive experiments show that Columbo significantly outperforms NameGuess, the current most advanced solution, by 4-29% over 5 datasets. Columbo has been used in production on EDI, a major data portal for environmental sciences.
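
A hedged sketch of what a synonym-aware accuracy measure might look like; the synonym sets and normalization below are illustrative assumptions:

```python
def synonym_aware_accuracy(predictions, gold_synonyms, normalize=str.lower):
    """An expansion counts as correct if it matches ANY listed synonym of the
    gold expansion, rather than one canonical string."""
    hits = 0
    for pred, synonyms in zip(predictions, gold_synonyms):
        if normalize(pred) in {normalize(s) for s in synonyms}:
            hits += 1
    return hits / len(predictions)

preds = ["employee salary", "dept number"]
gold = [
    {"employee salary", "staff salary", "salary of employee"},
    {"department number", "department id"},
]
print(synonym_aware_accuracy(preds, gold))  # 0.5: "dept number" misses the set
```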

[98] PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou

Main category: cs.CL

TL;DR: PRELUDE is a benchmark for evaluating long-context understanding by assessing prequel consistency with original narratives, revealing gaps in AI performance.

DetailsMotivation: To address the need for global comprehension and deep reasoning in long-context tasks, as existing benchmarks fall short.

Method: Uses prequel plausibility tasks requiring evidence from multiple narrative parts; evaluates AI models like LLMs and commercial services.

Result: AI models lag behind humans by >15%, with a 30% gap in reasoning accuracy due to flawed logic.

Conclusion: Highlights significant challenges and room for improvement in long-context understanding and reasoning for AI.

Abstract: We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character’s prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks, as the prequels are not part of the original story, and assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, all lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.

[99] Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

Fares Antaki, David Mikhail, Daniel Milad, Danny A Mammo, Sumit Sharma, Sunil K Srivastava, Bing Yu Chen, Samir Touma, Mertcan Sevgi, Jonathan El-Khoury, Pearse A Keane, Qingyu Chen, Yih Chung Tham, Renaud Duval

Main category: cs.CL

TL;DR: GPT-5-high outperforms other models in accuracy and rationale quality for medical QA tasks, with GPT-5-mini-low offering a cost-efficient alternative.

DetailsMotivation: To determine the optimal configurations of GPT-5 for maximizing accuracy and cost-efficiency in complex medical question-answering tasks.

Method: Evaluated 12 GPT-5 configurations, o1-high, o3-high, and GPT-4o on 260 ophthalmology questions, measuring accuracy, rationale quality, and cost.

Result: GPT-5-high achieved the highest accuracy (0.965) and rationale quality, outperforming other models except o3-high. GPT-5-mini-low was cost-efficient.

Conclusion: GPT-5 configurations can be optimized for performance and cost, with GPT-5-high leading in accuracy and GPT-5-mini-low balancing cost and performance.

Abstract: Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI’s GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). Cost-accuracy analysis identified several GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low offering the most favorable low-cost, high-performance balance. These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of LLM-generated answers against reference standards in ophthalmology.
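
For readers unfamiliar with the head-to-head ranking step, the sketch below fits Bradley-Terry strengths from pairwise win counts with the classic minorization-maximization updates; the win matrix is invented for illustration, not the study's data. The “1.66x stronger” figure in the abstract corresponds to a ratio of such strengths.

```python
import numpy as np

# Toy sketch of Bradley-Terry strength estimation via the classic MM algorithm.
# wins[i, j] = number of head-to-head comparisons model i won against model j.
models = ["gpt5-high", "o3-high", "o1-high", "gpt4o"]
wins = np.array([
    [0, 55, 70, 80],
    [45, 0, 60, 72],
    [30, 40, 0, 65],
    [20, 28, 35, 0],
], dtype=float)

n_games = wins + wins.T          # total comparisons per pair
p = np.ones(len(models))         # initial strengths

for _ in range(200):             # MM (minorize-maximize) fixed-point updates
    for i in range(len(models)):
        denom = sum(n_games[i, j] / (p[i] + p[j])
                    for j in range(len(models)) if j != i)
        p[i] = wins[i].sum() / denom
    p /= p.sum()                 # normalize for identifiability

for name, strength in sorted(zip(models, p), key=lambda t: -t[1]):
    print(f"{name}: {strength:.3f}")
```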

cs.CV

[100] Stochastic-based Patch Filtering for Few-Shot Learning

Javier Rodenas, Eduardo Aguilar, Petia Radeva

Main category: cs.CV

TL;DR: SPFF improves few-shot learning for food images by filtering patch embeddings stochastically to focus on class-relevant features, outperforming existing methods.

DetailsMotivation: Food images' visual complexity and variability cause misclassification in few-shot learning by distracting from key elements.

Method: Proposes SPFF, which stochastically filters patch embeddings to retain those most correlated with class representation, using a similarity matrix for classification.

Result: SPFF effectively focuses on class-specific features and filters irrelevant patches, outperforming state-of-the-art methods on Food-101, VireoFood-172, and UECFood-256.

Conclusion: SPFF addresses the challenges of food image classification in few-shot learning by enhancing focus on relevant features, achieving superior performance.

Abstract: Food images present unique challenges for few-shot learning models due to their visual complexity and variability. For instance, a pasta dish might appear with various garnishes on different plates and in diverse lighting conditions and camera perspectives. This problem leads to losing focus on the most important elements when comparing the query with support images, resulting in misclassification. To address this issue, we propose Stochastic-based Patch Filtering for Few-Shot Learning (SPFF) to attend to the patch embeddings that show greater correlation with the class representation. The key concept of SPFF involves the stochastic filtering of patch embeddings, where patches less similar to the class-aware embedding are more likely to be discarded. With patch embedding filtered according to the probability of appearance, we use a similarity matrix that quantifies the relationship between the query image and its respective support images. Through a qualitative analysis, we demonstrate that SPFF effectively focuses on patches where class-specific food features are most prominent while successfully filtering out non-relevant patches. We validate our approach through extensive experiments on few-shot classification benchmarks: Food-101, VireoFood-172 and UECFood-256, outperforming the existing SoA methods.
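
A minimal sketch of the core filtering idea, under the assumption that the “probability of appearance” is a softmax over patch-to-class similarities; the paper's exact parameterization may differ.

```python
import torch

def stochastic_patch_filter(patches, class_embedding, keep_ratio=0.5):
    """Sketch of SPFF's core idea (details are assumptions): patches less
    similar to the class-aware embedding are more likely to be discarded.
    patches: (num_patches, dim); class_embedding: (dim,)"""
    sims = torch.nn.functional.cosine_similarity(
        patches, class_embedding[None, :], dim=1)
    probs = torch.softmax(sims, dim=0)                      # probability of appearance
    k = max(1, int(keep_ratio * patches.shape[0]))
    kept = torch.multinomial(probs, k, replacement=False)   # stochastic filtering
    return patches[kept], kept

patches = torch.randn(49, 64)        # e.g. a 7x7 grid of patch embeddings
class_emb = torch.randn(64)
kept_patches, idx = stochastic_patch_filter(patches, class_emb)
print(kept_patches.shape)            # torch.Size([24, 64])
```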

[101] DINOv3

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski

Main category: cs.CV

TL;DR: DINOv3 introduces a self-supervised vision foundation model that scales data and model size, uses Gram anchoring to improve dense features, and achieves state-of-the-art performance without fine-tuning.

DetailsMotivation: To eliminate manual data annotation and enable scalable, versatile visual representation learning across diverse domains.

Method: Scales data and model size, introduces Gram anchoring for dense feature stability, and applies post-hoc strategies for flexibility.

Result: Outperforms specialized state-of-the-art models across various vision tasks without fine-tuning.

Conclusion: DINOv3 advances self-supervised learning with scalable, high-quality dense features for diverse vision tasks.

Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images – using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models’ flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
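
The report introduces Gram anchoring at a high level; a hedged reading is that it penalizes drift of the patch-feature Gram matrix from an earlier “anchor” model, as in this toy sketch (the exact formulation may differ).

```python
import torch

def gram_anchoring_loss(student_patches, anchor_patches):
    """Hedged sketch of Gram anchoring: penalize drift of pairwise patch
    similarities (the Gram matrix) from an earlier "anchor" checkpoint, the
    kind of constraint DINOv3 uses to keep dense features from degrading
    during long training. Shapes: (batch, num_patches, dim)."""
    s = torch.nn.functional.normalize(student_patches, dim=-1)
    a = torch.nn.functional.normalize(anchor_patches, dim=-1)
    gram_s = s @ s.transpose(1, 2)   # (batch, P, P) patch-to-patch similarities
    gram_a = a @ a.transpose(1, 2)
    return ((gram_s - gram_a) ** 2).mean()

student = torch.randn(2, 196, 768, requires_grad=True)
anchor = torch.randn(2, 196, 768)    # frozen early-checkpoint features
loss = gram_anchoring_loss(student, anchor)
loss.backward()
print(loss.item())
```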

[102] Empowering Morphing Attack Detection using Interpretable Image-Text Foundation Model

Sushrut Patwardhan, Raghavendra Ramachandra, Sushma Venkatesh

Main category: cs.CV

TL;DR: A multimodal learning approach for morphing attack detection in face recognition systems, leveraging CLIP for zero-shot evaluation and textual description prediction.

DetailsMotivation: Ensuring reliable face verification by detecting morphing attacks, which are a growing threat in biometric systems.

Method: Uses Contrastive Language-Image Pretraining (CLIP) for zero-shot evaluation, with ten engineered textual prompts for human-understandable descriptions.

Result: Demonstrates generalizable morphing attack detection and predicts relevant text snippets, evaluated on a dataset with five morphing techniques across three mediums.

Conclusion: The proposed framework effectively detects morphing attacks and provides interpretable textual descriptions, enhancing face recognition system reliability.

Abstract: Morphing attack detection has become an essential component of face recognition systems for ensuring a reliable verification scenario. In this paper, we present a multimodal learning approach that can provide a textual description of morphing attack detection. We first show that zero-shot evaluation of the proposed framework using Contrastive Language-Image Pretraining (CLIP) can not only yield generalizable morphing attack detection but also predict the most relevant text snippet. We present an extensive analysis of ten different textual prompts that include both short and long textual prompts. These prompts are engineered to be human-understandable textual snippets. Extensive experiments were performed on a face morphing dataset that was developed using a publicly available face biometric dataset. We present an evaluation of SOTA pre-trained neural networks together with the proposed framework in the zero-shot evaluation of five different morphing generation techniques that are captured in three different mediums.
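
The zero-shot protocol follows the standard CLIP recipe; below is a minimal sketch with illustrative prompts rather than the paper's ten engineered ones, and a hypothetical probe image path.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Hedged sketch of the zero-shot setup: score a face image against engineered
# textual prompts; the prompts below are illustrative, not the paper's.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a photo of a genuine, unaltered face",
    "a photo of a morphed face combining two identities",
]
image = Image.open("probe_face.png")   # hypothetical probe image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

# The highest-probability prompt doubles as the human-readable description.
for prompt, p in zip(prompts, probs):
    print(f"{p:.3f}  {prompt}")
```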

[103] Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs

Kaixin Peng, Mengyang Zhao, Haiyang Yu, Teng Fu, Bin Li

Main category: cs.CV

TL;DR: The paper proposes an interpretable method for deciphering Oracle Bone Script (OBS) using Large Vision-Language Models, combining radical and pictographic analysis to improve generalization and zero-shot performance.

DetailsMotivation: Current deep learning methods for OBS decipherment lack interpretability and struggle with zero-shot settings and undeciphered OBS.

Method: A progressive training strategy and Radical-Pictographic Dual Matching mechanism are introduced, leveraging radical and pictographic analysis.

Result: The method achieves state-of-the-art Top-10 accuracy and superior zero-shot performance, with logical analysis outputs.

Conclusion: The approach offers archaeologically valuable insights and has potential applications in digital humanities and historical research.

Abstract: As the oldest mature writing system, Oracle Bone Script (OBS) has long posed significant challenges for archaeological decipherment due to its rarity, abstractness, and pictographic diversity. Current deep learning-based methods have made exciting progress on the OBS decipherment task, but existing approaches often ignore the intricate connections between glyphs and the semantics of OBS. This results in limited generalization and interpretability, especially when addressing zero-shot settings and undeciphered OBS. To this end, we propose an interpretable OBS decipherment method based on Large Vision-Language Models, which synergistically combines radical analysis and pictograph-semantic understanding to bridge the gap between glyphs and meanings of OBS. Specifically, we propose a progressive training strategy that guides the model from radical recognition and analysis to pictographic analysis and mutual analysis, thus enabling reasoning from glyph to meaning. We also design a Radical-Pictographic Dual Matching mechanism informed by the analysis results, significantly enhancing the model’s zero-shot decipherment performance. To facilitate model training, we propose the Pictographic Decipherment OBS Dataset, which comprises 47,157 Chinese characters annotated with OBS images and pictographic analysis texts. Experimental results on public benchmarks demonstrate that our approach achieves state-of-the-art Top-10 accuracy and superior zero-shot decipherment capabilities. More importantly, our model delivers logical analysis processes, possibly providing archaeologically valuable reference results for undeciphered OBS, and thus has potential applications in digital humanities and historical research. The dataset and code will be released at https://github.com/PKXX1943/PD-OBS.

[104] Deep Learning Enables Large-Scale Shape and Appearance Modeling in Total-Body DXA Imaging

Arianna Bunnell, Devon Cataldi, Yannik Glaser, Thomas K. Wolfgruber, Steven Heymsfield, Alan B. Zonderman, Thomas L. Kelly, Peter Sadowski, John A. Shepherd

Main category: cs.CV

TL;DR: A deep learning method for automatic fiducial point placement on TBDXA scans is developed and validated, achieving high accuracy and demonstrating value for shape and appearance modeling in health research.

DetailsMotivation: To automate and improve the accuracy of fiducial point placement on TBDXA scans for body composition assessment and health research.

Method: A deep learning approach is trained on 1,683 manually-annotated TBDXA scans and validated on an external dataset. The method is then applied to 35,928 scans for shape and appearance modeling (SAM).

Result: The method achieves 99.5% accuracy in keypoint placement. SAM features correlate with health markers, supporting existing evidence and generating new hypotheses.

Conclusion: The developed method is highly accurate and useful for health research, with resources made publicly available.

Abstract: Total-body dual X-ray absorptiometry (TBDXA) imaging is a relatively low-cost whole-body imaging modality, widely used for body composition assessment. We develop and validate a deep learning method for automatic fiducial point placement on TBDXA scans using 1,683 manually-annotated TBDXA scans. The method achieves 99.5% percentage correct keypoints in an external testing dataset. To demonstrate the value for shape and appearance modeling (SAM), our method is used to place keypoints on 35,928 scans for five different TBDXA imaging modes, then associations with health markers are tested in two cohorts not used for SAM model generation using two-sample Kolmogorov-Smirnov tests. SAM feature distributions associated with health biomarkers are shown to corroborate existing evidence and generate new hypotheses on body composition and shape’s relationship to various frailty, metabolic, inflammation, and cardiometabolic health markers. Evaluation scripts, model weights, automatic point file generation code, and triangulation files are available at https://github.com/hawaii-ai/dxa-pointplacement.
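
The headline metric, percentage correct keypoints, admits a compact implementation; the pixel threshold below is an illustrative choice, not the paper's exact criterion.

```python
import numpy as np

def pck(pred, gt, threshold_px=10.0):
    """Percentage of correct keypoints: a prediction counts as correct when it
    lands within a pixel threshold of the annotation. The threshold here is an
    illustrative assumption. pred, gt: (num_points, 2) arrays of (x, y)."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return float((dists <= threshold_px).mean())

gt = np.array([[120.0, 80.0], [240.0, 310.0], [400.0, 95.0]])
pred = gt + np.array([[3.0, -2.0], [1.0, 4.0], [25.0, 0.0]])  # last point is off
print(pck(pred, gt))  # 0.666...
```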

[105] MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding

Chang Liu, Ye Pan, Chenyang Ding, Susanto Rahardja, Xiaokang Yang

Main category: cs.CV

TL;DR: MEDTalk is a framework for dynamic emotional 3D facial animation, disentangling content and emotion for realistic expressions, integrating multimodal inputs for control, and enabling industrial integration.

DetailsMotivation: Existing approaches for audio-driven facial animation rely on static emotion labels, limiting diversity and naturalness. MEDTalk addresses this by enabling fine-grained, dynamic emotional control.

Method: MEDTalk disentangles content and emotion embeddings via cross-reconstruction, integrates audio and text for dynamic expression adjustment, and uses multimodal inputs (text, images) for personalized control.

Result: The framework generates realistic emotional expressions, synchronizes lip movements, and allows seamless integration into industrial pipelines like MetaHuman.

Conclusion: MEDTalk advances emotional facial animation by enabling dynamic, fine-grained control and multimodal input integration, enhancing realism and usability in production.

Abstract: Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs-including text descriptions and reference expression images-to guide the generation of user-specified facial expressions. With MetaHuman as the priority, our generated results can be conveniently integrated into the industrial production pipeline. The code is available at: https://github.com/SJTU-Lucy/MEDTalk.
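
The cross-reconstruction disentanglement can be pictured with a toy sketch: swap the emotion code between two clips and require the decoder to still reconstruct the target (assuming suitably paired clips). The encoders and decoder here are stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hedged sketch of disentanglement by cross-reconstruction. In real training,
# x_a and x_b would be paired motion clips sharing content (or emotion) so
# that swapping codes is supervised; everything below is a toy stand-in.
dim = 32
content_enc = nn.Linear(dim, 16)
emotion_enc = nn.Linear(dim, 16)
decoder = nn.Linear(32, dim)

def reconstruct(content_src, emotion_src):
    """Decode from one clip's content code and another clip's emotion code."""
    z = torch.cat([content_enc(content_src), emotion_enc(emotion_src)], dim=-1)
    return decoder(z)

x_a, x_b = torch.randn(4, dim), torch.randn(4, dim)
loss = (
    nn.functional.mse_loss(reconstruct(x_a, x_a), x_a)    # self-reconstruction
    + nn.functional.mse_loss(reconstruct(x_a, x_b), x_a)  # emotion code swapped in
)
loss.backward()
print(loss.item())
```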

[106] MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu

Main category: cs.CV

TL;DR: The paper introduces MANGO, a Multimodal Attention-based Normalizing Flow method, to improve explicit and interpretable multimodal fusion learning using novel cross-attention mechanisms.

DetailsMotivation: Current multimodal fusion methods rely on implicit learning via Transformers, failing to capture essential features and complex correlations in multimodal data.

Method: Proposes an Invertible Cross-Attention (ICA) layer and three new cross-attention mechanisms (MMCA, IMCA, LICA) for Normalizing Flow-based multimodal fusion.

Result: Achieves state-of-the-art performance on semantic segmentation, image-to-image translation, and movie genre classification tasks.

Conclusion: MANGO provides a scalable, interpretable, and effective solution for multimodal learning, outperforming existing methods.

Abstract: Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach (the source code of this work will be publicly available) to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.
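
The abstract does not detail how cross-attention is made invertible; one plausible construction, offered purely as an assumption, is an affine coupling layer whose scale and shift are predicted by attending from one half of the channels to the other modality.

```python
import torch
import torch.nn as nn

class InvertibleCrossAttentionCoupling(nn.Module):
    """One plausible construction of an invertible cross-attention layer
    (an assumption, not MANGO's exact design): an affine coupling where the
    scale/shift for half of the features is computed by attending from the
    other half to a second modality, keeping the transform exactly invertible."""
    def __init__(self, dim, context_dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim // 2, heads, kdim=context_dim,
                                          vdim=context_dim, batch_first=True)
        self.to_scale_shift = nn.Linear(dim // 2, dim)  # -> (log_scale, shift)

    def forward(self, x, context):
        x1, x2 = x.chunk(2, dim=-1)                  # split channels
        h, _ = self.attn(x1, context, context)       # x1 attends to other modality
        log_s, t = self.to_scale_shift(h).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                    # keep scales well-behaved
        y2 = x2 * torch.exp(log_s) + t               # affine-transform x2 only
        return torch.cat([x1, y2], dim=-1), log_s.sum()  # log-det for the flow

    def inverse(self, y, context):
        y1, y2 = y.chunk(2, dim=-1)
        h, _ = self.attn(y1, context, context)       # y1 == x1, so same scales
        log_s, t = self.to_scale_shift(h).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)            # exact inverse
        return torch.cat([y1, x2], dim=-1)

layer = InvertibleCrossAttentionCoupling(dim=64, context_dim=48)
x, ctx = torch.randn(2, 10, 64), torch.randn(2, 7, 48)
y, logdet = layer(x, ctx)
print(torch.allclose(layer.inverse(y, ctx), x, atol=1e-5))  # True
```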

[107] UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving

Yuping Wang, Xiangyu Huang, Xiaokang Sun, Mingxuan Yan, Shuo Xing, Zhengzhong Tu, Jiachen Li

Main category: cs.CV

TL;DR: UniOcc is a unified benchmark and toolkit for occupancy forecasting and prediction, integrating data from real-world datasets and simulators with innovative per-voxel flows and label-free evaluation metrics.

DetailsMotivation: To address the limitations of existing studies, which rely on pseudo labels and lack unified evaluation, UniOcc aims to provide a robust, comprehensive framework for occupancy tasks.

Method: UniOcc unifies data from multiple sources (nuScenes, Waymo, CARLA, OpenCOOD), provides 2D/3D occupancy labels, annotates per-voxel flows, and introduces label-free evaluation metrics.

Result: Experiments show that diverse training data and explicit flow information significantly improve occupancy prediction and forecasting performance.

Conclusion: UniOcc offers a scalable, unified solution for occupancy tasks, with publicly available data and code.

Abstract: We introduce UniOcc, a comprehensive, unified benchmark and toolkit for occupancy forecasting (i.e., predicting future occupancies based on historical information) and occupancy prediction (i.e., predicting current-frame occupancy from camera images). UniOcc unifies the data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), providing 2D/3D occupancy labels and annotating innovative per-voxel flows. Unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel evaluation metrics that do not depend on ground-truth labels, enabling robust assessment on additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance. Our data and code are available at https://uniocc.github.io/.

[108] MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning

Quang-Trung Truong, Yuk-Kwan Wong, Vo Hoang Kim Tuyen Dang, Rinaldi Gotama, Duc Thanh Nguyen, Sai-Kit Yeung

Main category: cs.CV

TL;DR: A two-stage marine object-oriented video captioning pipeline is proposed to address challenges in marine video understanding, leveraging video, text, and segmentation masks for improved analysis and generation.

DetailsMotivation: Existing video captioning datasets fail to generalize to marine environments, lacking insights into marine life due to dynamic objects, camera motion, and underwater complexity.

Method: A two-stage pipeline using video, text, and segmentation masks, with video splitting to detect salient object transitions for richer captioning.

Result: Improved marine video understanding, analysis, and generation, with a released dataset and code.

Conclusion: The proposed pipeline and benchmark enhance marine video captioning, addressing domain-specific challenges effectively.

Abstract: Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and gain insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages the triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which significantly enrich the semantics of captioning content. Our dataset and code have been released at https://msc.hkustvgd.com.

[109] Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model

Nitin Rai, Nathan S. Boyd, Gary E. Vallad, Arnold W. Schumann

Main category: cs.CV

TL;DR: The study explores combining real and synthetic images to improve watermelon disease classification using EfficientNetV2-L, showing hybrid datasets enhance performance.

DetailsMotivation: To evaluate if integrating real and synthetic images boosts disease classification accuracy, reducing reliance on resource-heavy real data collection.

Method: Divided training data into five treatments (real-only, synthetic-only, and hybrid ratios) and trained using EfficientNetV2-L with fine-tuning.

Result: Hybrid datasets (H2-H4) outperformed real-only (H0), with H3-H4 achieving perfect weighted F1-scores (1.00).

Conclusion: Synthetic images alone aren’t enough; a hybrid approach maximizes performance for crop disease classification.

Abstract: The current advancements in generative artificial intelligence (GenAI) models have paved the way for new possibilities for generating high-resolution synthetic images, thereby offering a promising alternative to traditional image acquisition for training computer vision models in agriculture. In the context of crop disease diagnosis, GenAI models are being used to create synthetic images of various diseases, potentially facilitating model creation and reducing the dependency on resource-intensive in-field data collection. However, limited research has been conducted on evaluating the effectiveness of integrating real with synthetic images to improve disease classification performance. Therefore, this study aims to investigate whether combining a limited number of real images with synthetic images can enhance the prediction accuracy of an EfficientNetV2-L model for classifying watermelon (Citrullus lanatus) diseases. The training dataset was divided into five treatments: H0 (only real images), H1 (only synthetic images), H2 (1:1 real-to-synthetic), H3 (1:10 real-to-synthetic), and H4 (H3 + random images to improve variability and model generalization). All treatments were trained using a custom EfficientNetV2-L architecture with enhanced fine-tuning and transfer learning techniques. Models trained on H2, H3, and H4 treatments demonstrated high precision, recall, and F1-score metrics. Additionally, the weighted F1-score increased from 0.65 (on H0) to 1.00 (on H3-H4), signifying that the addition of a small number of real images with a considerable volume of synthetic images improved model performance and generalizability. Overall, this validates the finding that synthetic images alone cannot adequately substitute for real images; instead, both must be used in a hybrid manner to maximize model performance for crop disease classification.

[110] SynSpill: Improved Industrial Spill Detection With Synthetic Data

Aaditya Baranwal, Abdul Mueez, Jason Voelker, Guneet Bhatia, Shruti Vyas

Main category: cs.CV

TL;DR: The paper introduces a synthetic data generation framework to improve Vision-Language Models (VLMs) and object detectors in niche, safety-critical domains like industrial spill detection, where real data is scarce.

DetailsMotivation: Performance of VLMs degrades in niche domains due to rare, sensitive, and hard-to-annotate data. Privacy and infrequency of incidents make conventional fine-tuning impractical.

Method: A scalable framework with a synthetic data generation pipeline (SynSpill dataset) is proposed for Parameter-Efficient Fine-Tuning (PEFT) of VLMs and detectors like YOLO and DETR.

Result: Synthetic data boosts performance of VLMs and detectors, making them comparable. VLMs generalize better without synthetic data, but both improve significantly with it.

Conclusion: High-fidelity synthetic data bridges the domain gap in safety-critical applications, offering a cost-effective, scalable solution for industrial environments with scarce real data.

Abstract: Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. However, their performance degrades significantly in niche, safety-critical domains such as industrial spill detection, where hazardous events are rare, sensitive, and difficult to annotate. This scarcity – driven by privacy concerns, data sensitivity, and the infrequency of real incidents – renders conventional fine-tuning of detectors infeasible for most industrial settings. We address this challenge by introducing a scalable framework centered on a high-quality synthetic data generation pipeline. We demonstrate that this synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially boosts the performance of state-of-the-art object detectors such as YOLO and DETR. Notably, in the absence of synthetic data (SynSpill dataset), VLMs still generalize better to unseen spill scenarios than these detectors. When SynSpill is used, both VLMs and detectors achieve marked improvements, with their performance becoming comparable. Our results underscore that high-fidelity synthetic data is a powerful means to bridge the domain gap in safety-critical applications. The combination of synthetic generation and lightweight adaptation offers a cost-effective, scalable pathway for deploying vision systems in industrial environments where real data is scarce/impractical to obtain. Project Page: https://synspill.vercel.app

[111] FIND-Net – Fourier-Integrated Network with Dictionary Kernels for Metal Artifact Reduction

Farid Tasharofi, Fuxin Fan, Melika Qahqaie, Mareike Thies, Andreas Maier

Main category: cs.CV

TL;DR: FIND-Net, a novel MAR framework, integrates frequency and spatial domain processing to reduce metal artifacts in CT scans while preserving structural details, outperforming existing methods.

DetailsMotivation: Metal artifacts in CT imaging degrade quality, complicating diagnosis. Current deep learning methods struggle to balance artifact suppression and detail preservation.

Method: FIND-Net uses Fast Fourier Convolution layers and trainable Gaussian filtering, treating MAR as a hybrid task in spatial and frequency domains.

Result: FIND-Net shows significant improvements: 3.07% MAE reduction, 0.18% SSIM increase, and 0.90% PSNR improvement over state-of-the-art methods.

Conclusion: FIND-Net advances MAR performance with superior structural preservation and clinical applicability, validated on synthetic and real-world CT scans.

Abstract: Metal artifacts, caused by high-density metallic implants in computed tomography (CT) imaging, severely degrade image quality, complicating diagnosis and treatment planning. While existing deep learning algorithms have achieved notable success in Metal Artifact Reduction (MAR), they often struggle to suppress artifacts while preserving structural details. To address this challenge, we propose FIND-Net (Fourier-Integrated Network with Dictionary Kernels), a novel MAR framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation. FIND-Net incorporates Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both spatial and frequency domains. This approach enhances global contextual understanding and frequency selectivity, effectively reducing artifacts while maintaining anatomical structures. Experiments on synthetic datasets show that FIND-Net achieves statistically significant improvements over state-of-the-art MAR methods, with a 3.07% MAE reduction, 0.18% SSIM increase, and 0.90% PSNR improvement, confirming robustness across varying artifact complexities. Furthermore, evaluations on real-world clinical CT scans confirm FIND-Net’s ability to minimize modifications to clean anatomical regions while effectively suppressing metal-induced distortions. These findings highlight FIND-Net’s potential for advancing MAR performance, offering superior structural preservation and improved clinical applicability. Code is available at https://github.com/Farid-Tasharofi/FIND-Net
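
For intuition about the frequency-domain path, the sketch below implements the spectral branch of a Fast Fourier Convolution in the generic form popularized by prior FFC work; FIND-Net's exact layer (including the trainable Gaussian filtering) will differ.

```python
import torch
import torch.nn as nn

class SpectralConv(nn.Module):
    """Hedged sketch of the spectral path of a Fast Fourier Convolution (FFC)
    layer of the kind FIND-Net builds on: transform features to the frequency
    domain, mix channels there (giving a global receptive field), transform back."""
    def __init__(self, channels):
        super().__init__()
        # Real and imaginary parts are stacked along channels for a 1x1 conv.
        self.mix = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")            # (b, c, h, w//2+1)
        f = torch.cat([freq.real, freq.imag], dim=1)       # (b, 2c, h, w//2+1)
        f = torch.relu(self.mix(f))                        # channel mixing in frequency
        real, imag = f.chunk(2, dim=1)
        freq = torch.complex(real, imag)
        return torch.fft.irfft2(freq, s=(h, w), norm="ortho")

x = torch.randn(1, 8, 32, 32)
print(SpectralConv(8)(x).shape)   # torch.Size([1, 8, 32, 32])
```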

[112] EntropyGS: An Efficient Entropy Coding on 3D Gaussian Splatting

Yuning Huang, Jiahao Pang, Fengqing Zhu, Dong Tian

Main category: cs.CV

TL;DR: The paper analyzes the attribute distributions of 3D Gaussian Splatting (3DGS) and proposes EntropyGS, an entropy-coding method that achieves about 30x rate reduction while maintaining rendering quality.

DetailsMotivation: 3DGS Gaussians must be stored and transmitted, making compression necessary; statistical analysis of their attributes reveals exploitable distribution patterns.

Method: Analyze Gaussian attributes, propose EntropyGS for factorized entropy coding with adaptive quantization.

Result: 30x rate reduction on benchmarks with similar rendering quality and fast encoding/decoding.

Conclusion: EntropyGS efficiently compresses 3DGS data while preserving visual quality.

Abstract: As an emerging novel view synthesis approach, 3D Gaussian Splatting (3DGS) demonstrates fast training/rendering with superior visual quality. The two tasks of 3DGS, Gaussian creation and view rendering, are typically separated over time or devices, and thus storage/transmission and finally compression of 3DGS Gaussians become necessary. We begin with a correlation and statistical analysis of 3DGS Gaussian attributes. An inspiring finding in this work reveals that spherical harmonic AC attributes precisely follow Laplace distributions, while mixtures of Gaussian distributions can approximate rotation, scaling, and opacity. Additionally, harmonic AC attributes manifest weak correlations with other attributes except for inherited correlations from a color space. A factorized and parameterized entropy coding method, EntropyGS, is hereinafter proposed. During encoding, distribution parameters of each Gaussian attribute are estimated to assist their entropy coding. The quantization for entropy coding is adaptively performed according to Gaussian attribute types. EntropyGS demonstrates about 30x rate reduction on benchmark datasets while maintaining similar rendering quality compared to input 3DGS data, with a fast encoding and decoding time.
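
The finding that SH AC attributes follow a Laplace distribution directly yields a rate estimate: fit the Laplace parameters, quantize, and charge each bin -log2 of its probability mass. The quantization step and data below are illustrative, not the paper's.

```python
import numpy as np

# Hedged sketch of the paper's premise: if spherical-harmonic AC coefficients
# follow a Laplace distribution, a factorized entropy coder can assign each
# quantized value -log2 P(bin) bits under the fitted model.
rng = np.random.default_rng(0)
coeffs = rng.laplace(loc=0.0, scale=0.1, size=100_000)  # stand-in SH AC values

mu = np.median(coeffs)                      # Laplace MLE: median ...
b = np.abs(coeffs - mu).mean()              # ... and mean absolute deviation
step = 0.02                                 # quantization step (assumption)
q = np.round((coeffs - mu) / step)

def laplace_cdf(x):
    """CDF of a zero-centered Laplace with scale b."""
    return np.where(x < 0, 0.5 * np.exp(x / b), 1 - 0.5 * np.exp(-x / b))

# Probability mass of each quantization bin under the fitted model.
centers = q * step
p_bin = laplace_cdf(centers + step / 2) - laplace_cdf(centers - step / 2)
bits = -np.log2(np.maximum(p_bin, 1e-12))
print(f"estimated rate: {bits.mean():.2f} bits per coefficient")
```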

[113] CellSymphony: Deciphering the molecular and phenotypic orchestration of cells with single-cell pathomics

Paul H. Acosta, Pingjun Chen, Simon P. Castillo, Maria Esther Salvatierra, Yinyin Yuan, Xiaoxi Pan

Main category: cs.CV

TL;DR: CellSymphony integrates Xenium spatial transcriptomics and histology images using foundation models for accurate cell type annotation and microenvironment analysis.

DetailsMotivation: To address the challenge of robustly extracting cell-level features from histology images and integrating them with spatial transcriptomics data.

Method: Uses a multimodal framework (CellSymphony) to fuse spatial gene expression and morphological context via foundation model-derived embeddings.

Result: Achieves accurate cell type annotation and identifies distinct microenvironmental niches in three cancer types.

Conclusion: Demonstrates the potential of foundation models and multimodal fusion for understanding tissue ecosystems.

Abstract: Xenium, a new spatial transcriptomics platform, enables subcellular-resolution profiling of complex tumor tissues. Despite the rich morphological information in histology images, extracting robust cell-level features and integrating them with spatial transcriptomics data remains a critical challenge. We introduce CellSymphony, a flexible multimodal framework that leverages foundation model-derived embeddings from both Xenium transcriptomic profiles and histology images at true single-cell resolution. By learning joint representations that fuse spatial gene expression with morphological context, CellSymphony achieves accurate cell type annotation and uncovers distinct microenvironmental niches across three cancer types. This work highlights the potential of foundation models and multimodal fusion for deciphering the physiological and phenotypic orchestration of cells within complex tissue ecosystems.

[114] Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets

Xinan Zhang, Haolin Wang, Yung-An Hsieh, Zhongyu Yang, Anthony Yezzi, Yi-Chang Tsai

Main category: cs.CV

TL;DR: A review of deep learning trends in crack detection, highlighting shifts in learning paradigms, generalizability, and dataset diversity, along with a new 3D dataset and benchmarking.

DetailsMotivation: To analyze emerging trends in deep learning for crack detection and introduce a new dataset (3DCrack) to support future research.

Method: Systematic review of trends and benchmarking experiments with deep learning methodologies, including foundation models.

Result: Insights into evolving methodologies and future directions, with baselines established for crack detection using the new dataset.

Conclusion: The review and dataset aim to advance deep learning-based crack detection, highlighting key trends and future research opportunities.

Abstract: Crack detection plays a crucial role in civil infrastructures, including inspection of pavements, buildings, etc., and deep learning has significantly advanced this field in recent years. While numerous technical and review papers exist in this domain, emerging trends are reshaping the landscape. These shifts include transitions in learning paradigms (from fully supervised learning to semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation and fine-tuning foundation models), improvements in generalizability (from single-dataset performance to cross-dataset evaluation), and diversification in dataset reacquisition (from RGB images to specialized sensor-based data). In this review, we systematically analyze these trends and highlight representative works. Additionally, we introduce a new dataset collected with 3D laser scans, 3DCrack, to support future research and conduct extensive benchmarking experiments to establish baselines for commonly used deep learning methodologies, including recent foundation models. Our findings provide insights into the evolving methodologies and future directions in deep learning-based crack detection. Project page: https://github.com/nantonzhang/Awesome-Crack-Detection

[115] MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs

Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai

Main category: cs.CV

TL;DR: MRFD is a training-free method to reduce hallucinations in LVLMs by fusing multi-region predictions using JSD-based weights.

DetailsMotivation: LVLMs often produce hallucinations due to limited ability to verify visual information across image regions.

Method: MRFD uses cross-attention to identify salient regions, generates per-region responses, and fuses them using JSD-based weights for consistency.

Result: MRFD reduces hallucinations and improves factuality without model updates.

Conclusion: MRFD effectively enhances factual grounding in LVLMs by leveraging inter-region consistency.

Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations – text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
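
The JSD-based weighting admits a small numerical sketch: regions whose token distributions agree with the others receive larger fusion weights, down-weighting outlier (hallucination-prone) regions. The temperature and toy distributions are assumptions, not MRFD's exact settings.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy token distributions predicted from three salient regions.
region_probs = np.array([
    [0.70, 0.20, 0.10],   # region A
    [0.65, 0.25, 0.10],   # region B (consistent with A)
    [0.10, 0.10, 0.80],   # region C (outlier, likely unreliable)
])

n = len(region_probs)
avg_div = np.array([
    np.mean([jsd(region_probs[i], region_probs[j]) for j in range(n) if j != i])
    for i in range(n)
])
weights = np.exp(-avg_div * 10.0)            # temperature is an assumption
weights /= weights.sum()
fused = weights @ region_probs               # consistency-aware fusion
print("weights:", weights.round(3), "fused:", fused.round(3))
```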

[116] Leveraging Motion Estimation for Efficient Bayer-Domain Computer Vision

Haichao Wang, Xinyue Xi, Jiangtao Wen, Yuxing Han

Main category: cs.CV

TL;DR: A novel framework eliminates the ISP and uses motion estimation for efficient video vision tasks directly in the Bayer domain, reducing FLOPs by over 70% with minimal accuracy loss.

DetailsMotivation: Traditional ISP and VCN pipelines are computationally expensive and power-hungry, prompting the need for a more efficient approach.

Method: Introduces MEVC, integrating motion estimation into convolutional layers to reduce redundancy across frames, enabling raw Bayer input processing.

Result: Achieves over 70% reduction in FLOPs with minimal accuracy degradation across multiple video vision tasks.

Conclusion: The framework generalizes across convolution-based models and is the first to effectively reuse motion estimation for raw sensor data processing.

Abstract: Existing computer vision processing pipeline acquires visual information using an image sensor that captures pixel information in the Bayer pattern. The raw sensor data are then processed using an image signal processor (ISP) that first converts Bayer pixel data to RGB on a pixel by pixel basis, followed by video convolutional network (VCN) processing on a frame by frame basis. Both ISP and VCN are computationally expensive with high power consumption and latency. In this paper, we propose a novel framework that eliminates the ISP and leverages motion estimation to accelerate video vision tasks directly in the Bayer domain. We introduce Motion Estimation-based Video Convolution (MEVC), which integrates sliding-window motion estimation into each convolutional layer, enabling prediction and residual-based refinement that reduces redundant computations across frames. This design bridges the structural gap between block-based motion estimation and spatial convolution, enabling accurate, low-cost processing. Our end-to-end pipeline supports raw Bayer input and achieves over 70% reduction in FLOPs with minimal accuracy degradation across video semantic segmentation, depth estimation, and object detection benchmarks, using both synthetic Bayer-converted and real Bayer video datasets. This framework generalizes across convolution-based models and marks the first effective reuse of motion estimation for accelerating video computer vision directly from raw sensor data.
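
MEVC builds on classic block matching; the sketch below shows the sliding-window motion search it integrates. How the paper predicts convolutional features and refines residuals from these vectors is not reproduced here.

```python
import numpy as np

def block_motion_estimate(prev, curr, block=8, search=4):
    """Sketch of the sliding-window block matching MEVC builds on: for each
    block of the current frame, find the lowest-SAD block in the previous
    frame within a small search window. (Feature prediction and residual
    refinement in the paper reuse these vectors; details are assumptions.)"""
    H, W = curr.shape
    mvs = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(0, H - block + 1, block):
        for bx in range(0, W - block + 1, block):
            cur_blk = curr[by:by+block, bx:bx+block]
            best, best_mv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= H - block and 0 <= x <= W - block:
                        sad = np.abs(prev[y:y+block, x:x+block] - cur_blk).sum()
                        if sad < best:
                            best, best_mv = sad, (dy, dx)
            mvs[by // block, bx // block] = best_mv
    return mvs

prev = np.random.default_rng(0).random((32, 32))
curr = np.roll(prev, shift=(2, -1), axis=(0, 1))   # curr(y,x) = prev(y-2, x+1)
print(block_motion_estimate(prev, curr)[1, 1])     # [-2  1] away from borders
```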

[117] Pose-Robust Calibration Strategy for Point-of-Gaze Estimation on Mobile Phones

Yujie Zhao, Jiabei Zeng, Shiguang Shan

Main category: cs.CV

TL;DR: The paper addresses the challenge of generalizing appearance-based point-of-gaze (PoG) estimation across individuals by proposing a dynamic calibration strategy to improve robustness against head pose variations.

DetailsMotivation: Person-specific calibration is needed for accurate PoG estimation, but current methods are sensitive to head pose changes.

Method: The authors create a benchmark (MobilePoG) and analyze how calibration point diversity and head pose variations affect accuracy. They propose a dynamic calibration strategy where users move their phones during calibration.

Result: Experiments show that including diverse head poses during calibration improves estimator robustness. The proposed dynamic strategy outperforms conventional methods.

Conclusion: The dynamic calibration strategy enhances PoG estimation accuracy by reducing sensitivity to head pose variations, offering a user-friendly solution.

Abstract: Although appearance-based point-of-gaze (PoG) estimation has improved, the estimators still struggle to generalize across individuals due to personal differences. Therefore, person-specific calibration is required for accurate PoG estimation. However, calibrated PoG estimators are often sensitive to head pose variations. To address this, we investigate the key factors influencing calibrated estimators and explore pose-robust calibration strategies. Specifically, we first construct a benchmark, MobilePoG, which includes facial images from 32 individuals focusing on designated points under either fixed or continuously changing head poses. Using this benchmark, we systematically analyze how the diversity of calibration points and head poses influences estimation accuracy. Our experiments show that introducing a wider range of head poses during calibration improves the estimator’s ability to handle pose variation. Building on this insight, we propose a dynamic calibration strategy in which users fixate on calibration points while moving their phones. This strategy naturally introduces head pose variation during a user-friendly and efficient calibration process, ultimately producing a better calibrated PoG estimator that is less sensitive to head pose variations than those using conventional calibration strategies. Codes and datasets are available at our project page.

[118] High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance

Danyi Gao

Main category: cs.CV

TL;DR: A method for high-fidelity text-driven image generation improves semantic alignment and structural consistency using contrastive learning and structural guidance, outperforming existing methods.

DetailsMotivation: Existing text-driven image generation methods face bottlenecks in semantic alignment accuracy and structural consistency, limiting their performance.

Method: Integrates text-image contrastive constraints with structural guidance mechanisms, optimizing contrastive loss, structural consistency loss, and semantic preservation loss.

Result: Outperforms benchmarks on COCO-2014 dataset in CLIP Score, FID, and SSIM, balancing semantic alignment and structural fidelity without added complexity.

Conclusion: The method effectively generates semantically clear and structurally complete images, advancing text-image modeling and generation.

Abstract: This paper addresses the performance bottlenecks of existing text-driven image generation methods in terms of semantic alignment accuracy and structural consistency. A high-fidelity image generation method is proposed by integrating text-image contrastive constraints with structural guidance mechanisms. The approach introduces a contrastive learning module that builds strong cross-modal alignment constraints to improve semantic matching between text and image. At the same time, structural priors such as semantic layout maps or edge sketches are used to guide the generator in spatial-level structural modeling. This enhances the layout completeness and detail fidelity of the generated images. Within the overall framework, the model jointly optimizes contrastive loss, structural consistency loss, and semantic preservation loss. A multi-objective supervision mechanism is adopted to improve the semantic consistency and controllability of the generated content. Systematic experiments are conducted on the COCO-2014 dataset. Sensitivity analyses are performed on embedding dimensions, text length, and structural guidance strength. Quantitative metrics confirm the superior performance of the proposed method in terms of CLIP Score, FID, and SSIM. The results show that the method effectively bridges the gap between semantic alignment and structural fidelity without increasing computational complexity. It demonstrates a strong ability to generate semantically clear and structurally complete images, offering a viable technical path for joint text-image modeling and image generation.
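
The contrastive alignment module is described generically; the standard form of such a cross-modal constraint is a symmetric InfoNCE over matched text-image pairs, sketched below (the paper's exact loss may differ).

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched text-image pairs, the standard form of
    the cross-modal alignment constraint the paper describes (its exact loss
    may differ). Embeddings: (batch, dim); row i of each batch is a match."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```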

[119] VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation

Ryota Tanaka, Tomohiro Suzuki, Keisuke Fujii

Main category: cs.CV

TL;DR: The paper proposes a new Temporal Action Segmentation (TAS) framework for figure skating jumps, addressing data insufficiency and 3D procedural structure limitations. It introduces a view-invariant pose representation and a fine-grained annotation scheme, achieving high accuracy.

DetailsMotivation: Accurate recognition of figure skating jumps is essential for performance evaluation but requires expert knowledge. Existing TAS methods lack sufficient data and ignore 3D aspects and procedural structure.

Method: The framework includes a View-Invariant, Figure Skating-Specific pose representation (VIFSS) with contrastive pre-training and action classification fine-tuning. It uses the FS-Jump3D dataset and a fine-grained annotation scheme for procedural learning.

Result: The method achieves over 92% F1@50 on element-level TAS, excelling in recognizing jump types and rotation levels. View-invariant pre-training is particularly effective with limited fine-tuning data.

Conclusion: The proposed framework effectively addresses the limitations of existing TAS methods for figure skating jumps, demonstrating high accuracy and practicality in real-world scenarios.

Abstract: Understanding human actions from videos plays a critical role across various domains, including sports analytics. In figure skating, accurately recognizing the type and timing of jumps a skater performs is essential for objective performance evaluation. However, this task typically requires expert-level knowledge due to the fine-grained and complex nature of jump procedures. While recent approaches have attempted to automate this task using Temporal Action Segmentation (TAS), there are two major limitations to TAS for figure skating: the annotated data is insufficient, and existing methods do not account for the inherent three-dimensional aspects and procedural structure of jump actions. In this work, we propose a new TAS framework for figure skating jumps that explicitly incorporates both the three-dimensional nature and the semantic procedure of jump movements. First, we propose a novel View-Invariant, Figure Skating-Specific pose representation learning approach (VIFSS) that combines contrastive learning as pre-training and action classification as fine-tuning. For view-invariant contrastive pre-training, we construct FS-Jump3D, the first publicly available 3D pose dataset specialized for figure skating jumps. Second, we introduce a fine-grained annotation scheme that marks the “entry (preparation)” and “landing” phases, enabling TAS models to learn the procedural structure of jumps. Extensive experiments demonstrate the effectiveness of our framework. Our method achieves over 92% F1@50 on element-level TAS, which requires recognizing both jump types and rotation levels. Furthermore, we show that view-invariant contrastive pre-training is particularly effective when fine-tuning data is limited, highlighting the practicality of our approach in real-world scenarios.
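
For reference, F1@50 is the segmental F1 score at a temporal IoU threshold of 0.5. The sketch below uses a simple greedy matching for brevity, whereas standard implementations match each prediction to its best-IoU ground-truth segment; the example segments are invented.

```python
def f1_at_k(pred_segs, gt_segs, iou_thresh=0.5):
    """Segmental F1@k as used in TAS (here k = 50, i.e. IoU >= 0.5): a predicted
    segment is a true positive if it shares a label with an unmatched
    ground-truth segment at sufficient temporal IoU. Segments: (start, end, label)."""
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union else 0.0

    matched = [False] * len(gt_segs)
    tp = 0
    for p in pred_segs:
        for i, g in enumerate(gt_segs):
            if not matched[i] and p[2] == g[2] and iou(p, g) >= iou_thresh:
                matched[i] = True
                tp += 1
                break
    fp, fn = len(pred_segs) - tp, len(gt_segs) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gt = [(0, 30, "entry"), (30, 45, "rotation"), (45, 60, "landing")]
pred = [(0, 28, "entry"), (28, 46, "rotation"), (50, 60, "landing")]
print(f1_at_k(pred, gt))  # 1.0 (all segments matched at IoU >= 0.5)
```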

[120] JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics

Simindokht Jahangard, Mehrzad Mohammadi, Yi Shen, Zhixi Cai, Hamid Rezatofighi

Main category: cs.CV

TL;DR: The paper introduces JRDB-Reasoning, a benchmark for visual reasoning in human-crowded environments, addressing limitations in existing benchmarks by formalizing reasoning complexity and providing customizable questions with detailed annotations.

DetailsMotivation: Existing visual reasoning benchmarks lack clear definitions of reasoning complexity, customizable question generation, and structured annotations, limiting their utility for evaluating embodied AI agents.

Method: The authors formalize reasoning complexity, develop an adaptive query engine for generating customizable questions with intermediate annotations, and extend the JRDB dataset with human-object interaction and geometric relationship annotations.

Result: JRDB-Reasoning is created as a tailored benchmark, enabling fine-grained evaluation of visual reasoning frameworks and dynamic assessment of vision-language models.

Conclusion: The proposed benchmark and engine address gaps in visual reasoning evaluation, offering tools for more nuanced and adaptable assessments of AI models.

Abstract: Recent advances in Vision-Language Models (VLMs) and large language models (LLMs) have greatly enhanced visual reasoning, a key capability for embodied AI agents like robots. However, existing visual reasoning benchmarks often suffer from several limitations: they lack a clear definition of reasoning complexity, offer have no control to generate questions over varying difficulty and task customization, and fail to provide structured, step-by-step reasoning annotations (workflows). To bridge these gaps, we formalize reasoning complexity, introduce an adaptive query engine that generates customizable questions of varying complexity with detailed intermediate annotations, and extend the JRDB dataset with human-object interaction and geometric relationship annotations to create JRDB-Reasoning, a benchmark tailored for visual reasoning in human-crowded environments. Our engine and benchmark enable fine-grained evaluation of visual reasoning frameworks and dynamic assessment of visual-language models across reasoning levels.

[121] A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method

Tao Huang, Hongbo Pan, Nanxi Zhou, Shun Zhou

Main category: cs.CV

TL;DR: The paper introduces PCWLAD, a method for high-accuracy multimodal optical image matching, combining SSIM for coarse matching and WLAD for fine matching, achieving ~0.4 pixel accuracy.

DetailsMotivation: Addressing degraded matching accuracy due to nonlinear radiation and geometric deformation in multimodal optical images.

Method: Two-step approach: coarse matching with SSIM and fine matching with WLAD, using phase consistency and mutual structure filtering.

Result: PCWLAD outperformed eight existing methods, achieving ~0.4 pixel accuracy across three datasets.

Conclusion: PCWLAD is effective for multimodal image matching, with publicly available software and datasets.

Abstract: High-accuracy matching of multimodal optical images is the basis of geometric processing. However, the image matching accuracy is usually degraded by the nonlinear radiation and geometric deformation differences caused by different spectral responses. To address these problems, we proposed a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method to improve the matching accuracy of multimodal optical images. This method consists of two main steps: coarse matching with the structural similarity index measure (SSIM) and fine matching with WLAD. In the coarse matching step, phase consistency (PC) maps are calculated without a noise filter to preserve the original structural details, and template matching is performed using the SSIM. In the fine matching step, we applied the radiometric and geometric transformation models between two multimodal PC templates based on the coarse matching. Furthermore, mutual structure filtering is adopted in the model to mitigate the impact of noise within the corresponding templates on the structural consistency, and the WLAD criterion is used to estimate the sub-pixel offset. To evaluate the performance of PCWLAD, we created three types of image datasets: visible to infrared Landsat images, visible to near-infrared close-range images, and visible to infrared uncrewed aerial vehicle (UAV) images. PCWLAD outperformed existing state-of-the-art eight methods in terms of correct matching rate (CMR) and root mean square error (RMSE) and reached an average matching accuracy of approximately 0.4 pixels across all three datasets. Our software and datasets are publicly available at https://github.com/huangtaocsu/PCWLAD.
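
The coarse step is plain SSIM template matching; a brute-force sketch follows. PCWLAD applies SSIM to phase-consistency maps, while this toy version matches raw intensities for brevity.

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_coarse_match(search_img, template):
    """Sketch of the coarse step: slide the template over the search image and
    keep the offset with the highest SSIM. (PCWLAD computes SSIM on phase-
    consistency maps; plain intensities are used here for brevity.)"""
    th, tw = template.shape
    H, W = search_img.shape
    best, best_xy = -1.0, (0, 0)
    for y in range(H - th + 1):
        for x in range(W - tw + 1):
            s = structural_similarity(search_img[y:y+th, x:x+tw], template,
                                      data_range=1.0)
            if s > best:
                best, best_xy = s, (x, y)
    return best_xy, best

rng = np.random.default_rng(1)
img = rng.random((40, 40))
tpl = img[12:28, 8:24].copy()        # ground-truth offset (x=8, y=12)
print(ssim_coarse_match(img, tpl))   # ((8, 12), ~1.0)
```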

[122] InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild

Yiyi Ma, Yuanzhi Liang, Xiu Li, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: InterSyn introduces a framework for realistic motion synthesis by integrating solo and multi-person dynamics through interleaved learning, outperforming existing methods in alignment and diversity.

DetailsMotivation: To address the limitations of previous methods that treat solo and multi-person dynamics separately, aiming for more natural and coordinated motion synthesis.

Method: Uses two modules: Interleaved Interaction Synthesis (INS) for unified modeling of solo and interactive behaviors, and Relative Coordination Refinement (REC) for refining mutual dynamics.

Result: Generates motion sequences with higher text-to-motion alignment and improved diversity, setting a new benchmark.

Conclusion: InterSyn advances motion synthesis by capturing nuanced coordination, with plans to open-source the code for further research.

Abstract: We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis. Additionally, our code will be open-sourced in the future to promote further research and development in this area.

[123] From Pixel to Mask: A Survey of Out-of-Distribution Segmentation

Wenjie Zhao, Jia Li, Yunhui Guo

Main category: cs.CV

TL;DR: A survey on OoD segmentation methods for autonomous driving, categorizing approaches and discussing challenges and future directions.

DetailsMotivation: Addressing the need for precise OoD object localization in safety-critical applications like autonomous driving.

Method: Categorizes OoD segmentation into four approaches: test-time segmentation, outlier exposure, reconstruction-based methods, and leveraging powerful models.

Result: Systematic review of advances, identification of challenges, and discussion of future research directions.

Conclusion: OoD segmentation is vital for autonomous driving, with ongoing challenges and opportunities for further research.

Abstract: Out-of-distribution (OoD) detection and segmentation have attracted growing attention as concerns about AI security rise. Conventional OoD detection methods identify the existence of OoD objects but lack spatial localization, limiting their usefulness in downstream tasks. OoD segmentation addresses this limitation by localizing anomalous objects at pixel-level granularity. This capability is crucial for safety-critical applications such as autonomous driving, where perception modules must not only detect but also precisely segment OoD objects, enabling targeted control actions and enhancing overall system robustness. In this survey, we group current OoD segmentation approaches into four categories: (i) test-time OoD segmentation, (ii) outlier exposure for supervised training, (iii) reconstruction-based methods, and (iv) approaches that leverage powerful models. We systematically review recent advances in OoD segmentation for autonomous-driving scenarios, identify emerging challenges, and discuss promising future research directions.

[124] Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances

Yuanzhi Liang, Yijie Fang, Rui Li, Ziqi Ni, Ruijie Su, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: A survey on using reinforcement learning (RL) to improve generative models for visual content, addressing misalignment issues with traditional objectives.

DetailsMotivation: Generative models often misalign with perceptual quality, semantic accuracy, or physical realism due to surrogate objectives like likelihood or reconstruction loss. RL provides a principled way to optimize non-differentiable, preference-driven goals.

Method: The paper reviews RL-based methods for visual content generation, tracing its evolution from classical control to a general-purpose optimization tool. It examines RL’s integration into image, video, and 3D/4D generation.

Result: RL enhances controllability, consistency, and human alignment in generative tasks, serving as both a fine-tuning mechanism and a structural component for high-level goals.

Conclusion: The survey highlights open challenges and future directions for RL in generative modeling, emphasizing its potential to align generation with complex objectives.

Abstract: Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.

[125] Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models

Andrew Bai, Justin Cui, Ruochen Wang, Cho-Jui Hsieh

Main category: cs.CV

TL;DR: Targeted training data selection improves vision-language benchmarks by focusing on matching concepts or skills, outperforming baselines by +0.9% on average.

DetailsMotivation: To address the dichotomy in vision-language benchmarks, where performance depends on similar skills or concepts, and optimize benchmark performance.

Method: Extract concepts/skills from benchmarks, determine their focus (concepts or skills), and select matching instructions for training.

Result: +0.9% improvement over baselines on average, with +1.5% on skill-focused benchmarks.

Conclusion: Recognizing the trade-off between concepts and skills in instruction selection is crucial for optimizing performance.

Abstract: Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we find that vision-language benchmarks fall into a dichotomy: each mainly benefits from training on instructions with similar skills or with similar visual concepts. Inspired by this discovery, we designed a simple targeted training data selection method to optimize the performance of a given benchmark. We first extract the concepts/skills from the benchmark, determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select instructions with the most matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off within instruction selection, which requires balancing the acquisition of conceptual knowledge against visual skills.
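
A rough sketch of the matching idea, assuming each benchmark question and each training instruction has already been summarized by a short concept/skill string; the paper's actual extraction pipeline is not shown, and `select_instructions` with TF-IDF scoring is a stand-in for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_instructions(benchmark_tags, instruction_tags, k=2):
    # Embed all tags, average the benchmark side, rank instructions by similarity.
    vec = TfidfVectorizer().fit(benchmark_tags + instruction_tags)
    bench_centroid = np.asarray(vec.transform(benchmark_tags).mean(axis=0))
    inst = vec.transform(instruction_tags)
    scores = cosine_similarity(bench_centroid, inst)[0]
    ranked = scores.argsort()[::-1][:k]
    return [(instruction_tags[i], float(scores[i])) for i in ranked]

bench = ["count objects in the image", "spatial relation between objects"]
pool = ["count the objects on the table",
        "describe the painting style",
        "what is the spatial relation of the chair and the lamp"]
print(select_instructions(bench, pool, k=2))
```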

[126] Glo-DMU: A Deep Morphometry Framework of Ultrastructural Characterization in Glomerular Electron Microscopic Images

Zhentai Zhang, Danyi Weng, Guibin Zhang, Xiang Chen, Kaixing Long, Jian Geng, Yanmeng Lu, Lei Zhang, Zhitao Zhou, Lei Cao

Main category: cs.CV

TL;DR: The paper introduces Glo-DMU, a deep learning framework for automated analysis of glomerular ultrastructure in kidney diseases, addressing limitations of current methods by quantifying multiple features simultaneously.

DetailsMotivation: Current computational pathology methods focus on single ultrastructure recognition, limiting practical diagnostic utility. Glo-DMU aims to automate and enhance analysis by quantifying multiple key features.

Method: Glo-DMU uses three deep models: ultrastructure segmentation, glomerular filtration barrier classification, and electron-dense deposits detection, following renal biopsy protocols.

Result: Tested on 115 patients with 9 renal pathologies, Glo-DMU showed high consistency with manual pathological reports, offering automation, precision, and throughput.

Conclusion: Glo-DMU is an efficient, automated tool for renal pathologists, enabling simultaneous quantification of multiple ultrastructural features with high accuracy.

Abstract: Complex and diverse ultrastructural features can indicate the type, progression, and prognosis of kidney diseases. Recently, computational pathology combined with deep learning methods has shown tremendous potential in advancing automatic morphological analysis of glomerular ultrastructure. However, current research predominantly focuses on the recognition of individual ultrastructure, which makes it challenging to meet practical diagnostic needs. In this study, we propose the glomerular morphometry framework of ultrastructural characterization (Glo-DMU), which is grounded on three deep models: the ultrastructure segmentation model, the glomerular filtration barrier region classification model, and the electron-dense deposits detection model. Following the conventional protocol of renal biopsy diagnosis, this framework simultaneously quantifies the three most widely used ultrastructural features: the thickness of glomerular basement membrane, the degree of foot process effacement, and the location of electron-dense deposits. We evaluated the framework on 115 patients spanning 9 renal pathological types in real-world diagnostic scenarios, demonstrating good consistency between automatic quantification results and morphological descriptions in the pathological reports. Glo-DMU possesses the characteristics of full automation, high precision, and high throughput, quantifying multiple ultrastructural features simultaneously, and providing an efficient tool for assisting renal pathologists.

[127] Improving OCR for Historical Texts of Multiple Languages

Hylke Westerdijk, Ben Blankenborg, Khondoker Ittehadul Islam

Main category: cs.CV

TL;DR: The paper details methodologies and results for OCR and Document Layout Analysis tasks using deep learning, including historical Hebrew text, 16th-18th-century documents, and modern English handwriting.

DetailsMotivation: To improve OCR and layout analysis for diverse document types (historical, early modern, and modern) using advanced deep learning techniques.

Method: Employed Kraken and TrOCR for Hebrew text, CRNN with DeepLabV3+ and Bidirectional LSTM for early modern documents, and CRNN with ResNet34 and CTC loss for modern handwriting.

Result: Enhanced character recognition and model refinement across tasks, providing actionable insights.

Conclusion: The study offers valuable insights and suggests future research directions for OCR and layout analysis.

Abstract: This paper presents our methodology and findings from three tasks across Optical Character Recognition (OCR) and Document Layout Analysis using advanced deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. For the task of analyzing 16th- to 18th-century meeting resolutions, we utilized a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudo-labeling to refine our model. Finally, for the modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to effectively capture sequential dependencies. This report offers valuable insights and suggests potential directions for future research.
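
For the third task, a minimal PyTorch sketch of a CRNN with a ResNet34 encoder trained under CTC loss; layer sizes, the input resolution, and the 80-class alphabet are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class CRNN(nn.Module):
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        base = resnet34(weights=None)
        self.backbone = nn.Sequential(*list(base.children())[:-2])  # conv trunk only
        self.pool = nn.AdaptiveAvgPool2d((1, None))                 # collapse height
        self.rnn = nn.LSTM(512, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes)              # incl. CTC blank

    def forward(self, x):                  # x: (B, 3, H, W)
        f = self.pool(self.backbone(x))    # (B, 512, 1, W')
        f = f.squeeze(2).transpose(1, 2)   # (B, W', 512)
        out, _ = self.rnn(f)
        return self.head(out)              # (B, W', num_classes)

model = CRNN(num_classes=80)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
imgs = torch.randn(2, 3, 32, 512)                       # two text-line crops
logits = model(imgs).log_softmax(-1).transpose(0, 1)    # (T, B, C) for CTC
targets = torch.randint(1, 80, (2, 8))                  # label ids, 0 = blank
loss = ctc(logits, targets,
           input_lengths=torch.full((2,), logits.size(0), dtype=torch.long),
           target_lengths=torch.full((2,), 8, dtype=torch.long))
```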

[128] AtomDiffuser: Time-Aware Degradation Modeling for Drift and Beam Damage in STEM Imaging

Hao Wang, Hongkui Zheng, Kai He, Abolfazl Razi

Main category: cs.CV

TL;DR: AtomDiffuser is a framework for disentangling spatial drift and beam-induced signal loss in STEM data, enabling interpretable atomic-resolution analysis.

DetailsMotivation: Existing methods struggle to separate and model the entangled degradation effects (spatial drift and beam-induced signal loss) in time-resolved STEM data, limiting accurate interpretation of atomic dynamics.

Method: AtomDiffuser predicts affine transformations and spatially varying decay maps between STEM frames, leveraging degradation as a temporally conditioned process.

Result: The framework generalizes well to real-world cryo-STEM data, supports high-resolution degradation inference, and aids in visualizing degradation patterns.

Conclusion: AtomDiffuser provides a robust tool for analyzing atomic-resolution structural evolutions in STEM data by explicitly modeling degradation effects.

Abstract: Scanning transmission electron microscopy (STEM) plays a critical role in modern materials science, enabling direct imaging of atomic structures and their evolution under external interferences. However, interpreting time-resolved STEM data remains challenging due to two entangled degradation effects: spatial drift caused by mechanical and thermal instabilities, and beam-induced signal loss resulting from radiation damage. These factors distort both geometry and intensity in complex, temporally correlated ways, making it difficult for existing methods to explicitly separate their effects or model material dynamics at atomic resolution. In this work, we present AtomDiffuser, a time-aware degradation modeling framework that disentangles sample drift and radiometric attenuation by predicting an affine transformation and a spatially varying decay map between any two STEM frames. Unlike traditional denoising or registration pipelines, our method leverages degradation as a physically heuristic, temporally conditioned process, enabling interpretable structural evolutions across time. Trained on synthetic degradation processes, AtomDiffuser also generalizes well to real-world cryo-STEM data. It further supports high-resolution degradation inference and drift alignment, offering tools for visualizing and quantifying degradation patterns that correlate with radiation-induced atomic instabilities.

[129] Contrast Sensitivity Function of Multimodal Vision-Language Models

Pablo Hernández-Cámara, Alexandra Gomez-Villa, Jose Manuel Jaén-Lorites, Jorge Vila-Tomás, Jesus Malo, Valero Laparra

Main category: cs.CV

TL;DR: The paper introduces a psychophysics-inspired method to estimate the contrast sensitivity function (CSF) of vision-language models (VLMs) by prompting them to judge pattern visibility. While some models approximate human CSF, none fully replicate it, and prompt phrasing significantly impacts responses.

DetailsMotivation: To assess how VLMs align with human perception of low-level visual features, particularly the CSF, which is crucial for understanding their visual sensitivity.

Method: A novel behavioral psychophysics-inspired approach using band-pass filtered noise images and diverse prompts to estimate CSF in VLMs.

Result: Some models partially mimic human CSF in shape or magnitude, but none fully replicate it. Prompt phrasing greatly influences model responses.

Conclusion: The study provides a framework for evaluating visual sensitivity in VLMs and highlights gaps between their representations and human perception, emphasizing prompt stability concerns.

Abstract: Assessing the alignment of multimodal vision-language models (VLMs) with human perception is essential to understand how they perceive low-level visual features. A key characteristic of human vision is the contrast sensitivity function (CSF), which describes sensitivity to spatial frequency at low contrasts. Here, we introduce a novel behavioral psychophysics-inspired method to estimate the CSF of chat-based VLMs by directly prompting them to judge pattern visibility at different contrasts for each frequency. This methodology is closer to real psychophysical experiments than previously reported approaches. Using band-pass filtered noise images and a diverse set of prompts, we assess model responses across multiple architectures. We find that while some models approximate human-like CSF shape or magnitude, none fully replicate both. Notably, prompt phrasing has a large effect on the responses, raising concerns about prompt stability. Our results provide a new framework for probing visual sensitivity in multimodal models and reveal key gaps between their visual representations and human perception.
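
A sketch of how a band-pass noise stimulus at a chosen spatial frequency and contrast could be synthesized; frequencies are in cycles per image, and `bandpass_noise` is an illustrative helper rather than the authors' stimulus code. The resulting image would then be shown to the VLM alongside a visibility prompt.

```python
import numpy as np

def bandpass_noise(size=256, center_freq=16, bandwidth=4, contrast=0.1, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size, size))
    fy = np.fft.fftfreq(size)[:, None] * size            # cycles per image
    fx = np.fft.fftfreq(size)[None, :] * size
    radius = np.hypot(fx, fy)
    band = np.exp(-0.5 * ((radius - center_freq) / bandwidth) ** 2)
    filtered = np.fft.ifft2(np.fft.fft2(noise) * band).real
    filtered /= np.abs(filtered).max() + 1e-8            # normalize to [-1, 1]
    return 0.5 + 0.5 * contrast * filtered               # mean-gray image in [0, 1]

img = bandpass_noise(center_freq=8, contrast=0.05)       # low-contrast stimulus
```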

[130] Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

Hyundo Lee, Suhyung Choi, Byoung-Tak Zhang, Inwoo Hwang

Main category: cs.CV

TL;DR: The paper introduces a method to improve image generation by co-generating images and their intrinsic scene properties (e.g., depth, segmentation maps) to enhance spatial consistency and realism.

DetailsMotivation: Existing image generation models often produce spatially inconsistent images due to limited structural information. Leveraging intrinsic scene properties can address this issue.

Method: The approach extracts intrinsic properties using pre-trained estimators, aggregates them into a latent variable, and integrates them into Latent Diffusion Models (LDMs) to denoise both image and intrinsic domains while sharing mutual information.

Result: The method corrects spatial inconsistencies, produces more natural scene layouts, and maintains image fidelity and textual alignment.

Conclusion: Co-generating images and intrinsics improves spatial consistency and realism in image synthesis without degrading quality.

Abstract: Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).

[131] Unlocking Robust Semantic Segmentation Performance via Label-only Elastic Deformations against Implicit Label Noise

Yechan Kim, Dongho Yoon, Younkwan Lee, Unse Fatima, Hong Kook Kim, Songjae Lee, Sanga Park, Jeong Ho Park, Seonjong Kang, Moongu Jeon

Main category: cs.CV

TL;DR: NSegment+ is a novel augmentation framework for semantic segmentation that decouples image and label transformations to address subtle label noise, improving model performance.

DetailsMotivation: Real-world datasets often have subtle label imperfections (e.g., ambiguous boundaries) that impair model performance, which typical augmentation methods may exacerbate.

Method: NSegment+ applies controlled elastic deformations only to segmentation labels while preserving original images, encouraging robust feature learning.

Result: Achieves mIoU gains of up to +2.29, +2.38, +1.75, and +3.39 on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively.

Conclusion: Addressing implicit label noise is crucial, and NSegment+ effectively improves segmentation performance, with further gains possible when combined with other training techniques.

Abstract: While previous studies on image segmentation focus on handling severe (or explicit) label noise, real-world datasets also exhibit subtle (or implicit) label imperfections. These arise from inherent challenges, such as ambiguous object boundaries and annotator variability. Although not explicitly present, such mild and latent noise can still impair model performance. Typical data augmentation methods, which apply identical transformations to the image and its label, risk amplifying these subtle imperfections and limiting the model’s generalization capacity. In this paper, we introduce NSegment+, a novel augmentation framework that decouples image and label transformations to address such realistic noise for semantic segmentation. By introducing controlled elastic deformations only to segmentation labels while preserving the original images, our method encourages models to focus on learning robust representations of object structures despite minor label inconsistencies. Extensive experiments demonstrate that NSegment+ consistently improves performance, achieving mIoU gains of up to +2.29, +2.38, +1.75, and +3.39 on average on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively, even without bells and whistles, highlighting the importance of addressing implicit label noise. These gains can be further amplified when combined with other training tricks, including CutMix and Label Smoothing.
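
A minimal sketch of the core augmentation idea, warping only the label with a smooth random displacement field while leaving the image untouched; the `alpha`/`sigma` values are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform_label(label, alpha=10.0, sigma=4.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = label.shape
    dx = gaussian_filter(rng.standard_normal((h, w)), sigma) * alpha
    dy = gaussian_filter(rng.standard_normal((h, w)), sigma) * alpha
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([yy + dy, xx + dx])
    # order=0 keeps class ids discrete (nearest-neighbour interpolation)
    return map_coordinates(label, coords, order=0, mode="nearest")

mask = np.zeros((64, 64), dtype=np.int64)
mask[20:40, 20:40] = 1
aug_mask = elastic_deform_label(mask)   # train on (original image, warped mask)
```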

[132] PQ-DAF: Pose-driven Quality-controlled Data Augmentation for Data-scarce Driver Distraction Detection

Haibin Sun, Xinghui Song

Main category: cs.CV

TL;DR: A framework (PQ-DAF) using pose-driven data augmentation and vision-language models improves few-shot driver distraction detection by enhancing cross-domain robustness.

DetailsMotivation: Addressing degraded generalization in driver distraction detection due to few-shot learning challenges and domain shifts.

Method: Proposes PQ-DAF with a Progressive Conditional Diffusion Model (PCDMs) for pose feature synthesis and a CogVLM-based quality filter.

Result: PQ-DAF significantly improves performance and generalization in few-shot scenarios.

Conclusion: PQ-DAF effectively tackles data scarcity and domain shift, enhancing real-world deployment of distraction detection models.

Abstract: Driver distraction detection is essential for improving traffic safety and reducing road accidents. However, existing models often suffer from degraded generalization when deployed in real-world scenarios. This limitation primarily arises from the few-shot learning challenge caused by the high cost of data annotation in practical environments, as well as the substantial domain shift between training datasets and target deployment conditions. To address these issues, we propose a Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF) that leverages a vision-language model for sample filtering to cost-effectively expand training data and enhance cross-domain robustness. Specifically, we employ a Progressive Conditional Diffusion Model (PCDMs) to accurately capture key driver pose features and synthesize diverse training examples. A sample quality assessment module, built upon the CogVLM vision-language model, is then introduced to filter out low-quality synthetic samples based on a confidence threshold, ensuring the reliability of the augmented dataset. Extensive experiments demonstrate that PQ-DAF substantially improves performance in few-shot driver distraction detection, achieving significant gains in model generalization under data-scarce conditions.

[133] Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models

Eunseo Koh, Seunghoo Hong, Tae-Young Kim, Simon S. Woo, Jae-Pil Heo

Main category: cs.CV

TL;DR: The paper proposes a method to suppress unwanted content in text-to-image diffusion models by modifying text embeddings with a delta vector, improving suppression of entangled concepts.

DetailsMotivation: Text-to-image models struggle with suppressing strongly entangled content (e.g., 'mustache' with 'Charlie Chaplin'), even when explicitly instructed to exclude it.

Method: Introduces a delta vector to modify text embeddings, obtained via zero-shot learning, and a Selective Suppression with Delta Vector (SSDV) method for cross-attention. Also optimizes delta vectors for personalized models.

Result: Outperforms existing methods in quantitative and qualitative metrics, enabling precise suppression of unwanted content.

Conclusion: The approach effectively addresses entanglement issues in text-to-image models, offering superior suppression capabilities.

Abstract: Text-to-Image (T2I) diffusion models have made significant progress in generating diverse high-quality images from textual prompts. However, these models still face challenges in suppressing content that is strongly entangled with specific words. For example, when generating an image of "Charlie Chaplin", a "mustache" consistently appears even if explicitly instructed not to include it, as the concept of "mustache" is strongly entangled with "Charlie Chaplin". To address this issue, we propose a novel approach to directly suppress such entangled content within the text embedding space of diffusion models. Our method introduces a delta vector that modifies the text embedding to weaken the influence of undesired content in the generated image, and we further demonstrate that this delta vector can be easily obtained through a zero-shot approach. Furthermore, we propose a Selective Suppression with Delta Vector (SSDV) method to adapt delta vector into the cross-attention mechanism, enabling more effective suppression of unwanted content in regions where it would otherwise be generated. Additionally, we enabled more precise suppression in personalized T2I models by optimizing delta vector, which previous baselines were unable to achieve. Extensive experimental results demonstrate that our approach significantly outperforms existing methods, both in terms of quantitative and qualitative metrics.
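
A conceptual sketch of the delta-vector idea: estimate a direction for the entangled concept from a pair of prompts and project it out of the target embedding. `encode` is a hypothetical stand-in for the diffusion model's text encoder, and the projection is one plausible way to apply the delta, not the paper's exact SSDV mechanism.

```python
import torch

def encode(prompt: str) -> torch.Tensor:
    # Hypothetical placeholder for a text encoder; returns a fixed-size embedding.
    torch.manual_seed(abs(hash(prompt)) % 2**31)
    return torch.randn(768)

# Concept direction: embedding difference with vs. without the unwanted content.
delta = encode("a man with a mustache") - encode("a man")
emb = encode("a photo of Charlie Chaplin")
# Remove the component of the embedding along the concept direction.
edited = emb - (emb @ delta) / (delta @ delta) * delta
```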

[134] SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection

Chaesong Park, Eunbin Seo, Jihyeon Hwang, Jongwoo Lim

Main category: cs.CV

TL;DR: SC-Lane is a slope-aware, temporally consistent framework for 3D lane detection, improving robustness and accuracy in heightmap estimation.

DetailsMotivation: Existing methods rely on fixed slope anchors, limiting adaptability to diverse road geometries. SC-Lane addresses this by dynamically integrating slope-specific features.

Method: Uses a Slope-Aware Adaptive Feature module for dynamic weight prediction and a Height Consistency Module for temporal coherence.

Result: Achieves state-of-the-art performance with an F-score of 64.3%, outperforming existing methods.

Conclusion: SC-Lane sets a rigorous standard for 3D lane detection, demonstrating significant improvements in height estimation and lane detection.

Abstract: In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the appropriate weights from image cues for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we employ three standardized metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy, which, although common in surface and depth estimation, have been underutilized for road height assessment. Using the LiDAR-derived heightmap dataset introduced in prior work [20], we benchmark our method under these metrics, thereby establishing a rigorous standard for future comparisons. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3%, outperforming existing methods by a notable margin. For detailed results and a demonstration video, please refer to our project page: https://parkchaesong.github.io/sclane/
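
A toy sketch of the slope-aware fusion idea: per-pixel softmax weights predicted from image features blend K slope-specific height maps into one heightmap. Shapes and the 1x1-conv weight head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SlopeAwareFusion(nn.Module):
    def __init__(self, feat_ch=64, num_slopes=3):
        super().__init__()
        self.weight_head = nn.Conv2d(feat_ch, num_slopes, kernel_size=1)

    def forward(self, feats, slope_heights):
        # feats: (B, C, H, W); slope_heights: (B, K, H, W)
        w = self.weight_head(feats).softmax(dim=1)            # per-pixel weights
        return (w * slope_heights).sum(dim=1, keepdim=True)   # fused heightmap

fuse = SlopeAwareFusion()
heightmap = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 3, 32, 32))
```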

[135] NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer

Shanyuan Liu, Jian Zhu, Junda Lu, Yue Gong, Liuzhuozheng Li, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng, Yuhui Yin

Main category: cs.CV

TL;DR: NanoControl introduces a lightweight, efficient method for controllable text-to-image generation using DiTs, reducing computational costs while maintaining high performance.

DetailsMotivation: Existing methods for controllable text-to-image generation with DiTs rely on ControlNet, which adds excessive parameters and computational overhead.

Method: NanoControl uses a LoRA-style control module and KV-Context Augmentation to integrate control signals efficiently into the DiT backbone.

Result: The model achieves state-of-the-art performance with minimal parameter and GFLOPs increases (0.024% and 0.029%, respectively).

Conclusion: NanoControl offers a highly efficient and effective solution for controllable generation, outperforming traditional approaches.

Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm originally designed for UNet-based diffusion models. This paradigm introduces significant parameter overhead and increased computational costs. To address these challenges, we propose the Nano Control Diffusion Transformer (NanoControl), which employs Flux as the backbone network. Our model achieves state-of-the-art controllable text-to-image generation performance while incurring only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, thus enabling highly efficient controllable generation. Specifically, rather than duplicating the DiT backbone for control, we design a LoRA-style (low-rank adaptation) control module that directly learns control signals from raw conditioning inputs. Furthermore, we introduce a KV-Context Augmentation mechanism that integrates condition-specific key-value information into the backbone in a simple yet highly effective manner, facilitating deep fusion of conditional features. Extensive benchmark experiments demonstrate that NanoControl significantly reduces computational overhead compared to conventional control approaches, while maintaining superior generation quality and achieving improved controllability.
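
A minimal sketch of a LoRA-style control adapter: a frozen base projection plus a zero-initialized low-rank path fed by the raw control features. Dimensions and placement are illustrative assumptions; the KV-Context Augmentation mechanism is not shown.

```python
import torch
import torch.nn as nn

class LoRAControl(nn.Module):
    def __init__(self, dim=1024, rank=8, scale=1.0):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)         # frozen backbone projection
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)          # control path starts as a no-op
        self.scale = scale

    def forward(self, x, control):
        return self.base(x) + self.scale * self.up(self.down(control))

layer = LoRAControl()
out = layer(torch.randn(2, 16, 1024), torch.randn(2, 16, 1024))
```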

[136] STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes

Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, Yu Yamaguchi

Main category: cs.CV

TL;DR: STRIDE-QA is a large-scale VQA dataset for spatiotemporal reasoning in autonomous driving, addressing limitations of current VLMs by providing dense annotations and novel QA tasks.

DetailsMotivation: Current VLMs trained on static image-text pairs lack precise spatiotemporal reasoning for dynamic traffic scenes, limiting their effectiveness in autonomous driving.

Method: STRIDE-QA is constructed from 100 hours of multi-sensor driving data in Tokyo, offering 16 million QA pairs over 285K frames with dense annotations like 3D bounding boxes and segmentation masks.

Result: Fine-tuned VLMs on STRIDE-QA show significant improvements (55% spatial localization, 28% motion prediction) compared to near-zero scores of general-purpose VLMs.

Conclusion: STRIDE-QA provides a foundation for developing more reliable VLMs for safety-critical autonomous systems.

Abstract: Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.

[137] CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation

Baichen Liu, Qi Lyu, Xudong Wang, Jiahua Dong, Lianqing Liu, Zhi Han

Main category: cs.CV

TL;DR: CRISP introduces a method for continual video instance segmentation using contrastive residual injection and semantic prompting to address instance-wise, category-wise, and task-wise confusion, outperforming existing methods.

DetailsMotivation: To address the challenges of plasticity, stability, and temporal consistency in continual video instance segmentation.

Method: Uses instance correlation loss, adaptive residual semantic prompts, and contrastive learning for semantic coherence. Introduces an initialization strategy for incremental prompts.

Result: CRISP outperforms existing methods on YouTube-VIS datasets, avoiding catastrophic forgetting and improving segmentation and classification.

Conclusion: CRISP is effective for continual video instance segmentation, offering a robust solution to confusion and forgetting.

Abstract: Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an early attempt to address the instance-wise, category-wise, and task-wise confusion in continual video instance segmentation. For instance-wise learning, we model instance tracking and construct instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated by category text and uses an adjustive query-prompt matching mechanism to build a mapping relationship between the query of the current task and the semantic residual prompt. Meanwhile, a semantic consistency loss based on contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure the correlation at the inter-task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/01upup10/CRISP.

[138] DOD-SA: Infrared-Visible Decoupled Object Detection with Single-Modality Annotations

Hang Jin, Chenqiang Gao, Junjie Guo, Fangcen Liu, Kanghui Tian, Qinyao Chang

Main category: cs.CV

TL;DR: Proposes DOD-SA, a decoupled object detection framework using single-modality annotations, reducing annotation costs while maintaining performance.

DetailsMotivation: Existing methods require costly dual-modality annotations; DOD-SA aims to leverage single-modality annotations for robust infrared-visible object detection.

Method: Uses CoSD-TSNet with SM-Branch and DMD-Branch, pseudo-labeling, and PaST training strategy (pretraining, guiding, refining). Introduces PLA for label alignment.

Result: Outperforms SOTA on DroneVehicle dataset.

Conclusion: DOD-SA effectively reduces annotation costs while achieving competitive performance through innovative architecture and training strategies.

Abstract: Infrared-visible object detection has shown great potential in real-world applications, enabling robust all-day perception by leveraging the complementary information of infrared and visible images. However, existing methods typically require dual-modality annotations to output detection results for both modalities during prediction, which incurs high annotation costs. To address this challenge, we propose a novel infrared-visible Decoupled Object Detection framework with Single-modality Annotations, called DOD-SA. The architecture of DOD-SA is built upon a Single- and Dual-Modality Collaborative Teacher-Student Network (CoSD-TSNet), which consists of a single-modality branch (SM-Branch) and a dual-modality decoupled branch (DMD-Branch). The teacher model generates pseudo-labels for the unlabeled modality, simultaneously supporting the training of the student model. The collaborative design enables cross-modality knowledge transfer from the labeled modality to the unlabeled modality, and facilitates effective SM-to-DMD branch supervision. To further improve the decoupling ability of the model and the pseudo-label quality, we introduce a Progressive and Self-Tuning Training Strategy (PaST) that trains the model in three stages: (1) pretraining SM-Branch, (2) guiding the learning of DMD-Branch by SM-Branch, and (3) refining DMD-Branch. In addition, we design a Pseudo Label Assigner (PLA) to align and pair labels across modalities, explicitly addressing modality misalignment during training. Extensive experiments on the DroneVehicle dataset demonstrate that our method outperforms state-of-the-art (SOTA) methods.

[139] SkeySpot: Automating Service Key Detection for Digital Electrical Layout Plans in the Construction Industry

Dhruv Dosi, Rohit Meena, Param Rajpura, Yogesh Kumar Meena

Main category: cs.CV

TL;DR: The paper introduces a dataset (DELP) and toolkit (SkeySpot) for automated symbol spotting in electrical layout plans, achieving high accuracy with YOLOv8 and supporting digitization for SMEs.

DetailsMotivation: Legacy floor plans lack machine-readability, making large-scale interpretation time-consuming and error-prone. Automated symbol spotting can streamline workflows like cost estimation and compliance.

Method: A labelled DELP dataset (45 plans, 2,450 instances, 34 classes) is created. Pretrained object detection models (e.g., YOLOv8) are benchmarked, with YOLOv8 achieving 82.5% mAP. SkeySpot is developed for real-time symbol detection.

Result: YOLOv8 performs best (82.5% mAP). SkeySpot enables real-time detection, classification, and quantification of symbols, producing standardized outputs for interoperability.

Conclusion: The approach reduces reliance on proprietary CAD systems and manual effort, making digitization accessible to SMEs while promoting standardization and sustainability in construction.

Abstract: Legacy floor plans, often preserved only as scanned documents, remain essential resources for architecture, urban planning, and facility management in the construction industry. However, the lack of machine-readable floor plans renders large-scale interpretation both time-consuming and error-prone. Automated symbol spotting offers a scalable solution by enabling the identification of service key symbols directly from floor plans, supporting workflows such as cost estimation, infrastructure maintenance, and regulatory compliance. This work introduces a labelled Digitised Electrical Layout Plans (DELP) dataset comprising 45 scanned electrical layout plans annotated with 2,450 instances across 34 distinct service key classes. A systematic evaluation framework is proposed using pretrained object detection models for the DELP dataset. Among the models benchmarked, YOLOv8 achieves the highest performance with a mean Average Precision (mAP) of 82.5%. Using YOLOv8, we develop SkeySpot, a lightweight, open-source toolkit for real-time detection, classification, and quantification of electrical symbols. SkeySpot produces structured, standardised outputs that can be scaled up for interoperable building information workflows, ultimately enabling compatibility across downstream applications and regulatory platforms. By lowering dependency on proprietary CAD systems and reducing manual annotation effort, this approach makes the digitisation of electrical layouts more accessible to small and medium-sized enterprises (SMEs) in the construction industry, while supporting broader goals of standardisation, interoperability, and sustainability in the built environment.
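
A short sketch of symbol spotting and per-class counting with a YOLOv8 detector through the ultralytics API; "delp_yolov8.pt" and "layout.png" are placeholder paths, not released artifacts.

```python
from collections import Counter
from ultralytics import YOLO

model = YOLO("delp_yolov8.pt")             # detector fine-tuned on DELP classes
results = model("layout.png", conf=0.25)   # run detection on one scanned plan

counts = Counter()
for r in results:
    for cls_id in r.boxes.cls.tolist():
        counts[model.names[int(cls_id)]] += 1
print(counts)                               # per-service-key quantities
```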

[140] From Images to Perception: Emergence of Perceptual Properties by Reconstructing Images

Pablo Hernández-Cámara, Jesus Malo, Valero Laparra

Main category: cs.CV

TL;DR: PerceptNet, a bio-inspired model, aligns with human perception in image tasks without human supervision, suggesting visual systems may optimize for moderate distortion levels.

DetailsMotivation: To explore if human visual perception emerges from image statistics and if bio-inspired models can learn perceptual metrics without human supervision.

Method: End-to-end optimization of PerceptNet for tasks like autoencoding, denoising, deblurring, and sparsity regularization.

Result: The V1-like encoder correlates highly with human perceptual judgments, especially for moderate noise, blur, and sparsity.

Conclusion: Visual systems may be tuned to handle moderate distortions, and bio-inspired models can learn perceptual metrics autonomously.

Abstract: A number of scientists have suggested that human visual perception may emerge from image statistics, shaping efficient neural representations in early vision. In this work, PerceptNet, a bio-inspired architecture that accommodates several known facts about the retina-V1 pathway, has been end-to-end optimized for different tasks related to image reconstruction: autoencoding, denoising, deblurring, and sparsity regularization. Our results show that the encoder stage (V1-like layer) consistently exhibits the highest correlation with human perceptual judgments on image distortion despite not using perceptual information in the initialization or training. This alignment exhibits an optimum for moderate noise, blur and sparsity. These findings suggest that the visual system may be tuned to remove those particular levels of distortion with that level of sparsity and that biologically inspired models can learn perceptual metrics without human supervision.

[141] Trajectory-aware Shifted State Space Models for Online Video Super-Resolution

Qiang Zhu, Xiandong Meng, Yuxian Jiang, Fan Zhang, David Bull, Shuyuan Zhu, Bing Zeng

Main category: cs.CV

TL;DR: The paper introduces TS-Mamba, an online VSR method using Trajectory-aware Shifted SSMs for efficient spatio-temporal aggregation, achieving state-of-the-art performance with reduced complexity.

DetailsMotivation: Existing online VSR methods lack long-range temporal modeling, limiting performance. SSMs offer linear complexity and global receptive fields, inspiring the proposed TS-Mamba.

Method: TS-Mamba uses trajectory modeling to select similar tokens from previous frames and employs a TSMA module with shifted SSMs blocks for aggregation, enhanced by Hilbert scannings and shift operations.

Result: TS-Mamba outperforms six benchmark models on three VSR datasets, achieving state-of-the-art performance with a 22.7% reduction in computational complexity.

Conclusion: TS-Mamba effectively combines long-term trajectory modeling and low-complexity Mamba for efficient online VSR, demonstrating superior performance and computational efficiency.

Abstract: Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Then, a Trajectory-aware Shifted Mamba Aggregation (TSMA) module consisting of proposed shifted SSMs blocks is employed to aggregate the selected tokens. The shifted SSMs blocks are designed based on Hilbert scannings and corresponding shift operations to compensate for scanning losses and strengthen the spatial continuity of Mamba. Additionally, we propose a trajectory-aware loss function to supervise the trajectory generation, ensuring the accuracy of token selection when training our model. Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% complexity reduction (in MACs). The source code for TS-Mamba will be available at https://github.com.
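
A small sketch of trajectory-style token selection: for each token in the current frame, pick the most similar token from the previous frame by cosine similarity. The shifted-SSM aggregation itself is not reproduced, and the shapes are illustrative.

```python
import torch

def select_trajectory_tokens(curr, prev):
    # curr: (N, d), prev: (M, d) token embeddings
    sim = torch.nn.functional.cosine_similarity(
        curr.unsqueeze(1), prev.unsqueeze(0), dim=-1)   # (N, M) similarities
    idx = sim.argmax(dim=1)                             # best previous token per query
    return prev[idx]                                    # (N, d) selected tokens

selected = select_trajectory_tokens(torch.randn(64, 32), torch.randn(64, 32))
```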

[142] Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers

Hanna Herasimchyk, Robin Labryga, Tomislav Prusina

Main category: cs.CV

TL;DR: A multi-head vision transformer approach for multi-label plant species prediction, leveraging taxonomic hierarchies and innovative techniques like multi-scale tiling and dynamic thresholding, achieved strong results in the PlantCLEF 2025 challenge.

DetailsMotivation: Addressing the challenge of domain shift in multi-species plant prediction by training on single-species images and testing on multi-species quadrat images.

Method: Uses a pre-trained DINOv2 Vision Transformer with multiple classification heads, multi-scale tiling, dynamic threshold optimization, and ensemble strategies like bagging and Hydra architectures.

Result: Achieved 3rd best performance on the private leaderboard, training on approximately 1.4 million images covering 7,806 species.

Conclusion: The approach demonstrates effectiveness in handling domain shifts and multi-label prediction, with code publicly available for reproducibility.

Abstract: We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, making our submission 3rd best on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.
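
A sketch of the multi-head setup: a frozen DINOv2 ViT-B/14 backbone (loaded via torch.hub) with separate species/genus/family heads. The genus and family class counts here are assumptions for illustration; only the species count (7,806) comes from the paper.

```python
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval().requires_grad_(False)       # frozen feature extractor

heads = nn.ModuleDict({
    "species": nn.Linear(768, 7806),
    "genus":   nn.Linear(768, 1500),        # assumed taxonomy size
    "family":  nn.Linear(768, 300),         # assumed taxonomy size
})

x = torch.randn(1, 3, 518, 518)             # input sides must be multiples of 14
with torch.no_grad():
    feat = backbone(x)                      # (1, 768) CLS embedding
logits = {name: head(feat) for name, head in heads.items()}
```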

[143] SingleStrip: learning skull-stripping from a single labeled example

Bella Specktor-Fadida, Malte Hoffmann

Main category: cs.CV

TL;DR: Combining domain randomization and self-training with autoencoder-based quality control enables effective skull-stripping segmentation using minimal labeled data.

DetailsMotivation: Manual labeling for volumetric images like brain MRI is labor-intensive, and existing domain-randomization techniques lack anatomical variability with few label maps. Semi-supervised self-training can mitigate label scarcity.

Method: 1. Automatically bin voxel intensities to synthesize images for initial training. 2. Train a convolutional autoencoder (AE) to assess pseudo-label quality via reconstruction error. 3. Fine-tune the network using top-ranking pseudo-labels.

Result: Achieves skull-stripping performance close to models trained with more labeled data. AE-based ranking correlates better with segmentation accuracy than consistency-based ranking.

Conclusion: The combined approach reduces labeling burden, aiding studies with limited labeled data or new imaging techniques.

Abstract: Deep learning segmentation relies heavily on labeled data, but manual labeling is laborious and time-consuming, especially for volumetric images such as brain magnetic resonance imaging (MRI). While recent domain-randomization techniques alleviate the dependency on labeled data by synthesizing diverse training images from label maps, they offer limited anatomical variability when very few label maps are available. Semi-supervised self-training addresses label scarcity by iteratively incorporating model predictions into the training set, enabling networks to learn from unlabeled data. In this work, we combine domain randomization with self-training to train three-dimensional skull-stripping networks using as little as a single labeled example. First, we automatically bin voxel intensities, yielding labels we use to synthesize images for training an initial skull-stripping model. Second, we train a convolutional autoencoder (AE) on the labeled example and use its reconstruction error to assess the quality of brain masks predicted for unlabeled data. Third, we select the top-ranking pseudo-labels to fine-tune the network, achieving skull-stripping performance on out-of-distribution data that approaches models trained with more labeled images. We compare AE-based ranking to consistency-based ranking under test-time augmentation, finding that the AE approach yields a stronger correlation with segmentation accuracy. Our results highlight the potential of combining domain randomization and AE-based quality control to enable effective semi-supervised segmentation from extremely limited labeled data. This strategy may ease the labeling burden that slows progress in studies involving new anatomical structures or emerging imaging techniques.
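
A compact sketch of the AE-based quality control step: score each predicted brain mask by its reconstruction error under an autoencoder and keep the best-ranked ones for fine-tuning. The tiny AE below is illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MaskAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 1, 4, 2, 1), nn.Sigmoid())

    def forward(self, m):
        return self.dec(self.enc(m))

def rank_pseudo_labels(ae, masks, top_k=2):
    with torch.no_grad():
        err = ((ae(masks) - masks) ** 2).mean(dim=(1, 2, 3))  # per-mask MSE
    return err.argsort()[:top_k]          # indices of the most trustworthy masks

ae = MaskAE()                             # would be trained on the labeled example
masks = torch.rand(8, 1, 64, 64).round()  # predicted masks for unlabeled scans
keep = rank_pseudo_labels(ae, masks)
```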

[144] Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition

Maimunatu Tunau, Vincent Gbouna Zakka, Zhuangzhuang Dai

Main category: cs.CV

TL;DR: The paper evaluates and enhances three data processing methods (DBSCAN, Hungarian Algorithm, Kalman Filtering) for mmWave radar-based Human Action Recognition (HAR), analyzing their performance and combinations to improve accuracy and computational efficiency.

DetailsMotivation: Traditional vision-based HAR systems raise privacy concerns, prompting the need for privacy-preserving alternatives like mmWave radar. However, radar data is sparse and noisy, requiring effective processing methods.

Method: The study evaluates DBSCAN, Hungarian Algorithm, and Kalman Filtering individually, in pairs, and combined, using the MiliPoint dataset. It also proposes enhancements to these methods.

Result: The analysis provides insights into the performance and trade-offs of each method and their combinations, highlighting improvements in recognition accuracy.

Conclusion: The findings guide future mmWave-based HAR systems by clarifying the strengths and limitations of the evaluated methods and their integrations.

Abstract: Human Action Recognition (HAR) plays a crucial role in healthcare, fitness tracking, and ambient assisted living technologies. While traditional vision-based HAR systems are effective, they pose privacy concerns. mmWave radar sensors offer a privacy-preserving alternative but present challenges due to the sparse and noisy nature of their point cloud data. In the literature, three primary data processing methods, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Hungarian Algorithm, and Kalman Filtering, have been widely used to improve the quality and continuity of radar data. However, a comprehensive evaluation of these methods, both individually and in combination, remains lacking. This paper addresses that gap by conducting a detailed performance analysis of the three methods using the MiliPoint dataset. We evaluate each method individually, all possible pairwise combinations, and the combination of all three, assessing both recognition accuracy and computational cost. Furthermore, we propose targeted enhancements to the individual methods aimed at improving accuracy. Our results provide crucial insights into the strengths and trade-offs of each method and their integrations, guiding future work on mmWave-based HAR systems.
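
A sketch of the DBSCAN + Hungarian stage for sparse radar point clouds: cluster each frame, then associate cluster centroids across consecutive frames. The `eps`/`min_samples` values are illustrative, not tuned for MiliPoint, and the Kalman smoothing step is omitted.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def centroids(points, eps=0.3, min_samples=5):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return np.array([points[labels == k].mean(axis=0)
                     for k in set(labels) if k != -1])   # skip noise label -1

prev = centroids(np.random.rand(200, 3))   # stand-in for frame t-1 points
curr = centroids(np.random.rand(200, 3))   # stand-in for frame t points
if len(prev) and len(curr):
    cost = cdist(prev, curr)                     # pairwise centroid distances
    rows, cols = linear_sum_assignment(cost)     # Hungarian matching
    tracks = list(zip(rows, cols))               # frame-to-frame associations
```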

[145] STAMP: Multi-pattern Attention-aware Multiple Instance Learning for STAS Diagnosis in Multi-center Histopathology Images

Liangrui Pan, Xiaoyu Li, Guang Zhu, Guanting Li, Ruixin Wang, Jiadi Luo, Yaning Yang, Liang Qingchun, Shaoliang Peng

Main category: cs.CV

TL;DR: A deep learning model (STAMP) is proposed for diagnosing Spread through Air Spaces (STAS) in lung adenocarcinoma, achieving high accuracy across multi-center datasets.

DetailsMotivation: STAS in lung adenocarcinoma is linked to poor outcomes, but diagnosis is challenging due to its complex pathology. Automated tools are needed to improve accuracy and efficiency.

Method: The study uses histopathological images from multiple hospitals and TCGA-LUAD, annotated by pathologists. STAMP, a multi-pattern attention-aware framework, combines dual-branch learning, transformer-based encoding, and similarity regularization.

Result: STAMP achieved AUCs of 0.8058, 0.8017, and 0.7928 on three datasets, outperforming clinical standards.

Conclusion: STAMP effectively diagnoses STAS, offering a scalable solution for clinical practice.

Abstract: Spread through air spaces (STAS) constitutes a novel invasive pattern in lung adenocarcinoma (LUAD), associated with tumor recurrence and diminished survival rates. However, large-scale STAS diagnosis in LUAD remains a labor-intensive endeavor, compounded by the propensity for oversight and misdiagnosis due to its distinctive pathological characteristics and morphological features. Consequently, there is a pressing clinical imperative to leverage deep learning models for STAS diagnosis. This study initially assembled histopathological images from STAS patients at the Second Xiangya Hospital and the Third Xiangya Hospital of Central South University, alongside the TCGA-LUAD cohort. Three senior pathologists conducted cross-verification annotations to construct the STAS-SXY, STAS-TXY, and STAS-TCGA datasets. We then propose a multi-pattern attention-aware multiple instance learning framework, named STAMP, to analyze and diagnose the presence of STAS across multi-center histopathology images. Specifically, the dual-branch architecture guides the model to learn STAS-associated pathological features from distinct semantic spaces. Transformer-based instance encoding and a multi-pattern attention aggregation module dynamically select regions closely associated with STAS pathology, suppressing irrelevant noise and enhancing the discriminative power of global representations. Moreover, a similarity regularization constraint prevents feature redundancy across branches, thereby improving overall diagnostic accuracy. Extensive experiments demonstrated that STAMP achieved competitive diagnostic results on STAS-SXY, STAS-TXY and STAS-TCGA, with AUCs of 0.8058, 0.8017, and 0.7928, respectively, surpassing the clinical level.
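
A minimal sketch of attention-based multiple-instance pooling, the generic mechanism underlying attention-aware MIL aggregation; dimensions are illustrative, and STAMP's dual branches, prompts, and regularizers are omitted.

```python
import torch
import torch.nn as nn

class AttnMILPool(nn.Module):
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, instances):                   # (N, dim) patch embeddings
        a = self.score(instances).softmax(dim=0)    # (N, 1) attention weights
        return (a * instances).sum(dim=0), a        # slide-level embedding

pool = AttnMILPool()
slide_emb, attn = pool(torch.randn(1000, 512))      # 1000 patches -> one vector
```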

[146] TweezeEdit: Consistent and Efficient Image Editing with Path Regularization

Jianda Mao, Kaibo Wang, Yang Xiang, Kani Chen

Main category: cs.CV

TL;DR: TweezeEdit is a tuning- and inversion-free framework for efficient and consistent image editing using diffusion models, outperforming existing methods in semantic preservation and speed.

DetailsMotivation: Existing methods over-align with target prompts and inadequately preserve source image semantics, relying on inefficient inversion anchors.

Method: TweezeEdit regularizes the entire denoising path, uses gradient-driven regularization, and injects target prompt semantics via a consistency model.

Result: TweezeEdit achieves superior semantic preservation and target alignment, requiring only 12 steps (1.6 seconds per edit).

Conclusion: TweezeEdit offers a faster, more efficient, and semantically consistent alternative for real-time image editing.

Abstract: Large-scale pre-trained diffusion models empower users to edit images through text guidance. However, existing methods often over-align with target prompts while inadequately preserving source image semantics. Such approaches generate target images explicitly or implicitly from the inversion noise of the source images, termed the inversion anchors. We identify this strategy as suboptimal for semantic preservation and inefficient due to elongated editing paths. We propose TweezeEdit, a tuning- and inversion-free framework for consistent and efficient image editing. Our method addresses these limitations by regularizing the entire denoising path rather than relying solely on the inversion anchors, ensuring source semantic retention and shortening editing paths. Guided by gradient-driven regularization, we efficiently inject target prompt semantics along a direct path using a consistency model. Extensive experiments demonstrate TweezeEdit’s superior performance in semantic preservation and target alignment, outperforming existing methods. Remarkably, it requires only 12 steps (1.6 seconds per edit), underscoring its potential for real-time applications.

[147] Multi-Sample Anti-Aliasing and Constrained Optimization for 3D Gaussian Splatting

Zheng Zhou, Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia

Main category: cs.CV

TL;DR: The paper proposes a framework combining MSAA and dual geometric constraints to enhance 3D Gaussian splatting, improving detail preservation in high-frequency textures and sharp discontinuities while maintaining real-time rendering.

DetailsMotivation: Current 3D Gaussian splatting methods lack sufficient geometric constraints, leading to blurred reconstructions of fine details, especially in high-frequency regions.

Method: The framework integrates MSAA with adaptive blending of quadruple subsamples and introduces two geometric constraints: adaptive weighting for under-reconstructed regions and gradient differential constraints for boundaries.

Result: The method achieves state-of-the-art performance in detail preservation, with significant improvements in SSIM and LPIPS metrics, while retaining real-time efficiency.

Conclusion: The proposed framework effectively addresses geometric constraints in 3D Gaussian splatting, enhancing detail preservation and rendering quality without compromising speed.

Abstract: Recent advances in 3D Gaussian splatting have significantly improved real-time novel view synthesis, yet insufficient geometric constraints during scene optimization often result in blurred reconstructions of fine-grained details, particularly in regions with high-frequency textures and sharp discontinuities. To address this, we propose a comprehensive optimization framework integrating multisample anti-aliasing (MSAA) with dual geometric constraints. Our system computes pixel colors through adaptive blending of quadruple subsamples, effectively reducing aliasing artifacts in high-frequency components. The framework introduces two constraints: (a) an adaptive weighting strategy that prioritizes under-reconstructed regions through dynamic gradient analysis, and (b) gradient differential constraints enforcing geometric regularization at object boundaries. This targeted optimization enables the model to allocate computational resources preferentially to critical regions requiring refinement while maintaining global consistency. Extensive experimental evaluations across multiple benchmarks demonstrate that our method achieves state-of-the-art performance in detail preservation, particularly in preserving high-frequency textures and sharp discontinuities, while maintaining real-time rendering efficiency. Quantitative metrics and perceptual studies confirm statistically significant improvements over baseline approaches in both structural similarity (SSIM) and perceptual quality (LPIPS).
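
To make the MSAA component concrete, the sketch below blends four renders taken at 2x2 subpixel offsets. The paper's blending is adaptive; the uniform weights and the `render` placeholder standing in for the Gaussian-splatting rasterizer are simplifications for illustration.

```python
# Four-subsample anti-aliasing: render at 2x2 subpixel offsets and blend.
import torch

def msaa_render(render, H, W, weights=None):
    """Blend four subpixel-offset renders; `render(H, W, dx, dy) -> (H, W, 3)`."""
    offsets = [(-0.25, -0.25), (-0.25, 0.25), (0.25, -0.25), (0.25, 0.25)]
    samples = torch.stack([render(H, W, dx, dy) for dx, dy in offsets])  # (4,H,W,3)
    if weights is None:                           # the paper blends adaptively;
        weights = torch.full((4, 1, 1, 1), 0.25)  # uniform is a stand-in here
    return (weights * samples).sum(dim=0)
```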

[148] A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection

Yangjie Xiao, Ke Zhang, Jiacun Wang, Xin Sheng, Yurong Guo, Meijuan Chen, Zehua Ren, Zhaoye Zheng, Zhenbing Zhao

Main category: cs.CV

TL;DR: A segmentation-driven bolt defect editing method (SBDE) is proposed to augment datasets for bolt defect detection, improving performance by generating high-quality defect images.

DetailsMotivation: The scarcity of defect images and imbalanced data distributions hinder bolt defect detection performance.

Method: SBDE involves a bolt attribute segmentation model (Bolt-SAM), a mask optimization module (MOD) integrated with LaMa for defect editing, and an editing recovery augmentation (ERA) strategy.

Result: SBDE-generated images outperform state-of-the-art models and enhance bolt defect detection performance.

Conclusion: SBDE is effective and has significant application potential for bolt defect detection.

Abstract: Bolt defect detection is critical to ensure the safety of transmission lines. However, the scarcity of defect images and imbalanced data distributions significantly limit detection performance. To address this problem, we propose a segmentation-driven bolt defect editing method (SBDE) to augment the dataset. First, a bolt attribute segmentation model (Bolt-SAM) is proposed, which enhances the segmentation of complex bolt attributes through the CLAHE-FFT Adapter (CFA) and Multipart-Aware Mask Decoder (MAMD), generating high-quality masks for subsequent editing tasks. Second, a mask optimization module (MOD) is designed and integrated with the image inpainting model (LaMa) to construct the bolt defect attribute editing model (MOD-LaMa), which converts normal bolts into defective ones through attribute editing. Finally, an editing recovery augmentation (ERA) strategy is proposed to recover and put the edited defect bolts back into the original inspection scenes and expand the defect detection dataset. We constructed multiple bolt datasets and conducted extensive experiments. Experimental results demonstrate that the bolt defect images generated by SBDE significantly outperform state-of-the-art image editing models, and effectively improve the performance of bolt defect detection, which fully verifies the effectiveness and application potential of the proposed method. The code of the project is available at https://github.com/Jay-xyj/SBDE.

[149] EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba

Quang Nguyen, Nhat Le, Baoru Huang, Minh Nhat Vu, Chengcheng Tang, Van Nguyen, Ngan Le, Thieu Vo, Anh Nguyen

Main category: cs.CV

TL;DR: A new method for predicting human dance motion from egocentric video and music, using a dataset (EgoAIST++) and a Skeleton Mamba-based network, outperforming state-of-the-art approaches.

DetailsMotivation: Jointly estimating human dance motion from egocentric video and music is underexplored, despite its industrial applications. Challenges include obscured body views and aligning motion with music.

Method: Developed EgoMusic Motion Network with Skeleton Mamba, leveraging diffusion models and Mamba for sequence modeling, using the EgoAIST++ dataset.

Result: The method outperforms state-of-the-art approaches and generalizes well to real-world data.

Conclusion: The proposed approach effectively combines egocentric video and music for dance motion prediction, demonstrating superior performance and generalization.

Abstract: Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines both egocentric views and music with more than 36 hours of dancing motion. Drawing on the success of diffusion models and Mamba on modeling sequences, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We show that our approach is theoretically supported. Intensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.

[150] Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies

Ayushman Sarkar, Mohd Yamani Idna Idris, Zhenyu Yu

Main category: cs.CV

TL;DR: This survey categorizes visual reasoning into five types, reviews methodologies and evaluation protocols, and identifies open challenges and future research directions.

DetailsMotivation: To unify and analyze diverse visual reasoning types, methodologies, and evaluations, addressing gaps in existing surveys.

Method: Categorizes visual reasoning into relational, symbolic, temporal, causal, and commonsense types, and reviews architectures like graph-based models and neuro-symbolic systems.

Result: Highlights limitations in evaluation protocols and identifies challenges like scalability, integration of paradigms, and lack of benchmarks.

Conclusion: Bridging perception and reasoning is crucial for transparent, trustworthy AI, especially in critical domains like autonomous driving and medical diagnostics.

Abstract: Visual reasoning is critical for a wide range of computer vision tasks that go beyond surface-level object detection and classification. Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often address these directions in isolation, lacking a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This survey aims to address this gap by categorizing visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examining their implementation through architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of generalizability, reproducibility, and explanatory power. Beyond evaluation, we identify key open challenges in visual reasoning, including scalability to complex scenes, deeper integration of symbolic and neural paradigms, the lack of comprehensive benchmark datasets, and reasoning under weak supervision. Finally, we outline a forward-looking research agenda for next-generation vision systems, emphasizing that bridging perception and reasoning is essential for building transparent, trustworthy, and cross-domain adaptive AI systems, particularly in critical domains such as autonomous driving and medical diagnostics.

[151] Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset

Ziye Deng, Ruihan He, Jiaxiang Liu, Yuan Wang, Zijie Meng, Songtao Jiang, Yong Xie, Zuozhu Liu

Main category: cs.CV

TL;DR: The paper introduces Med-GLIP-5M, a large-scale medical image grounding dataset, and Med-GLIP, a modality-aware framework for aligning language with image regions, improving performance in tasks like VQA and report generation.

DetailsMotivation: Existing medical image grounding research lacks modality diversity, fine-grained annotations, and a unified framework, limiting its applicability.

Method: Constructed Med-GLIP-5M with 5.3M region-level annotations across seven modalities, then developed Med-GLIP, a modality-aware framework trained on this dataset.

Result: Med-GLIP outperforms state-of-the-art baselines in grounding tasks and enhances downstream tasks like VQA and report generation.

Conclusion: The dataset and framework address key limitations in medical image grounding, offering improved performance and generalizability.

Abstract: Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and automated report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset Med-GLIP-5M comprising over 5.3 million region-level annotations across seven imaging modalities, covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data – enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.

[152] GCRPNet: Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images

Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong

Main category: cs.CV

TL;DR: The paper proposes GCRPNet, a graph-enhanced network for salient object detection in remote sensing images, addressing challenges like scale variations and low contrast by integrating global and local features using Mamba architecture and novel modules.

DetailsMotivation: Salient object detection in remote sensing images is hindered by scale variations and low contrast. Existing methods struggle to integrate global and local features effectively.

Method: GCRPNet uses a VSS encoder for multi-scale features, a DS-HGAM module for cross-layer interaction, and a LEVSS block with adaptive scanning and MCAEM for local modeling.

Result: Extensive experiments show GCRPNet achieves state-of-the-art performance.

Conclusion: GCRPNet effectively addresses the challenges of SOD in ORSIs, demonstrating superior performance through innovative feature integration and local enhancement.

Abstract: Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformers (ViTs) and convolutional neural networks (CNNs) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose a graph-enhanced contextual and regional perception network (GCRPNet), which builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space (VSS) encoder to extract multi-scale features. To further achieve deep guidance and enhancement of these features, we first design a difference-similarity guided hierarchical graph attention module (DS-HGAM). This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model’s structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the LEVSS block as the decoder of GCRPNet. This module integrates our proposed adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information and enhancing Mamba’s local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.

[153] PSScreen: Partially Supervised Multiple Retinal Disease Screening

Boyi Zheng, Qing Liu

Main category: cs.CV

TL;DR: PSScreen is a novel model for multiple retinal disease screening using partially labeled datasets, addressing domain shifts and label absence via dual-stream learning, feature distillation, and pseudo-label consistency.

DetailsMotivation: To reduce reliance on fully annotated datasets and tackle challenges like domain shifts and missing labels in partially labeled datasets for retinal disease screening.

Method: PSScreen uses two streams (deterministic and probabilistic features), textual guidance for feature decoupling and alignment, pseudo-label consistency, and self-distillation.

Result: Significantly improves detection performance for six retinal diseases and normal cases, achieving state-of-the-art results on in-domain and out-of-domain datasets.

Conclusion: PSScreen effectively addresses domain shifts and label absence, enhancing retinal disease screening performance.

Abstract: Leveraging multiple partially labeled datasets to train a model for multiple retinal disease screening reduces the reliance on fully annotated datasets, but remains challenging due to significant domain shifts across training datasets from various medical sites, and the label absence issue for partial classes. To solve these challenges, we propose PSScreen, a novel Partially Supervised multiple retinal disease Screening model. Our PSScreen consists of two streams: one learns deterministic features and the other learns probabilistic features via uncertainty injection. Then, we leverage textual guidance to decouple the two types of features into disease-wise features and align them via feature distillation to boost the domain generalization ability. Meanwhile, we employ pseudo-label consistency between the two streams to address the label absence issue and introduce a self-distillation to transfer task-relevant semantics about known classes from the deterministic to the probabilistic stream to further enhance detection performance. Experiments show that our PSScreen significantly enhances average detection performance on six retinal diseases and the normal state, and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Codes are available at https://github.com/boyiZheng99/PSScreen.
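
A minimal sketch of the pseudo-label consistency idea is given below: where ground-truth labels are absent, the deterministic stream's confident predictions supervise the probabilistic stream. The confidence threshold and masking scheme are assumptions; the paper's uncertainty injection and distillation terms are not shown.

```python
# Pseudo-label consistency between two streams for multi-label screening:
# unlabeled classes with confident deterministic predictions supervise the
# probabilistic stream. Threshold 0.9 is an illustrative choice.
import torch
import torch.nn.functional as F

def pseudo_label_consistency(det_logits, prob_logits, label_mask, thresh=0.9):
    """det_logits/prob_logits: (B, C); label_mask: (B, C) True where labeled."""
    with torch.no_grad():
        p = torch.sigmoid(det_logits)
        confident = ((p > thresh) | (p < 1 - thresh)) & ~label_mask
        pseudo = (p > 0.5).float()             # hard pseudo-labels
    loss = F.binary_cross_entropy_with_logits(prob_logits, pseudo,
                                              reduction="none")
    return (loss * confident.float()).sum() / confident.float().sum().clamp(min=1)
```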

[154] AR Surgical Navigation With Surface Tracing: Comparing In-Situ Visualization with Tool-Tracking Guidance for Neurosurgical Applications

Marc J. Fischer, Jeffrey Potts, Gabriel Urreola, Dax Jones, Paolo Palmisciano, E. Bradley Strong, Branden Cord, Andrew D. Hernandez, Julia D. Sharma, E. Brandon Strong

Main category: cs.CV

TL;DR: AR surgical navigation improves precision in simulated catheter placements using HoloLens 2, with real-time tool-tracking outperforming static visualization.

DetailsMotivation: Overcome AR depth perception and occlusion issues in surgical navigation for precision-critical tasks.

Method: Novel surface tracing and real-time infrared tool tracking with HoloLens 2, tested in simulated catheter placements under two AR guidance conditions.

Result: Tool-tracking guidance improved accuracy and was preferred by users.

Conclusion: Real-time AR tool-tracking enhances surgical precision and usability.

Abstract: Augmented Reality (AR) surgical navigation systems are emerging as the next generation of intraoperative surgical guidance, promising to overcome limitations of traditional navigation systems. However, known issues with AR depth perception due to vergence-accommodation conflict and occlusion handling limitations of the currently commercially available display technology present acute challenges in surgical settings where precision is paramount. This study presents a novel methodology for utilizing AR guidance to register anatomical targets and provide real-time instrument navigation using placement of simulated external ventricular drain catheters on a phantom model as the clinical scenario. The system registers target positions to the patient through a novel surface tracing method and uses real-time infrared tool tracking to aid in catheter placement, relying only on the onboard sensors of the Microsoft HoloLens 2. A group of intended users performed the procedure of simulated insertions under two AR guidance conditions: static in-situ visualization, where planned trajectories are overlaid directly onto the patient anatomy, and real-time tool-tracking guidance, where live feedback of the catheter’s pose is provided relative to the plan. Following the insertion tests, computed tomography scans of the phantom models were acquired, allowing for evaluation of insertion accuracy, target deviation, angular error, and depth precision. System Usability Scale surveys assessed user experience and cognitive workload. Tool-tracking guidance improved performance metrics across all accuracy measures and was preferred by users in subjective evaluations. A free copy of this paper and all supplemental materials are available at https://bit.ly/45l89Hq.

[155] Retrieval-Augmented Prompt for OOD Detection

Ruisong Han, Zongbo Han, Jiahao Zhang, Mingyue Cheng, Changqing Zhang

Main category: cs.CV

TL;DR: RAP, a novel OOD detection method, enhances semantic supervision by retrieving external knowledge to augment prompts in a pre-trained vision-language model, achieving state-of-the-art performance.

DetailsMotivation: Existing OOD detection methods lack sufficient semantic supervision due to limited or mismatched outlier samples, leading to suboptimal performance.

Method: RAP retrieves descriptive words for outliers using external textual knowledge to augment prompts during training and dynamically updates prompts during testing.

Result: RAP reduces FPR95 by 7.05% and improves AUROC by 1.71% on ImageNet-1k compared to previous methods.

Conclusion: RAP effectively addresses the limitations of existing methods and demonstrates superior performance in OOD detection.

Abstract: Out-of-Distribution (OOD) detection is crucial for the reliable deployment of machine learning models in the wild, enabling accurate identification of test samples that differ from the training data distribution. Existing methods rely on auxiliary outlier samples or in-distribution (ID) data to generate outlier information for training, but due to limited outliers and their mismatch with real test OOD samples, they often fail to provide sufficient semantic supervision, leading to suboptimal performance. To address this, we propose a novel OOD detection method called Retrieval-Augmented Prompt (RAP). RAP augments a pre-trained vision-language model’s prompts by retrieving external knowledge, offering enhanced semantic supervision for OOD detection. During training, RAP retrieves descriptive words for outliers based on joint similarity with external textual knowledge and uses them to augment the model’s OOD prompts. During testing, RAP dynamically updates OOD prompts in real-time based on the encountered OOD samples, enabling the model to rapidly adapt to the test environment. Our extensive experiments demonstrate that RAP achieves state-of-the-art performance on large-scale OOD detection benchmarks. For example, in 1-shot OOD detection on the ImageNet-1k dataset, RAP reduces the average FPR95 by 7.05% and improves the AUROC by 1.71% compared to previous methods. Additionally, comprehensive ablation studies validate the effectiveness of each module and the underlying motivations of our approach.
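
The retrieval step might look like the sketch below: score an external vocabulary of descriptive words against in-distribution class-name embeddings and keep the top-k as outlier descriptors for the OOD prompts. The encoder, the vocabulary, and the exact joint-similarity scoring rule are assumptions made for illustration.

```python
# Retrieve candidate outlier-descriptor words by cosine similarity between a
# CLIP-style external vocabulary and ID class-name embeddings.
import torch
import torch.nn.functional as F

def retrieve_outlier_words(word_emb, class_emb, k=50):
    """word_emb: (V, d) external vocab; class_emb: (C, d) ID class names."""
    word_emb = F.normalize(word_emb, dim=-1)
    class_emb = F.normalize(class_emb, dim=-1)
    sim = word_emb @ class_emb.T          # (V, C) cosine similarities
    # A mean over classes is one simple stand-in for the paper's joint
    # similarity; words scoring highest become OOD prompt descriptors.
    score = sim.mean(dim=-1)
    return score.topk(k).indices          # indices into the vocabulary
```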

[156] PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks

Xinhao Wang, Zhiwei Lin, Zhongyu Xia, Yongtao Wang

Main category: cs.CV

TL;DR: PTQAT is a hybrid quantization method combining PTQ and QAT, improving efficiency and accuracy by selectively fine-tuning critical layers.

DetailsMotivation: Address the trade-off between PTQ's speed and QAT's accuracy, reducing GPU memory and training time while maintaining performance.

Method: Selects critical layers for QAT fine-tuning and applies PTQ to others, focusing on layers with smaller output discrepancies.

Result: Achieves similar performance to QAT with 50% fewer fine-tuned layers, outperforming QAT-only baselines in 3D perception tasks.

Conclusion: PTQAT is an efficient, universal quantization method for diverse models and tasks, balancing accuracy and resource usage.

Abstract: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight fine-tuning. In this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed-accuracy trade-off between PTQ and QAT, our method selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers. Contrary to intuition, fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model’s quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. The proposed PTQAT achieves similar performance to QAT with more efficiency by freezing nearly 50% of quantifiable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (4 bits) as well as different model architectures, including CNNs and Transformers. The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT-only baselines. Notably, it achieves 0.2%-0.9% NDS and 0.3%-1.0% mAP gains in object detection, 0.3%-2.0% mIoU gains in semantic segmentation and occupancy prediction while fine-tuning fewer weights.
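
The selection rule reads naturally as the sketch below: measure each layer's output discrepancy between the full-precision and quantized models on calibration data, then route the lowest-discrepancy half to QAT fine-tuning and leave the rest in PTQ. The activation-capture helper and the MSE discrepancy metric are assumptions, not the paper's exact recipe.

```python
# PTQAT-style layer split: layers whose outputs change *least* under
# quantization go to QAT fine-tuning; the rest stay frozen under PTQ.
import torch

@torch.no_grad()
def layer_discrepancies(fp_model, q_model, layer_names, calib_batch, get_act):
    """get_act(model, name, x) -> layer activation; an assumed hook helper."""
    return {name: torch.mean((get_act(fp_model, name, calib_batch) -
                              get_act(q_model, name, calib_batch)) ** 2).item()
            for name in layer_names}

def split_layers(disc, qat_fraction=0.5):
    ranked = sorted(disc, key=disc.get)       # ascending output discrepancy
    cut = int(len(ranked) * qat_fraction)
    return ranked[:cut], ranked[cut:]         # (QAT layers, PTQ layers)
```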

[157] HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng

Main category: cs.CV

TL;DR: HM-Talker improves talking head video generation by combining implicit and explicit motion cues, addressing motion blur and lip jitter issues.

DetailsMotivation: Current methods lack explicit articulatory priors, leading to motion blur and lip jitter in videos.

Method: HM-Talker uses a hybrid motion representation with implicit/explicit cues, a Cross-Modal Disentanglement Module (CMDM), and a Hybrid Motion Modeling Module (HMMM).

Result: HM-Talker outperforms state-of-the-art methods in visual quality and lip-sync accuracy.

Conclusion: The framework advances personalized talking head synthesis by robustly synchronizing lip movements across diverse identities.

Abstract: Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations–an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker’s superiority over state-of-the-art methods in visual quality and lip-sync accuracy.

[158] SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous Driving

Philipp Wolters, Johannes Gilg, Torben Teepe, Gerhard Rigoll

Main category: cs.CV

TL;DR: SpaRC-AD is a query-based end-to-end camera-radar fusion framework for autonomous driving, improving performance in 3D detection, tracking, mapping, motion prediction, and planning over vision-only methods.

DetailsMotivation: Vision-based autonomous driving systems struggle in adverse weather, occlusions, and velocity estimation, limiting safety-critical performance.

Method: SpaRC-AD uses sparse 3D feature alignment and doppler-based velocity estimation for robust 3D scene representations, refining agent anchors, map polylines, and motion modeling.

Result: The method outperforms vision-only baselines: +4.8% mAP in 3D detection, +8.3% AMOTA in tracking, +1.8% mAP in mapping, -4.0% mADE in motion prediction, and -0.1m L2/-9% TPC in planning.

Conclusion: Radar-camera fusion in SpaRC-AD enhances spatial coherence and temporal consistency, proving effective for safety-critical scenarios requiring accurate motion understanding and collision avoidance.

Abstract: End-to-end autonomous driving systems promise stronger performance through unified optimization of perception, motion forecasting, and planning. However, vision-based approaches face fundamental limitations in adverse weather conditions, partial occlusions, and precise velocity estimation - critical challenges in safety-sensitive scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. To address these limitations, we propose SpaRC-AD, a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. Through sparse 3D feature alignment and Doppler-based velocity estimation, we achieve strong 3D scene representations for refinement of agent anchors, map polylines and motion modelling. Our method achieves strong improvements over the state-of-the-art vision-only baselines across multiple autonomous driving tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). We achieve both spatial coherence and temporal consistency on multiple challenging benchmarks, including real-world open-loop nuScenes, long-horizon T-nuScenes, and closed-loop simulator Bench2Drive. We show the effectiveness of radar-based fusion in safety-critical scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. The source code of all experiments is available at https://phi-wol.github.io/sparcad/.

[159] Adapting SAM via Cross-Entropy Masking for Class Imbalance in Remote Sensing Change Detection

Humza Naveed, Xina Zeng, Mitch Bryson, Nagita Mehrseresht

Main category: cs.CV

TL;DR: The paper adapts the Segment Anything Model (SAM) for remote sensing change detection (RSCD) using spatial-temporal feature enhancement (STFE), multi-scale decoder fusion (MSDF), and a novel cross-entropy masking (CEM) loss to handle class imbalance. It outperforms SOTA methods on four datasets.

DetailsMotivation: To leverage foundational models like SAM for robust change detection in remote sensing, addressing challenges like multi-scale detection and class imbalance.

Method: Fine-tuning SAM encoder for RSCD, incorporating STFE and MSDF for multi-scale robustness, and introducing CEM loss for handling class imbalance.

Result: Outperforms SOTA on four datasets (Levir-CD, WHU-CD, CLCD, S2Looking), with a 2.5% F1-score improvement on S2Looking.

Conclusion: The proposed method effectively adapts SAM for RSCD, demonstrating superior performance and robustness in change detection tasks.

Abstract: Foundational models have achieved significant success in diverse domains of computer vision. They learn general representations that are easily transferable to tasks not seen during training. One such foundational model is the Segment Anything Model (SAM), which can accurately segment objects in images. We propose adapting the SAM encoder via fine-tuning for remote sensing change detection (RSCD) along with spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF) to detect changes robustly at multiple scales. Additionally, we propose a novel cross-entropy masking (CEM) loss to handle high class imbalance in change detection datasets. Our method outperforms state-of-the-art (SOTA) methods on four change detection datasets: Levir-CD, WHU-CD, CLCD, and S2Looking. We achieved a 2.5% F1-score improvement on the large and complex S2Looking dataset. The code is available at https://github.com/humza909/SAM-CEM-CD.
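
The abstract does not spell out the CEM formulation, but one plausible reading is sketched below: keep all (rare) change pixels in the cross-entropy average while masking out all but the hardest background pixels. The keep-ratio and the hardness rule are assumptions.

```python
# A plausible cross-entropy masking loss for class-imbalanced change
# detection: retain every change pixel, but average only the hardest
# fraction of background pixels. Not the paper's exact definition.
import torch
import torch.nn.functional as F

def cem_loss(logits, target, bg_keep_ratio=0.1):
    """logits: (B, 2, H, W); target: (B, H, W) with 1 = change, 0 = background."""
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    change = target == 1
    bg_losses = per_pixel[~change]
    k = max(1, int(bg_keep_ratio * bg_losses.numel()))
    hard_bg = bg_losses.topk(k).values          # hardest background pixels
    return torch.cat([per_pixel[change], hard_bg]).mean()
```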

[160] Towards Agentic AI for Multimodal-Guided Video Object Segmentation

Tuyen Tran, Thao Minh Le, Truyen Tran

Main category: cs.CV

TL;DR: The paper introduces Multi-Modal Agent, a flexible and adaptive system for Referring-based Video Object Segmentation (VOS), leveraging LLMs and specialized tools to outperform traditional methods.

DetailsMotivation: Traditional VOS methods are computationally complex and lack flexibility. Vision-language foundation models offer a training-free alternative, but existing pipelines are rigid.

Method: Proposes Multi-Modal Agent, using LLMs to generate dynamic workflows and interact with specialized tools for low-level tasks across modalities.

Result: Shows clear improvements over prior methods on RVOS and Ref-AVS tasks.

Conclusion: The agentic approach provides a more adaptable and effective solution for multimodal-conditioned VOS.

Abstract: Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Traditional approaches to this task typically involve training specialized models, which come with high computational complexity and manual annotation effort. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. Several studies have explored leveraging these general-purpose models for fine-grained segmentation, achieving performance comparable to that of fully supervised, task-specific models. However, existing methods rely on fixed pipelines that lack the flexibility needed to adapt to the dynamic nature of the task. To address this limitation, we propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner. Specifically, our method leverages the reasoning capabilities of large language models (LLMs) to generate dynamic workflows tailored to each input. This adaptive procedure iteratively interacts with a set of specialized tools designed for low-level tasks across different modalities to identify the target object described by the multimodal cues. Our agentic approach demonstrates clear improvements over prior methods on two multimodal-conditioned VOS tasks: RVOS and Ref-AVS.

[161] Fourier-Guided Attention Upsampling for Image Super-Resolution

Daejune Choi, Youchan No, Jinhyung Lee, Duksu Kim

Main category: cs.CV

TL;DR: FGA is a lightweight upsampling module for super-resolution, improving high-frequency detail reconstruction and reducing aliasing artifacts with minimal added parameters.

DetailsMotivation: Conventional upsamplers often fail to reconstruct high-frequency details and introduce aliasing artifacts, motivating the need for a better solution.

Method: FGA integrates Fourier feature-based MLP, cross-resolution Correlation Attention Layer, and frequency-domain L1 loss for improved spectral fidelity.

Result: FGA enhances performance across five backbones, achieving PSNR gains of 0.12–0.14 dB and up to 29% better frequency-domain consistency.

Conclusion: FGA is an effective, scalable alternative to traditional upsampling methods, reducing aliasing and preserving fine details.

Abstract: We propose Frequency-Guided Attention (FGA), a lightweight upsampling module for single image super-resolution. Conventional upsamplers, such as Sub-Pixel Convolution, are efficient but frequently fail to reconstruct high-frequency details and introduce aliasing artifacts. FGA addresses these issues by integrating (1) a Fourier feature-based Multi-Layer Perceptron (MLP) for positional frequency encoding, (2) a cross-resolution Correlation Attention Layer for adaptive spatial alignment, and (3) a frequency-domain L1 loss for spectral fidelity supervision. Adding merely 0.3M parameters, FGA consistently enhances performance across five diverse super-resolution backbones in both lightweight and full-capacity scenarios. Experimental results demonstrate average PSNR gains of 0.12–0.14 dB and improved frequency-domain consistency by up to 29%, particularly evident on texture-rich datasets. Visual and spectral evaluations confirm FGA’s effectiveness in reducing aliasing and preserving fine details, establishing it as a practical, scalable alternative to traditional upsampling methods.
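
Component (3), the frequency-domain L1 loss, can be sketched directly: compare the 2D FFT magnitudes of the super-resolved and ground-truth images. Any weighting or windowing the paper may apply on top of this is not reproduced here.

```python
# Frequency-domain L1 supervision: penalize magnitude differences between
# the FFTs of the super-resolved output and the ground-truth image.
import torch

def frequency_l1_loss(sr, hr):
    """sr, hr: (B, C, H, W) image tensors."""
    sr_mag = torch.fft.rfft2(sr, norm="ortho").abs()
    hr_mag = torch.fft.rfft2(hr, norm="ortho").abs()
    return (sr_mag - hr_mag).abs().mean()
```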

[162] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Yi Yuan, Jingdong Chen, Le Wang

Main category: cs.CV

TL;DR: HumanSense is a benchmark for evaluating MLLMs’ human-centered interaction capabilities, revealing gaps in advanced tasks and advocating for multimodal reinforcement learning to improve reasoning.

DetailsMotivation: Progress in MLLMs is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, requiring deep understanding and empathetic responses.

Method: HumanSense evaluates MLLMs using multimodal contexts. A multi-stage, modality-progressive reinforcement learning approach enhances reasoning in an Omni model.

Result: Leading MLLMs show room for improvement, especially in advanced tasks. Supplementing visual input with audio/text improves performance, and reasoning ability is key for feedback.

Conclusion: HumanSense highlights the need for better reasoning and multimodal integration in MLLMs, with reinforcement learning and prompt design offering promising improvements.

Abstract: While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor’s needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ a multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: https://digital-avatar.github.io/ai/HumanSense/

[163] Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking

Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Chunyang Cheng, Tao Zhou, Xiaojun Wu, Josef Kittler

Main category: cs.CV

TL;DR: The paper introduces UniBench300, a unified benchmark for multi-modal visual object tracking (MMVOT), addressing inconsistency and performance degradation by reformulating the unification process in a serial format and leveraging continual learning.

DetailsMotivation: Existing MMVOT practices suffer from inconsistency between training and testing due to the absence of a unified benchmark, leading to performance degradation.

Method: Proposes UniBench300, a unified benchmark, and reformulates the unification process serially, aligning it with continual learning principles.

Result: UniBench300 reduces inference time by 27% and shows the superiority of continual learning in stabilizing unification. Performance degradation is linked to network capacity and modality discrepancies.

Conclusion: The work highlights the importance of unified benchmarks and continual learning in MMVOT, offering insights for future multi-modal vision research.

Abstract: Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects: (1) a unified benchmark, coined UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%; (2) the unification process is reformulated in a serial format, progressively integrating new tasks. In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, dedicated analyses find the performance degradation to be negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT > RGBD > RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at https://github.com/Zhangyong-Tang/UniBench300.

[164] EvTurb: Event Camera Guided Turbulence Removal

Yixing Liu, Minggui Teng, Yifei Xia, Peiqi Duan, Boxin Shi

Main category: cs.CV

TL;DR: EvTurb is a novel event-guided framework for turbulence removal, using high-speed event streams to decouple blur and tilt distortions, outperforming existing methods.

DetailsMotivation: Atmospheric turbulence degrades image quality, complicating computer vision tasks. Existing methods struggle due to the complexity of turbulence-induced distortions.

Method: EvTurb uses a two-step event-guided network: event integrals reduce blur, and a variance map from raw events eliminates tilt distortion.

Result: EvTurb outperforms state-of-the-art methods and maintains computational efficiency.

Conclusion: EvTurb effectively addresses turbulence-induced distortions, aided by the TurbEvent dataset, showcasing superior performance.

Abstract: Atmospheric turbulence degrades image quality by introducing blur and geometric tilt distortions, posing significant challenges to downstream computer vision tasks. Existing single-image and multi-frame methods struggle with the highly ill-posed nature of this problem due to the compositional complexity of turbulence-induced distortions. To address this, we propose EvTurb, an event-guided turbulence removal framework that leverages high-speed event streams to decouple blur and tilt effects. EvTurb models event-based turbulence formation through a novel two-step event-guided network: event integrals are first employed to reduce blur in the coarse outputs, and a variance map, derived from raw event streams, is then used to eliminate the tilt distortion in the refined outputs. Additionally, we present TurbEvent, the first real-captured dataset featuring diverse turbulence scenarios. Experimental results demonstrate that EvTurb surpasses state-of-the-art methods while maintaining computational efficiency.

[165] Towards Powerful and Practical Patch Attacks for 2D Object Detection in Autonomous Driving

Yuxin Cao, Yedi Zhang, Wentao He, Yifan Liao, Yan Xiao, Chang Li, Zhiyong Huang, Jin Song Dong

Main category: cs.CV

TL;DR: P³A is a new adversarial patch attack framework for autonomous driving, improving attack effectiveness and transferability on high-resolution datasets with novel metrics and loss functions.

DetailsMotivation: Existing black-box attacks on autonomous driving systems overestimate effectiveness due to lenient metrics and fail on high-resolution data, posing safety risks.

Method: Introduces Practical Attack Success Rate (PASR), Localization-Confidence Suppression Loss (LCSL), and Probabilistic Scale-Preserving Padding (PSPP) for better attack performance.

Result: P³A outperforms state-of-the-art attacks on unseen models and high-resolution datasets under both PASR and mAP metrics.

Conclusion: P$^3$A addresses limitations of prior attacks, offering a more practical and powerful solution for adversarial threats in autonomous driving.

Abstract: Learning-based autonomous driving systems remain critically vulnerable to adversarial patches, posing serious safety and security risks in their real-world deployment. Black-box attacks, notable for their high attack success rate without model knowledge, are especially concerning, with their transferability extensively studied to reduce computational costs compared to query-based attacks. Previous transferability-based black-box attacks typically adopt mean Average Precision (mAP) as the evaluation metric and design training loss accordingly. However, due to the presence of multiple detected bounding boxes and the relatively lenient Intersection over Union (IoU) thresholds, the attack effectiveness of these approaches is often overestimated, resulting in reduced success rates in practical attacking scenarios. Furthermore, patches trained on low-resolution data often fail to maintain effectiveness on high-resolution images, limiting their transferability to autonomous driving datasets. To fill this gap, we propose P³A, a Powerful and Practical Patch Attack framework for 2D object detection in autonomous driving, specifically optimized for high-resolution datasets. First, we introduce a novel metric, Practical Attack Success Rate (PASR), to more accurately quantify attack effectiveness with greater relevance for pedestrian safety. Second, we present a tailored Localization-Confidence Suppression Loss (LCSL) to improve attack transferability under PASR. Finally, to maintain the transferability for high-resolution datasets, we further incorporate the Probabilistic Scale-Preserving Padding (PSPP) into the patch attack pipeline as a data preprocessing step. Extensive experiments show that P³A outperforms state-of-the-art attacks on unseen models and unseen high-resolution datasets, both under the proposed practical IoU-based evaluation metric and the previous mAP-based metrics.

[166] AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models

Shixiong Xu, Chenghao Zhang, Lubin Fan, Yuan Zhou, Bin Fan, Shiming Xiang, Gaofeng Meng, Jieping Ye

Main category: cs.CV

TL;DR: AddressVLM enhances LVLMs for fine-grained street-level localization by integrating satellite-view and street-view images, improving accuracy by 9-12%.

DetailsMotivation: LVLMs perform poorly in fine-grained street-level localization due to limited visual cues in street-view VQA data.

Method: Proposes cross-view alignment tuning with satellite and street-view images, and a two-stage training protocol (alignment and localization tuning).

Result: AddressVLM outperforms other LVLMs by 9% and 12% in accuracy on Pittsburgh and San Francisco datasets.

Conclusion: Integrating macro and micro visual cues significantly improves fine-grained address localization in LVLMs.

Abstract: Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that the street-view visual question-and-answer (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning including a satellite-view and street-view image grafting mechanism, along with an automatic label generation mechanism. Then the LVLM’s global understanding of street distribution is enhanced through cross-view matching. Our proposed model, named AddressVLM, consists of two-stage training protocols: cross-view alignment tuning and address localization tuning. Furthermore, we have constructed two street-view VQA datasets based on image address localization datasets from Pittsburgh and San Francisco. Qualitative and quantitative evaluations demonstrate that AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy on these two datasets, respectively.

[167] Increasing the Utility of Synthetic Images through Chamfer Guidance

Nicola Dall’Asen, Xiaofeng Zhang, Reyhane Askari Hemmat, Melissa Hall, Jakob Verbeek, Adriana Romero-Soriano, Michal Drozdzal

Main category: cs.CV

TL;DR: Chamfer Guidance improves synthetic data diversity and quality using real exemplars, achieving state-of-the-art performance with minimal real images and computational efficiency.

DetailsMotivation: Address the trade-off between generation quality and diversity in conditional image generative models, and mitigate distribution shift between synthetic and real data.

Method: Introduces Chamfer Guidance, a training-free approach leveraging real exemplar images to characterize synthetic data quality and diversity.

Result: Boosts diversity and quality on benchmarks, achieves high precision (96.4%-97.5%) and coverage (86.4%-92.7%), and improves downstream classifier accuracy by up to 16%.

Conclusion: Chamfer Guidance is effective for enhancing synthetic data utility, offering computational efficiency and strong performance with minimal real data.

Abstract: Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as little as 2 exemplar real images, obtaining 96.4% in terms of precision, and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving accuracy boosts of up to 15% in-distribution over the baselines, and up to 16% out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
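
The quantity being steered on can be sketched as a symmetric Chamfer distance between features of a synthetic batch and features of the few real exemplars; the feature extractor and how this score enters the sampler are assumptions, so the snippet shows only the distance itself.

```python
# Symmetric Chamfer distance in feature space: each synthetic sample is
# scored against its nearest real exemplar and vice versa.
import torch

def chamfer_distance(syn_feats, real_feats):
    """syn_feats: (N, d); real_feats: (M, d) from a few real exemplars."""
    d = torch.cdist(syn_feats, real_feats)        # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```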

[168] Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation

Feiran Li, Qianqian Xu, Shilong Bao, Boyu Han, Zhiyong Yang, Qingming Huang

Main category: cs.CV

TL;DR: The paper presents a method to create a high-quality, non-overlapping face dataset for training face recognition models, combining real and synthetic data, and achieves top competition results.

DetailsMotivation: To build a face dataset without overlapping identities with existing public datasets, ensuring diversity and quality for improved face recognition model performance.

Method: Cleans the HSFace dataset using MoE (clustering and GPT-4o verification), augments data, generates synthetic identities with Stable Diffusion and Vec2Face, and uses curriculum learning.

Result: Achieves 1st place in the DataCV ICCV Challenge, with improved model performance across various identity scales.

Conclusion: The hybrid approach of real and synthetic data efficiently constructs a diverse, high-quality dataset, enhancing face recognition model training.

Abstract: In this paper, we present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. To handle this challenge, we begin with a thorough cleaning of the baseline HSFace dataset, identifying and removing mislabeled or inconsistent identities through a Mixture-of-Experts (MoE) strategy combining face embedding clustering and GPT-4o-assisted verification. We retain the largest consistent identity cluster and apply data augmentation up to a fixed number of images per identity. To further diversify the dataset, we generate synthetic identities using Stable Diffusion with prompt engineering. As diffusion models are computationally intensive, we generate only one reference image per identity and efficiently expand it using Vec2Face, which rapidly produces 49 identity-consistent variants. This hybrid approach fuses GAN-based and diffusion-based samples, enabling efficient construction of a diverse and high-quality dataset. To address the high visual similarity among synthetic identities, we adopt a curriculum learning strategy by placing them early in the training schedule, allowing the model to progress from easier to harder samples. Our final dataset contains 50 images per identity, and all newly generated identities are checked with mainstream face datasets to ensure no identity leakage. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales. Code is available at https://github.com/Ferry-Li/datacv_fr.
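
The embedding-clustering half of the cleaning step might look like the sketch below: cluster one identity's face embeddings and keep only the largest cluster. DBSCAN and its parameters are assumptions, and the GPT-4o verification pass is omitted.

```python
# Per-identity cleanup: cluster face embeddings and retain the largest
# consistent cluster, discarding outlier/mislabeled images.
import numpy as np
from sklearn.cluster import DBSCAN

def keep_largest_cluster(embeddings, eps=0.5, min_samples=3):
    """embeddings: (N, d) L2-normalized face embeddings for one identity."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    valid = labels[labels != -1]
    if valid.size == 0:
        return np.array([], dtype=int)    # identity too noisy to keep
    keep = np.bincount(valid).argmax()    # label of the largest cluster
    return np.where(labels == keep)[0]    # indices of retained images
```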

[169] ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation

Hosam Elgendy, Ahmed Sharshar, Ahmed Aboeitta, Mohsen Guizani

Main category: cs.CV

TL;DR: ChatENV is an interactive vision-language model for environmental monitoring, combining satellite imagery and sensor data for improved reasoning and analysis.

DetailsMotivation: Addressing limitations in current VLMs, such as overlooking sensor data, reliance on biased captions, and lack of interactive reasoning for environmental changes.

Method: Develops a dataset of 177k images (152k temporal pairs) with sensor metadata, annotates using GPT-4o and Gemini 2.0, and fine-tunes Qwen-2.5-VL with LoRA adapters.

Result: Achieves strong performance in temporal and “what-if” reasoning (BERT-F1 0.903), rivaling state-of-the-art models.

Conclusion: ChatENV is a powerful, sensor-aware tool for environmental monitoring, supporting interactive scenario-based analysis.

Abstract: Understanding environmental changes from aerial imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and “what-if” reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
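
For readers unfamiliar with the LoRA adapters used to fine-tune the Qwen-2.5-VL backbone, here is a generic from-scratch sketch of a low-rank adapted linear layer; it illustrates the technique, not the authors' code, and all hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pretrained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)             # the pretrained weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus low-rank correction: W x + scale * B A x
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 1024))
```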

[170] EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering

Yanjun Li, Yuqian Fu, Tianwen Qian, Qi’ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, Xiaoling Wang

Main category: cs.CV

TL;DR: The paper introduces EgoCross, a benchmark to evaluate cross-domain generalization of MLLMs in EgocentricQA, highlighting limitations of current models and exploring potential improvements.

DetailsMotivation: Existing benchmarks for EgocentricQA are limited to daily activities, lacking evaluation for domain shifts in real-world scenarios.

Method: EgoCross covers four diverse domains (surgery, industry, extreme sports, animal perspective) with 1,000 QA pairs across 798 videos, supporting OpenQA and CloseQA formats.

Result: Most MLLMs struggle with cross-domain generalization, revealing current model limitations. Pilot studies explore fine-tuning and reinforcement learning for improvements.

Conclusion: EgoCross aims to advance domain-adaptive, robust egocentric video understanding, with data and code publicly released.

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly pushed the frontier of egocentric video question answering (EgocentricQA). However, existing benchmarks and studies are mainly limited to common daily activities such as cooking and cleaning. In contrast, real-world deployment inevitably encounters domain shifts, where target domains differ substantially in both visual style and semantic content. To bridge this gap, we introduce EgoCross, a comprehensive benchmark designed to evaluate the cross-domain generalization of MLLMs in EgocentricQA. EgoCross covers four diverse and challenging domains, including surgery, industry, extreme sports, and animal perspective, representing realistic and high-impact application scenarios. It comprises approximately 1,000 QA pairs across 798 video clips, spanning four key QA tasks: prediction, recognition, localization, and counting. Each QA pair provides both OpenQA and CloseQA formats to support fine-grained evaluation. Extensive experiments show that most existing MLLMs, whether general-purpose or egocentric-specialized, struggle to generalize to domains beyond daily life, highlighting the limitations of current models. Furthermore, we conduct several pilot studies, e.g., fine-tuning and reinforcement learning, to explore potential improvements. We hope EgoCross and our accompanying analysis will serve as a foundation for advancing domain-adaptive, robust egocentric video understanding. Data and codes will be released at: https://github.com/MyUniverse0726/EgoCross

[171] Processing and acquisition traces in visual encoders: What does CLIP know about your camera?

Ryan Ramos, Vladan Stojnić, Giorgos Kordopatis-Zilos, Yuta Nakashima, Giorgos Tolias, Noa Garcia

Main category: cs.CV

TL;DR: The paper examines how subtle, often imperceptible image acquisition parameters affect visual encoders, revealing their systematic encoding and impact on semantic predictions.

DetailsMotivation: To explore overlooked subtle image transformations and their influence on visual encoders, beyond severe corruptions.

Method: Analyzes parameters of image acquisition and subtle transformations, assessing their encoding in visual representations and impact on predictions.

Result: Subtle parameters are systematically encoded and can significantly affect semantic predictions, depending on their correlation with labels.

Conclusion: Subtle image transformations can profoundly influence visual encoders, highlighting the need to account for them in model robustness.

Abstract: Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positively or negatively, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces
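
A probing experiment of the kind the paper describes can be sketched as a linear probe on frozen embeddings; the random features and labels below are stand-ins for real encoder outputs and acquisition-parameter bins, so this shows only the shape of the experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 768))     # stand-in for frozen encoder embeddings
labels = rng.integers(0, 4, size=2000)   # stand-in acquisition-parameter bins

X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Accuracy well above chance would indicate the parameter is encoded in the features.
print("probe accuracy:", probe.score(X_te, y_te))
```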

[172] Lameness detection in dairy cows using pose estimation and bidirectional LSTMs

Helena Russello, Rik van der Tol, Eldert J. van Henten, Gert Kootstra

Main category: cs.CV

TL;DR: A lameness detection method using pose estimation and BLSTM neural networks achieves 85% accuracy, outperforming manual feature-based approaches and working with minimal video data.

DetailsMotivation: To improve lameness detection in cows by eliminating manual feature engineering and leveraging temporal motion features from keypoint trajectories.

Method: Combines T-LEAP pose estimation to extract keypoint trajectories from cow videos, then uses a BLSTM classifier for binary lameness classification.

Result: Achieved 85% accuracy, surpassing the 80% accuracy of manual feature-based methods, and worked with just one second of video.

Conclusion: The proposed method is effective, efficient, and scalable for lameness detection in cows.

Abstract: This study presents a lameness detection approach that combines pose estimation and Bidirectional Long Short-Term Memory (BLSTM) neural networks. Combining pose estimation and a BLSTM classifier offers the following advantages: markerless pose estimation, elimination of manual feature engineering by learning temporal motion features from the keypoint trajectories, and the ability to work with short sequences and small training datasets. Motion sequences of nine keypoints (located on the cows’ hooves, head and back) were extracted from videos of walking cows with the T-LEAP pose estimation model. The trajectories of the keypoints were then used as an input to a BLSTM classifier that was trained to perform binary lameness classification. Our method significantly outperformed an established method that relied on manually-designed locomotion features: our best architecture achieved a classification accuracy of 85%, against 80% accuracy for the feature-based approach. Furthermore, we showed that our BLSTM classifier could detect lameness with as little as one second of video data.
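
The pipeline lends itself to a compact sketch: a bidirectional LSTM over precomputed keypoint trajectories (T-LEAP keypoints are assumed given) with a binary head. Shapes and hyperparameters below are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LamenessBLSTM(nn.Module):
    def __init__(self, n_keypoints: int = 9, hidden: int = 64):
        super().__init__()
        # each frame is the flattened (x, y) coordinates of the keypoints
        self.blstm = nn.LSTM(input_size=n_keypoints * 2, hidden_size=hidden,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)    # binary lame / not-lame logit

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        """traj: (batch, frames, n_keypoints * 2) keypoint trajectories."""
        out, _ = self.blstm(traj)
        return self.head(out[:, -1])            # classify from the final time step

model = LamenessBLSTM()
logits = model(torch.randn(4, 25, 18))          # roughly one second of video at 25 fps
```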

[173] SemPT: Semantic Prompt Tuning for Vision-Language Models

Xiao Shi, Yangjun Ou, Zhenzhong Chen

Main category: cs.CV

TL;DR: Semantic Prompt Tuning (SemPT) improves visual transfer learning by leveraging shared attribute-level knowledge, enhancing generalization for unseen categories.

DetailsMotivation: Addressing the conflict between preserving category-specific representations and acquiring transferable knowledge in visual transfer learning.

Method: SemPT uses a two-step prompting strategy to extract shared visual attributes and generate attribute-level descriptions, followed by visually guided weighting and joint alignment of embeddings.

Result: Achieves state-of-the-art performance on 15 benchmark datasets in base-to-novel generalization, cross-dataset/domain transfer, and few-shot learning.

Conclusion: SemPT effectively balances discrimination and transferability, outperforming existing methods in diverse settings.

Abstract: Visual transfer learning for unseen categories remains an active research topic and a challenging task, due to the inherent conflict between preserving category-specific representations and acquiring transferable knowledge. Vision-Language Models (VLMs) pre-trained on large amounts of image-text pairs offer a promising solution. However, existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragment knowledge representation and hinder transferability. To address this limitation, we introduce Semantic Prompt Tuning (SemPT), a novel framework that tackles the generalization challenge by leveraging shared attribute-level knowledge across categories. Specifically, SemPT adopts a two-step prompting strategy to guide an LLM in extracting shared visual attributes and generating attribute-level descriptions, capturing transferable semantic cues beyond labels while ensuring coherent structure. Then, visually guided weighting is applied to the embeddings of attribute-level descriptions to reduce noise from irrelevant attributes and enhance the text embeddings. Additionally, image embeddings are jointly aligned with both label and attribute-enhanced text embeddings, balancing discrimination for seen categories and transferability to unseen ones. Considering the availability of category exposure, our inference dynamically selects between standard label embeddings for seen categories and attribute-enhanced embeddings for unseen ones to ensure effective adaptation. Extensive experiments on 15 benchmark datasets demonstrate that SemPT achieves state-of-the-art performance across various settings, including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning.
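
The "visually guided weighting" step can be sketched as reweighting attribute-description embeddings by their similarity to the image embedding before pooling them into an attribute-enhanced class embedding; the temperature and all tensors below are stand-ins, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attribute_enhanced_embedding(img_emb: torch.Tensor, attr_embs: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """img_emb: (D,) image embedding; attr_embs: (A, D) attribute-description embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    attr_embs = F.normalize(attr_embs, dim=-1)
    weights = F.softmax(attr_embs @ img_emb / temperature, dim=0)  # downweight irrelevant attributes
    return F.normalize((weights[:, None] * attr_embs).sum(dim=0), dim=-1)

enhanced = attribute_enhanced_embedding(torch.randn(512), torch.randn(12, 512))
```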

[174] AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences

Jieyu Li, Xin Zhang, Joey Tianyi Zhou

Main category: cs.CV

TL;DR: AEGIS is a large-scale benchmark for detecting hyper-realistic AI-generated videos, addressing gaps in existing benchmarks with diverse, challenging content and multimodal annotations.

DetailsMotivation: The rise of realistic synthetic videos threatens societal trust and digital integrity, but current benchmarks lack realism, scale, and complexity to evaluate modern detection models effectively.

Method: AEGIS includes over 10,000 real and synthetic videos from state-of-the-art models, with challenging subsets and multimodal annotations (Semantic-Authenticity Descriptions, Motion Features, Low-level Visual Features).

Result: Experiments show limited detection capabilities of advanced vision-language models on AEGIS’s challenging subsets, highlighting its complexity and realism.

Conclusion: AEGIS advances research by providing a robust benchmark for developing reliable video authenticity detection methods to counter real-world forgery threats.

Abstract: Recent advances in AI-generated content have fueled the rise of highly realistic synthetic videos, posing severe risks to societal trust and digital integrity. Existing benchmarks for video authenticity detection typically suffer from limited realism, insufficient scale, and inadequate complexity, failing to effectively evaluate modern vision-language models against sophisticated forgeries. To address this critical gap, we introduce AEGIS, a novel large-scale benchmark explicitly targeting the detection of hyper-realistic and semantically nuanced AI-generated videos. AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models, including Stable Video Diffusion, CogVideoX-5B, KLing, and Sora, encompassing open-source and proprietary architectures. In particular, AEGIS features specially constructed challenging subsets enhanced with robustness evaluation. Furthermore, we provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features, facilitating authenticity detection and supporting downstream tasks such as multimodal fusion and forgery localization. Extensive experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS, highlighting the dataset’s unique complexity and realism beyond the current generalization capabilities of existing models. In essence, AEGIS establishes an indispensable evaluation benchmark, fundamentally advancing research toward developing genuinely robust, reliable, broadly generalizable video authenticity detection methodologies capable of addressing real-world forgery threats. Our dataset is available on https://huggingface.co/datasets/Clarifiedfish/AEGIS.

[175] HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection

Zhaoyuan Qi, Weihua Gao, Wenlong Niu, Jie Tang, Yun Li, Xiaodong Peng

Main category: cs.CV

TL;DR: HyperTea integrates CNNs, RNNs, and HGNNs to enhance multi-timescale feature representation for moving infrared small target detection (MIRSTD), achieving SOTA performance.

DetailsMotivation: Existing MIRSTD methods lack high-order correlation modeling and multi-timescale feature enhancement, limiting detection performance.

Method: HyperTea combines global (GTEM) and local (LTEM) temporal enhancement modules with a temporal alignment module (TAM) to model high-order spatiotemporal correlations.

Result: Experiments on DAUB and IRDST datasets show HyperTea outperforms existing methods.

Conclusion: HyperTea advances MIRSTD by leveraging hypergraphs and multi-timescale feature enhancement, offering improved detection accuracy.

Abstract: In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target’s small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) is designed to capture local motion patterns between adjacent frames and then enhance local temporal context; additionally, we further develop a temporal alignment module (TAM) to address potential cross-scale feature misalignment. To the best of our knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source codes are available at https://github.com/Lurenjia-LRJ/HyperTea.

[176] Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang

Main category: cs.CV

TL;DR: BLADE is a data-free joint training framework combining Adaptive Block-Sparse Attention and sparsity-aware step distillation to accelerate diffusion transformers for video generation, achieving significant speedups and quality improvements.

DetailsMotivation: Addressing the slow iterative denoising and high computational costs of diffusion transformers in video generation by integrating step distillation and sparse attention without expensive data.

Method: Proposes BLADE with (1) Adaptive Block-Sparse Attention for dynamic sparsity masks and (2) sparsity-aware step distillation via Trajectory Distribution Matching.

Result: Achieves 14.10x speedup on Wan2.1-1.3B and 8.89x on CogVideoX-5B, with improved VBench-2.0 scores (e.g., 0.569 from 0.534).

Conclusion: BLADE effectively combines acceleration strategies, offering efficiency and quality gains for diffusion-based video generation.

Abstract: Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges – training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that directly incorporates sparsity into the distillation process rather than treating it as a separate compression step, with fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Our code and model weights are publicly available at: http://ziplab.co/BLADE-Homepage/.
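
To give a flavor of block-sparse attention, the sketch below keeps the top-k key blocks per query block by mean attention score; this selection rule is a simplified stand-in for BLADE's learned, content-aware Adaptive Block-Sparse Attention masks, and all parameters are illustrative.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block: int = 64, keep_ratio: float = 0.25):
    """q, k, v: (seq, dim); seq must be divisible by block."""
    scores = q @ k.T / q.shape[-1] ** 0.5                                 # (seq, seq)
    nb = q.shape[0] // block
    block_scores = scores.reshape(nb, block, nb, block).mean(dim=(1, 3))  # (nb, nb)
    k_keep = max(1, int(keep_ratio * nb))
    topk = block_scores.topk(k_keep, dim=-1).indices                      # kept blocks per row
    mask = torch.full_like(scores, float("-inf"))
    for i in range(nb):                                                   # unmask selected blocks
        for j in topk[i]:
            mask[i * block:(i + 1) * block, j * block:(j + 1) * block] = 0.0
    return F.softmax(scores + mask, dim=-1) @ v

out = block_sparse_attention(torch.randn(256, 64), torch.randn(256, 64), torch.randn(256, 64))
```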

[177] Physics-Informed Joint Multi-TE Super-Resolution with Implicit Neural Representation for Robust Fetal T2 Mapping

Busra Bulut, Maik Dannecker, Thomas Sanchez, Sara Neves Silva, Vladyslav Zalevskyi, Steven Jia, Jean-Baptiste Ledoux, Guillaume Auzias, François Rousseau, Jana Hutter, Daniel Rueckert, Meritxell Bach Cuadra

Main category: cs.CV

TL;DR: A method for T2 mapping in fetal brain MRI at 0.55T combines implicit neural representations with physics-informed regularization to address motion challenges and reduce scan time.

DetailsMotivation: Improving T2 mapping in fetal brain MRI at mid-field (0.55T) is challenging due to motion-corrupted data and long scan times.

Method: Joint reconstruction across echo times (TEs) using implicit neural representations and physics-informed regularization to model T2 decay.

Result: State-of-the-art performance on simulated and in vivo datasets, with the first in vivo fetal T2 mapping results at 0.55T.

Conclusion: The method reduces scan time by leveraging anatomical redundancy and shows promise for fetal brain MRI.

Abstract: T2 mapping in fetal brain MRI has the potential to improve characterization of the developing brain, especially at mid-field (0.55T), where T2 decay is slower. However, this is challenging as fetal MRI acquisition relies on multiple motion-corrupted stacks of thick slices, requiring slice-to-volume reconstruction (SVR) to estimate a high-resolution (HR) 3D volume. Currently, T2 mapping involves repeated acquisitions of these stacks at each echo time (TE), leading to long scan times and high sensitivity to motion. We tackle this challenge with a method that jointly reconstructs data across TEs, addressing severe motion. Our approach combines implicit neural representations with a physics-informed regularization that models T2 decay, enabling information sharing across TEs while preserving anatomical and quantitative T2 fidelity. We demonstrate state-of-the-art performance on simulated fetal brain and in vivo adult datasets with fetal-like motion. We also present the first in vivo fetal T2 mapping results at 0.55T. Our study shows potential for reducing the number of stacks per TE in T2 mapping by leveraging anatomical redundancy.
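
The physics-informed term follows directly from the mono-exponential decay model S(TE) = S0 · exp(−TE/T2); the sketch below penalizes deviations of a reconstructed multi-TE signal from that model, with the implicit-neural-representation network abstracted away and all shapes illustrative.

```python
import torch

def t2_decay_loss(signals: torch.Tensor, tes: torch.Tensor,
                  s0: torch.Tensor, t2: torch.Tensor) -> torch.Tensor:
    """signals: (V, E) reconstructed intensities for V voxels at E echo times;
    tes: (E,) echo times in ms; s0, t2: (V,) per-voxel decay parameters."""
    model = s0[:, None] * torch.exp(-tes[None, :] / t2[:, None].clamp_min(1e-6))
    return ((signals - model) ** 2).mean()       # penalize deviation from T2 decay

loss = t2_decay_loss(torch.rand(100, 4), torch.tensor([80., 120., 160., 200.]),
                     torch.rand(100), 50. + 100. * torch.rand(100))
```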

[178] IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning

Mengyang Zhao, Teng Fu, Haiyang Yu, Ke Niu, Bin Li

Main category: cs.CV

TL;DR: The paper introduces IADGPT, a framework for Few-Shot Industrial Anomaly Detection (FS-IAD), addressing limitations of existing LVLMs by mimicking human-like reasoning and localization.

DetailsMotivation: Existing LVLMs lack industrial knowledge and reasoning for FS-IAD, falling short of human inspectors. The goal is to bridge this gap with a specialized framework.

Method: A three-stage training strategy: acquiring industrial knowledge, discrepancy awareness, and in-context learning for generalization. Uses logits and attention maps for anomaly scoring.

Result: IADGPT shows significant gains in anomaly detection and performs well in localization and reasoning, validated on a new 100K-image dataset.

Conclusion: IADGPT advances FS-IAD by combining human-like reasoning with LVLMs, supported by a comprehensive dataset.

Abstract: Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed with some achievements through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks but lack basic industrial knowledge and reasoning capabilities related to FS-IAD, making these methods far from specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage a few-shot image as the exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. We will release our dataset with the camera-ready version.

[179] Ultra-High-Definition Reference-Based Landmark Image Super-Resolution with Generative Diffusion Prior

Zhenning Shi, Zizheng Yan, Yuhang Yu, Clara Xue, Jingyu Zhuang, Qi Zhang, Jinwei Chen, Tao Li, Qingnan Fan

Main category: cs.CV

TL;DR: TriFlowSR is a novel framework for Reference-based Image Super-Resolution (RefSR) that improves alignment between LR and reference HR images, introduces a UHD dataset (Landmark-4K), and outperforms existing methods.

DetailsMotivation: Existing RefSR methods struggle with alignment and rely on low-quality datasets, limiting high-quality restoration.

Method: TriFlowSR uses a Reference Matching Strategy for better pattern matching and introduces the Landmark-4K dataset for UHD scenarios.

Result: TriFlowSR outperforms previous methods in utilizing semantic and texture information from reference HR images.

Conclusion: TriFlowSR is the first diffusion-based RefSR pipeline for UHD landmark scenarios with real-world degradation, offering improved performance.

Abstract: Reference-based Image Super-Resolution (RefSR) aims to restore a low-resolution (LR) image by utilizing the semantic and texture information from an additional reference high-resolution (reference HR) image. Existing diffusion-based RefSR methods are typically built upon ControlNet, which struggles to effectively align the information between the LR image and the reference HR image. Moreover, current RefSR datasets suffer from limited resolution and poor image quality, resulting in the reference images lacking sufficient fine-grained details to support high-quality restoration. To overcome the limitations above, we propose TriFlowSR, a novel framework that explicitly achieves pattern matching between the LR image and the reference HR image. Meanwhile, we introduce Landmark-4K, the first RefSR dataset for Ultra-High-Definition (UHD) landmark scenarios. Considering the UHD scenarios with real-world degradation, in TriFlowSR, we design a Reference Matching Strategy to effectively match the LR image with the reference HR image. Experimental results show that our approach can better utilize the semantic and texture information of the reference HR image compared to previous methods. To the best of our knowledge, we propose the first diffusion-based RefSR pipeline for ultra-high definition landmark scenarios under real-world degradation. Our code and model will be available at https://github.com/nkicsl/TriFlowSR.

[180] Novel View Synthesis using DDIM Inversion

Sehajdeep Singh, A V Subramanyam

Main category: cs.CV

TL;DR: A lightweight framework for novel view synthesis from a single image using a pretrained diffusion model, avoiding expensive training and improving detail preservation.

DetailsMotivation: Existing methods for novel view synthesis are costly and produce blurry results, prompting the need for an efficient and high-quality solution.

Method: Uses a camera pose-conditioned U-Net (TUNet) to predict target view latents from DDIM inversion, with a novel fusion strategy to enhance detail.

Result: Outperforms existing methods on MVImgNet, achieving better texture and detail preservation.

Conclusion: The proposed framework effectively leverages pretrained diffusion models for high-quality novel view synthesis without extensive retraining.

Abstract: Synthesizing novel views from a single input image is a challenging task. It requires extrapolating the 3D structure of a scene while inferring details in occluded regions, and maintaining geometric consistency across viewpoints. Many existing methods must fine-tune large diffusion backbones using multiple views or train a diffusion model from scratch, which is extremely expensive. Additionally, they suffer from blurry reconstruction and poor generalization. This gap presents the opportunity to explore an explicit lightweight view translation framework that can directly utilize the high-fidelity generative capabilities of a pretrained diffusion model while reconstructing a scene from a novel view. Given the DDIM-inverted latent of a single input image, we employ a camera pose-conditioned translation U-Net, TUNet, to predict the inverted latent corresponding to the desired target view. However, the image sampled using the predicted latent may result in a blurry reconstruction. To this end, we propose a novel fusion strategy that exploits the inherent noise correlation structure observed in DDIM inversion. The proposed fusion strategy helps preserve the texture and fine-grained details. To synthesize the novel view, we use the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model. Extensive experiments on MVImgNet demonstrate that our method outperforms existing methods.
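
The DDIM inversion that produces TUNet's input latent follows the standard deterministic update; a single inversion step can be sketched as below, with the U-Net's noise prediction treated as given. This is the textbook formula, not the authors' code, and the schedule values are placeholders.

```python
import torch

def ddim_inversion_step(x_t: torch.Tensor, eps: torch.Tensor,
                        alpha_bar_t: torch.Tensor, alpha_bar_next: torch.Tensor) -> torch.Tensor:
    """One deterministic inversion step: re-derive x0, then move to the next noise level."""
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    return alpha_bar_next.sqrt() * x0_pred + (1 - alpha_bar_next).sqrt() * eps

x = torch.randn(1, 4, 64, 64)        # latent of the single input image
eps = torch.randn_like(x)            # stand-in for the U-Net's noise prediction at step t
x_next = ddim_inversion_step(x, eps, torch.tensor(0.98), torch.tensor(0.96))
```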

[181] Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios

Zhanwen Liu, Yujing Sun, Yang Wang, Nan Yang, Shengbo Eben Li, Xiangmo Zhao

Main category: cs.CV

TL;DR: MCFNet integrates event and RGB cameras for improved object detection in poor lighting, achieving superior performance via spatiotemporal alignment and adaptive fusion.

DetailsMotivation: RGB cameras struggle with dynamic range in complex traffic environments, losing details and degrading detection. Event cameras offer high dynamic range but require alignment and fusion with RGB data.

Method: Proposes MCFNet with event correction (ECM), dynamic upsampling (EDUM), and cross-modal fusion (CMM) modules for spatiotemporal alignment and adaptive feature fusion.

Result: Outperforms existing methods on DSEC-Det and PKU-DAVIS-SOD datasets, with 7.4% and 1.7% improvements in mAP50 and mAP metrics, respectively.

Conclusion: MCFNet effectively addresses RGB camera limitations by leveraging event data, enhancing object detection in challenging lighting and motion scenarios.

Abstract: The dynamic range limitation of conventional RGB cameras reduces global contrast and causes loss of high-frequency details such as textures and edges in complex traffic environments (e.g., nighttime driving, tunnels), hindering discriminative feature extraction and degrading frame-based object detection. To address this, we integrate a bio-inspired event camera with an RGB camera to provide high dynamic range information and propose a motion cue fusion network (MCFNet), which achieves optimal spatiotemporal alignment and adaptive cross-modal feature fusion under challenging lighting. Specifically, an event correction module (ECM) temporally aligns asynchronous event streams with image frames via optical-flow-based warping, jointly optimized with the detection network to learn task-aware event representations. The event dynamic upsampling module (EDUM) enhances spatial resolution of event frames to match image structures, ensuring precise spatiotemporal alignment. The cross-modal mamba fusion module (CMM) uses adaptive feature fusion with a novel interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments conducted on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods in various poor lighting and fast moving traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet achieves a remarkable improvement, surpassing the best existing methods by 7.4% in mAP50 and 1.7% in mAP metrics, respectively. The code is available at https://github.com/Charm11492/MCFNet.
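
The optical-flow-based warping used by the event correction module (ECM) can be sketched with standard backward warping via grid_sample; the flow itself is assumed to be supplied, and this generic routine stands in for the module's learned, jointly optimized version.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """frame: (B, C, H, W) accumulated event frame; flow: (B, 2, H, W) pixel displacements."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # absolute sample positions
    grid[:, 0] = 2 * grid[:, 0] / (w - 1) - 1    # normalize x to [-1, 1] for grid_sample
    grid[:, 1] = 2 * grid[:, 1] / (h - 1) - 1    # normalize y to [-1, 1]
    return F.grid_sample(frame, grid.permute(0, 2, 3, 1), align_corners=True)

warped = warp_with_flow(torch.rand(1, 1, 64, 64), torch.zeros(1, 2, 64, 64))  # zero flow: identity
```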

[182] CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation

Joohyeon Lee, Jin-Seop Lee, Jee-Hyong Lee

Main category: cs.CV

TL;DR: The paper introduces CountCluster, a method to improve object count accuracy in diffusion-based text-to-image models by clustering cross-attention maps early in the denoising process.

DetailsMotivation: Existing methods fail to accurately reflect the specified number of objects in generated images, particularly due to early-stage denoising issues.

Method: CountCluster partitions object cross-attention maps into clusters based on input count, optimizes latent alignment with an ideal distribution, and avoids external tools or training.

Result: Achieves an average improvement of 18.5 percentage points in object count accuracy over existing methods.

Conclusion: CountCluster effectively enhances quantity control in text-to-image generation without additional training or tools.

Abstract: Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. However, they still have limitations in accurately reflecting the specified number of objects and overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity, while each region should be clearly separated. To address this issue, we propose CountCluster, a method that guides the object cross-attention map to be clustered according to the specified object count in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into k clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Our method achieves an average improvement of 18.5 percentage points in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts. Code will be released at: https://github.com/JoohyeonL22/CountCluster.
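
The clustering step can be sketched as a weighted k-means over the spatial coordinates of the object's cross-attention map, one cluster per requested instance; the latent-optimization loss that aligns the clusters with an ideal separated layout is omitted, and the names below are illustrative.

```python
import torch
from sklearn.cluster import KMeans

def cluster_attention_map(attn: torch.Tensor, k: int):
    """attn: (H, W) cross-attention map of the object token; k: requested object count."""
    h, w = attn.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float().numpy()
    weights = attn.flatten().clamp_min(0).numpy()    # attention scores as sample weights
    km = KMeans(n_clusters=k, n_init=10).fit(coords, sample_weight=weights)
    return km.labels_.reshape(h, w), km.cluster_centers_

labels, centers = cluster_attention_map(torch.rand(32, 32), k=3)
```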

[183] Performance of GPT-5 in Brain Tumor MRI Reasoning

Mojtaba Safari, Shansong Wang, Mingzhe Hu, Zach Eidex, Qiang Li, Xiaofeng Yang

Main category: cs.CV

TL;DR: GPT-5 models were tested on a brain tumor VQA benchmark, with GPT-5-mini achieving the highest accuracy (44.19%), but none reached clinical acceptability.

DetailsMotivation: To evaluate the performance of LLMs (GPT-4o, GPT-5 variants) in differentiating brain tumor types on MRI using VQA.

Method: Models were tested on a curated VQA benchmark from BraTS datasets (GLI, MEN, MET) with multi-sequence MRI and clinical features, assessed in a zero-shot chain-of-thought setting.

Result: GPT-5-mini performed best (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype.

Conclusion: GPT-5 models show moderate accuracy in neuro-oncological VQA but are not yet clinically viable.

Abstract: Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). Each case included multi-sequence MRI triplanar mosaics and structured clinical features transformed into standardized VQA items. Models were assessed in a zero-shot chain-of-thought setting for accuracy on both visual and reasoning tasks. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype, with no single model dominating across all cohorts. These findings suggest that GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.

[184] NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yingming Wang, Yu Zhou, Yucheng Han, Ziyang Meng, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu

Main category: cs.CV

TL;DR: NextStep-1 is a 14B autoregressive model with a 157M flow matching head, achieving state-of-the-art text-to-image generation and strong image editing capabilities.

DetailsMotivation: Overcome limitations of heavy diffusion models and quantization loss in existing autoregressive text-to-image generation methods.

Method: Uses discrete text tokens and continuous image tokens with next-token prediction, paired with a flow matching head.

Result: State-of-the-art performance in text-to-image generation and strong image editing capabilities.

Conclusion: NextStep-1 demonstrates a powerful and versatile unified approach, with plans to release code and models for open research.

Abstract: Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.

[185] Medico 2025: Visual Question Answering for Gastrointestinal Imaging

Sushant Gautam, Vajira Thambawita, Michael Riegler, Pål Halvorsen, Steven Hicks

Main category: cs.CV

TL;DR: The Medico 2025 challenge focuses on developing explainable AI models for GI imaging VQA, using the Kvasir-VQA-x1 dataset to answer questions and generate clinical explanations.

DetailsMotivation: Advance trustworthy AI in medical image analysis by ensuring models provide interpretable justifications aligned with medical reasoning.

Method: Two subtasks: (1) answering diverse visual questions using the Kvasir-VQA-x1 dataset, and (2) generating multimodal explanations for clinical decision-making.

Result: Benchmarking with 6,500 images and 159,549 QA pairs, evaluated via quantitative metrics and expert-reviewed explainability.

Conclusion: The challenge aims to improve AI reliability in medical imaging by combining performance and explainability.

Abstract: The Medico 2025 challenge addresses Visual Question Answering (VQA) for Gastrointestinal (GI) imaging, organized as part of the MediaEval task series. The challenge focuses on developing Explainable Artificial Intelligence (XAI) models that answer clinically relevant questions based on GI endoscopy images while providing interpretable justifications aligned with medical reasoning. It introduces two subtasks: (1) answering diverse types of visual questions using the Kvasir-VQA-x1 dataset, and (2) generating multimodal explanations to support clinical decision-making. The Kvasir-VQA-x1 dataset, created from 6,500 images and 159,549 complex question-answer (QA) pairs, serves as the benchmark for the challenge. By combining quantitative performance metrics and expert-reviewed explainability assessments, this task aims to advance trustworthy Artificial Intelligence (AI) in medical image analysis. Instructions, data access, and an updated guide for participation are available in the official competition repository: https://github.com/simula/MediaEval-Medico-2025

[186] Lightweight CNNs for Embedded SAR Ship Target Detection and Classification

Fabian Kresse, Georgios Pilikos, Mario Azcueta, Nicolas Floury

Main category: cs.CV

TL;DR: Proposes neural networks for real-time on-board SAR data processing to reduce downlink needs, demonstrating feasibility on FPGA and target classification.

DetailsMotivation: Near-real-time maritime vessel monitoring with SAR data is limited by bandwidth and latency due to raw data downlinking and ground processing. On-board processing could mitigate these issues.

Method: Develops neural networks for real-time inference on unfocused SAR data (Stripmap and IW modes from Sentinel-1) and evaluates deployment on FPGA. Tests binary classification (ships vs. windmills).

Result: Feasibility of on-board processing using proposed models is demonstrated. Target classification between ships and windmills is achievable.

Conclusion: Neural networks enable efficient on-board SAR data processing, reducing downlink volume and latency, with potential for real-time maritime surveillance.

Abstract: Synthetic Aperture Radar (SAR) data enables large-scale surveillance of maritime vessels. However, near-real-time monitoring is currently constrained by the need to downlink all raw data, perform image focusing, and subsequently analyze it on the ground. On-board processing to generate higher-level products could reduce the data volume that needs to be downlinked, alleviating bandwidth constraints and minimizing latency. However, traditional image focusing and processing algorithms face challenges due to the satellite’s limited memory, processing power, and computational resources. This work proposes and evaluates neural networks designed for real-time inference on unfocused SAR data acquired in Stripmap and Interferometric Wide (IW) modes captured with Sentinel-1. Our results demonstrate the feasibility of using one of our models for on-board processing and deployment on an FPGA. Additionally, by investigating a binary classification task between ships and windmills, we demonstrate that target classification is possible.

[187] SegDAC: Segmentation-Driven Actor-Critic for Visual Reinforcement Learning

Alexandre Brown, Glen Berseth

Main category: cs.CV

TL;DR: SegDAC, a Segmentation-Driven Actor-Critic method, leverages SAM and YOLO-World for object-centric decomposition and semantic grounding, achieving superior visual generalization and sample efficiency in RL.

DetailsMotivation: Visual RL struggles with integrating large perception models for effective generalization and sample efficiency.

Method: SegDAC combines SAM for segmentation and YOLO-World for semantic grounding, using a transformer-based architecture to dynamically focus on segments via online RL.

Result: SegDAC doubles performance on the hardest visual generalization benchmark and matches or exceeds prior methods in sample efficiency.

Conclusion: SegDAC effectively integrates perception and RL, advancing visual generalization and efficiency without human labels.

Abstract: Visual reinforcement learning (RL) is challenging due to the need to learn both perception and actions from high-dimensional inputs and noisy rewards. Although large perception models exist, integrating them effectively into RL for visual generalization and improved sample efficiency remains unclear. We propose SegDAC, a Segmentation-Driven Actor-Critic method. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground segments semantically via text prompts. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and effectively learns which segments to focus on using online RL, without using human labels. By evaluating SegDAC over a challenging visual generalization benchmark using Maniskill3, which covers diverse manipulation tasks under strong visual perturbations, we demonstrate that SegDAC achieves significantly better visual generalization, doubling prior performance on the hardest setting and matching or surpassing prior methods in sample efficiency across all evaluated tasks.

[188] Revisiting Cross-View Localization from Image Matching

Panwang Xia, Qiong Wu, Lei Yu, Yi Liu, Mingtao Xiong, Lei Liang, Yongjun Zhang, Yi Wan

Main category: cs.CV

TL;DR: The paper proposes a novel framework for cross-view localization by improving cross-view image matching, introducing a Surface Model and SimRefiner module, and releasing a benchmark dataset (CVFM).

DetailsMotivation: Existing methods for cross-view localization lack strict correspondences, leading to coarse or inconsistent matches, which limits interpretability.

Method: The framework includes a Surface Model for accurate BEV projection and a SimRefiner module for refining similarity matrices without post-processing.

Result: The approach significantly enhances localization accuracy and image matching quality, setting new benchmarks.

Conclusion: The proposed method advances cross-view localization by addressing matching challenges and providing a new benchmark for future research.

Abstract: Cross-view localization aims to estimate the 3 degrees of freedom pose of a ground-view image by registering it to aerial or satellite imagery. It is essential in GNSS-denied environments such as urban canyons and disaster zones. Existing methods either regress poses directly or align features in a shared bird’s-eye view (BEV) space, both built upon accurate spatial correspondences between perspectives. However, these methods fail to establish strict cross-view correspondences, yielding only coarse or geometrically inconsistent matches. Consequently, fine-grained image matching between ground and aerial views remains an unsolved problem, which in turn constrains the interpretability of localization results. In this paper, we revisit cross-view localization from the perspective of cross-view image matching and propose a novel framework that improves both matching and localization. Specifically, we introduce a Surface Model to model visible regions for accurate BEV projection, and a SimRefiner module to refine the similarity matrix through local-global residual correction, eliminating the reliance on post-processing like RANSAC. To further support research in this area, we introduce CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach substantially improves both localization accuracy and image matching quality, setting new baselines under extreme viewpoint disparity.

[189] ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan

Main category: cs.CV

TL;DR: ToonComposer unifies inbetweening and colorization in cartoon production, reducing manual effort and improving quality with sparse inputs.

DetailsMotivation: Traditional cartoon production is labor-intensive, and existing AI methods handle stages separately, causing errors and artifacts.

Method: ToonComposer integrates inbetweening and colorization using sparse sketch injection and a cartoon adaptation method with a spatial low-rank adapter.

Result: ToonComposer outperforms existing methods in visual quality, motion consistency, and efficiency, validated by PKBench.

Conclusion: ToonComposer offers a superior, flexible solution for AI-assisted cartoon production, reducing manual workload.

Abstract: Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.

[190] Exploiting Discriminative Codebook Prior for Autoregressive Image Generation

Longxiang Tang, Ruihang Chu, Xiang Wang, Yujin Han, Pingyu Wu, Chunming He, Yingya Zhang, Shiwei Zhang, Jiaya Jia

Main category: cs.CV

TL;DR: The paper introduces DCPE, a method to better utilize token similarity in codebooks for autoregressive image generation, outperforming k-means clustering.

DetailsMotivation: Existing methods like k-means clustering fail to effectively use token similarity in codebooks, leading to poor performance in autoregressive models.

Method: Proposes DCPE, which uses instance-based distance and agglomerative merging to address token space disparity and centroid inaccuracy.

Result: DCPE accelerates training by 42% on LlamaGen-B and improves FID and IS performance.

Conclusion: DCPE is a plug-and-play solution that enhances autoregressive model training by better leveraging codebook priors.

Abstract: Advanced discrete token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. However, we reveal that k-means clustering performs poorly in the codebook feature space due to inherent issues, including token space disparity and centroid distance inaccuracy. In this work, we propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means clustering for more effectively mining and utilizing the token similarity information embedded in the codebook. DCPE replaces the commonly used centroid-based distance, which is found to be unsuitable and inaccurate for the token feature space, with a more reasonable instance-based distance. Using an agglomerative merging technique, it further addresses the token space disparity issue by avoiding splitting high-density regions and aggregating low-density ones. Extensive experiments demonstrate that DCPE is plug-and-play and integrates seamlessly with existing codebook prior-based paradigms. With the discriminative prior extracted, DCPE accelerates the training of autoregressive models by 42% on LlamaGen-B and improves final FID and IS performance.
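
The two substitutions DCPE makes for k-means, an instance-based (nearest-pair) distance between token groups and agglomerative merging, can be approximated with off-the-shelf single-linkage clustering; the sketch below is an approximation under that reading of the abstract, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

codebook = np.random.randn(1024, 256)       # stand-in for VQ codebook embeddings

merger = AgglomerativeClustering(
    n_clusters=128,                         # reduced codebook size (illustrative)
    linkage="single",                       # nearest-pair distance between groups,
)                                           # rather than a centroid-based distance
coarse_ids = merger.fit_predict(codebook)   # maps each token index to a merged token
print(coarse_ids[:10])
```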

[191] Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction

Luyao Tang, Kunze Huang, Chaoqi Chen, Yuxuan Yuan, Chenxin Li, Xiaotong Tu, Xinghao Ding, Yue Huang

Main category: cs.CV

TL;DR: ConGCD introduces a human-inspired method for generalized category discovery by decomposing objects into visual primitives and using consensus units for discriminative patterns.

DetailsMotivation: Bridging the gap between human perceptual systems and machine learning by mimicking human cognitive processes for novel object understanding.

Method: Decomposes objects into visual primitives, uses dominant and contextual consensus units, and dynamically optimizes activation pathways.

Result: Effective performance on coarse- and fine-grained benchmarks, demonstrating its consensus-aware paradigm.

Conclusion: ConGCD offers a novel, human-inspired approach to generalized category discovery, outperforming existing methods.

Abstract: Human perceptual systems excel at inducing and recognizing objects across both known and novel categories, a capability far beyond current machine learning frameworks. While generalized category discovery (GCD) aims to bridge this gap, existing methods predominantly focus on optimizing objective functions. We present an orthogonal solution, inspired by the human cognitive process for novel object understanding: decomposing objects into visual primitives and establishing cross-knowledge comparisons. We propose ConGCD, which establishes primitive-oriented representations through high-level semantic reconstruction, binding intra-class shared attributes via deconstruction. Mirroring human preference diversity in visual processing, where distinct individuals leverage dominant or contextual cues, we implement dominant and contextual consensus units to capture class-discriminative patterns and inherent distributional invariants, respectively. A consensus scheduler dynamically optimizes activation pathways, with final predictions emerging through multiplex consensus integration. Extensive evaluations across coarse- and fine-grained benchmarks demonstrate ConGCD’s effectiveness as a consensus-aware paradigm. Code is available at github.com/lytang63/ConGCD.

[192] Privacy-enhancing Sclera Segmentation Benchmarking Competition: SSBC 2025

Matej Vitek, Darian Tomašević, Abhijit Das, Sabari Nathan, Gökhan Özbulak, Gözde Ayşe Tataroğlu Özbulak, Jean-Paul Calbimonte, André Anjos, Hariohm Hemant Bhatt, Dhruv Dhirendra Premani, Jay Chaudhari, Caiyong Wang, Jian Jiang, Chi Zhang, Qi Zhang, Iyyakutti Iyappan Ganapathi, Syed Sadaf Ali, Divya Velayudan, Maregu Assefa, Naoufel Werghi, Zachary A. Daniels, Leeon John, Ritesh Vyas, Jalil Nourmohammadi Khiarak, Taher Akbari Saeed, Mahsa Nasehi, Ali Kianfar, Mobina Pashazadeh Panahi, Geetanjali Sharma, Pushp Raj Panth, Raghavendra Ramachandra, Aditya Nigam, Umapada Pal, Peter Peer, Vitomir Štruc

Main category: cs.CV

TL;DR: The 2025 SSBC evaluated privacy-preserving sclera-segmentation models using synthetic data, comparing them to real-world data models. Top synthetic-data models achieved competitive performance (F1 > 0.8), and methodological choices often outweighed real-data inclusion in mixed tracks.

DetailsMotivation: To assess the viability of synthetic data for privacy-aware biometric development and compare its performance to real-world data.

Method: Two competition tracks: one using only synthetic data, and another mixing synthetic with limited real-world data. Nine groups submitted diverse models (e.g., transformers, lightweight networks).

Result: Synthetic-data models achieved competitive performance (F1 > 0.8). Mixed-track gains were driven more by methodology than real-data inclusion.

Conclusion: Synthetic data holds promise for privacy-aware biometric development, with performance often matching or exceeding real-data models when paired with effective training strategies.

Abstract: This paper presents a summary of the 2025 Sclera Segmentation Benchmarking Competition (SSBC), which focused on the development of privacy-preserving sclera-segmentation models trained using synthetically generated ocular images. The goal of the competition was to evaluate how well models trained on synthetic data perform in comparison to those trained on real-world datasets. The competition featured two tracks: $(i)$ one relying solely on synthetic data for model development, and $(ii)$ one combining/mixing synthetic with (a limited amount of) real-world data. A total of nine research groups submitted diverse segmentation models, employing a variety of architectural designs, including transformer-based solutions, lightweight models, and segmentation networks guided by generative frameworks. Experiments were conducted across three evaluation datasets containing both synthetic and real-world images, collected under diverse conditions. Results show that models trained entirely on synthetic data can achieve competitive performance, particularly when dedicated training strategies are employed, as evidenced by the top performing models that achieved $F_1$ scores of over $0.8$ in the synthetic data track. Moreover, performance gains in the mixed track were often driven more by methodological choices rather than by the inclusion of real data, highlighting the promise of synthetic data for privacy-aware biometric development. The code and data for the competition is available at: https://github.com/dariant/SSBC_2025.
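
For reference, the ranking metric above reduces to pixel-wise precision and recall on binary sclera masks. A minimal NumPy sketch (the function name and epsilon smoothing are our own):

```python
import numpy as np

def sclera_f1(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9) -> float:
    """Pixel-wise F1 between binary predicted and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return float(2 * precision * recall / (precision + recall + eps))
```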

[193] Axis-level Symmetry Detection with Group-Equivariant Representation

Wongyun Yu, Ahyun Seo, Minsu Cho

Main category: cs.CV

TL;DR: A novel framework for detecting reflection and rotation symmetry in complex scenes using geometric primitives and a dual-branch architecture, achieving state-of-the-art performance.

DetailsMotivation: Detecting symmetry in complex scenes is challenging, and existing heatmap-based methods lack precision in identifying individual symmetry axes.

Method: Uses a dual-branch architecture equivariant to the dihedral group, with orientational anchors for reflection symmetry and rotational matching for rotational symmetry.

Result: Outperforms existing approaches, achieving state-of-the-art performance in symmetry detection.

Conclusion: The proposed framework effectively detects symmetry axes with high precision, advancing the field of symmetry detection in computer vision.

Abstract: Symmetry is a fundamental concept that has been extensively studied, yet detecting it in complex scenes remains a significant challenge in computer vision. Recent heatmap-based approaches can localize potential regions of symmetry axes but often lack precision in identifying individual axes. In this work, we propose a novel framework for axis-level detection of the two most common symmetry types, reflection and rotation, by representing them as explicit geometric primitives, i.e., lines and points. Our method employs a dual-branch architecture that is equivariant to the dihedral group, with each branch specialized to exploit the structure of dihedral group-equivariant features for its respective symmetry type. For reflection symmetry, we introduce orientational anchors, aligned with group components, to enable orientation-specific detection, and a reflectional matching that measures similarity between patterns and their mirrored counterparts across candidate axes. For rotational symmetry, we propose a rotational matching that compares patterns at fixed angular intervals to identify rotational centers. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches.
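
The reflectional matching component, which scores candidate axes by comparing patterns with their mirrored counterparts, can be illustrated in a simplified, axis-aligned form; the paper handles arbitrary axis orientations through dihedral-group-equivariant features, which this sketch omits:

```python
import torch
import torch.nn.functional as F

def reflection_scores(feat: torch.Tensor) -> torch.Tensor:
    """Score each column of a (C, H, W) feature map as a candidate vertical
    reflection axis: compare the strip to its left with the mirrored strip
    to its right via cosine similarity."""
    _, _, W = feat.shape
    scores = torch.full((W,), float("-inf"))
    for x in range(1, W - 1):
        w = min(x, W - x)                         # widest symmetric strip
        left = feat[..., x - w:x].reshape(-1)
        right = feat[..., x:x + w].flip(-1).reshape(-1)  # mirror right side
        scores[x] = F.cosine_similarity(left, right, dim=0)
    return scores
```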

[194] Forgery Guided Learning Strategy with Dual Perception Network for Deepfake Cross-domain Detection

Lixin Jia, Zhiqing Guo, Gaobo Yang, Liejun Wang, Keqin Li

Main category: cs.CV

TL;DR: The paper proposes a Forgery Guided Learning (FGL) strategy and Dual Perception Network (DPNet) to improve deepfake detection by adapting to unknown forgery techniques and capturing forgery traces.

DetailsMotivation: Address the poor performance of current deepfake detection methods on datasets with unknown forgery techniques and the widening gap between emerging and traditional methods.

Method: FGL captures differential information between known and unknown techniques, while DPNet integrates frequency and spatial features and uses graph convolution for trace correlation.

Result: The approach generalizes well across scenarios and effectively handles unknown forgery challenges.

Conclusion: The proposed method provides robust support for deepfake detection, adapting to rapidly evolving forgery techniques.

Abstract: The emergence of deepfake technology has introduced a range of societal problems, garnering considerable attention. Current deepfake detection methods perform well on specific datasets, but exhibit poor performance when applied to datasets with unknown forgery techniques. Moreover, as the gap between emerging and traditional forgery techniques continues to widen, cross-domain detection methods that rely on common forgery traces are becoming increasingly ineffective. This situation highlights the urgency of developing deepfake detection technology with strong generalization to cope with rapidly evolving forgery techniques. To address these challenges, we propose a Forgery Guided Learning (FGL) strategy designed to enable detection networks to continuously adapt to unknown forgery techniques. Specifically, the FGL strategy captures the differential information between known and unknown forgery techniques, allowing the model to dynamically adjust its learning process in real time. To further improve the ability to perceive forgery traces, we design a Dual Perception Network (DPNet) that captures both differences and relationships among forgery traces. In the frequency stream, the network dynamically perceives and extracts discriminative features across various forgery techniques, establishing essential detection cues. These features are then integrated with spatial features and projected into the embedding space. In addition, graph convolution is employed to perceive relationships across the entire feature space, facilitating a more comprehensive understanding of forgery trace correlations. Extensive experiments show that our approach generalizes well across different scenarios and effectively handles unknown forgery challenges, providing robust support for deepfake detection. Our code is available at https://github.com/vpsg-research/FGL.
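
The frequency stream rests on the observation that forgery traces are often easier to separate in the spectrum. A plain log-amplitude transform shows the input-side intuition; the paper's network learns its frequency features dynamically rather than using a fixed transform:

```python
import torch

def log_amplitude_spectrum(x: torch.Tensor) -> torch.Tensor:
    """Map an image batch (B, C, H, W) to its log-amplitude spectrum,
    with low frequencies shifted to the center."""
    spec = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    return torch.log1p(spec.abs())  # compress the dynamic range
```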

[195] An Efficient Model-Driven Groupwise Approach for Atlas Construction

Ziwei Zou, Bei Zou, Xiaoyan Kui, Wenqi Lu, Haoran Dou, Arezoo Zakeri, Timothy Cootes, Alejandro F Frangi, Jinming Duan

Main category: cs.CV

TL;DR: DARC is a model-driven groupwise registration framework for atlas construction, offering flexibility, efficiency, and high anatomical fidelity without requiring large training datasets.

DetailsMotivation: Current data-driven registration methods for atlas construction rely on large datasets and lack generalizability, while model-driven methods face scalability issues. DARC addresses these limitations.

Method: DARC uses a coordinate descent strategy and a centrality-enforcing activation function to efficiently handle 3D images, supporting various dissimilarity metrics without GPU memory constraints.

Result: DARC produces unbiased, diffeomorphic atlases and excels in applications like one-shot segmentation and shape synthesis, outperforming few-shot methods.

Conclusion: DARC provides a scalable, training-free, and versatile solution for atlas construction and related applications.

Abstract: Atlas construction is fundamental to medical image analysis, offering a standardized spatial reference for tasks such as population-level anatomical modeling. While data-driven registration methods have recently shown promise in pairwise settings, their reliance on large training datasets, limited generalizability, and lack of true inference phases in groupwise contexts hinder their practical use. In contrast, model-driven methods offer training-free, theoretically grounded, and data-efficient alternatives, though they often face scalability and optimization challenges when applied to large 3D datasets. In this work, we introduce DARC (Diffeomorphic Atlas Registration via Coordinate descent), a novel model-driven groupwise registration framework for atlas construction. DARC supports a broad range of image dissimilarity metrics and efficiently handles arbitrary numbers of 3D images without incurring GPU memory issues. Through a coordinate descent strategy and a centrality-enforcing activation function, DARC produces unbiased, diffeomorphic atlases with high anatomical fidelity. Beyond atlas construction, we demonstrate two key applications: (1) One-shot segmentation, where labels annotated only on the atlas are propagated to subjects via inverse deformations, outperforming state-of-the-art few-shot methods; and (2) shape synthesis, where new anatomical variants are generated by warping the atlas mesh using synthesized diffeomorphic deformation fields. Overall, DARC offers a flexible, generalizable, and resource-efficient framework for atlas construction and applications.
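
One plausible reading of the centrality-enforcing activation is a recentering step that keeps the mean deformation across subjects at zero, which is what makes the resulting atlas unbiased. A hypothetical sketch:

```python
import torch

def enforce_centrality(fields: torch.Tensor) -> torch.Tensor:
    """Recenter per-subject deformation fields (N, 3, D, H, W) so their
    mean is zero, preventing any single subject from pulling the atlas
    toward itself on average."""
    return fields - fields.mean(dim=0, keepdim=True)
```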

[196] From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models

Tiancheng Han, Yunfei Gao, Yong Li, Wuzhou Yu, Qiaosheng Zhang, Wenqi Shao

Main category: cs.CV

TL;DR: The paper evaluates VLMs’ spatio-physical reasoning, finds them lacking due to biases and shallow reasoning, improves Qwen2.5-VL-7B via fine-tuning and reinforcement learning, but notes limited generalization.

DetailsMotivation: To assess and enhance VLMs' spatio-physical reasoning, a critical but underexplored capability for robust world models.

Method: Diagnostic analysis of VLMs, supervised fine-tuning, and rule-based reinforcement learning applied to Qwen2.5-VL-7B.

Result: Improved spatio-physical reasoning in Qwen2.5-VL-7B, outperforming proprietary models, but limited generalization.

Conclusion: Despite improvements, new approaches are needed for better generalization in spatio-physical reasoning.

Abstract: Spatio-physical reasoning, a foundation capability for understanding the real physics world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and surpassing leading proprietary models. Nevertheless, despite this success, the model’s generalization to new physics scenarios remains limited, underscoring the pressing need for new approaches in spatio-physical reasoning.

[197] Cooperative Face Liveness Detection from Optical Flow

Artem Sokolov, Mikhail Nikitin, Anton Konushin

Main category: cs.CV

TL;DR: A novel cooperative video-based face liveness detection method using controlled face movement and optical flow analysis to distinguish real faces from attacks.

DetailsMotivation: To improve discrimination between genuine faces and presentation attacks (e.g., photos, screens, masks, replays) by leveraging user interaction and optical flow.

Method: Participants move their face closer to the camera; optical flow and RGB frames are processed by a neural classifier for spatial-temporal feature extraction.

Result: Robust extraction of facial volume information, significantly improving liveness detection accuracy over passive methods.

Conclusion: The method enhances reliability in liveness detection by combining controlled user interaction and neural optical flow analysis.

Abstract: In this work, we propose a novel cooperative video-based face liveness detection method based on a new user interaction scenario where participants are instructed to slowly move their frontal-oriented face closer to the camera. This controlled approaching face protocol, combined with optical flow analysis, represents the core innovation of our approach. By designing a system where users follow this specific movement pattern, we enable robust extraction of facial volume information through neural optical flow estimation, significantly improving discrimination between genuine faces and various presentation attacks (including printed photos, screen displays, masks, and video replays). Our method processes both the predicted optical flows and RGB frames through a neural classifier, effectively leveraging spatial-temporal features for more reliable liveness detection compared to passive methods.
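
To make the classifier's input concrete, dense optical flow between consecutive frames can be computed with classical Farneback flow from OpenCV; the paper uses a neural flow estimator, so this is only a stand-in:

```python
import cv2

def approach_flow(prev_bgr, next_bgr):
    """Dense optical flow (H, W, 2) between consecutive frames.

    A genuine face approaching the camera yields a radial expansion
    pattern that flat photo or screen replays lack."""
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```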

[198] VasoMIM: Vascular Anatomy-Aware Masked Image Modeling for Vessel Segmentation

De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Tian-Yu Xiang, Rui-Ze Ma, Nu-Fang Xiao, Zeng-Guang Hou

Main category: cs.CV

TL;DR: VasoMIM, a novel self-supervised learning framework, improves vessel segmentation in X-ray angiograms by integrating vascular anatomy knowledge into masked image modeling.

DetailsMotivation: The scarcity of annotated data and conventional MIM's failure to capture vascular anatomy due to class imbalance drive the need for a tailored solution.

Method: VasoMIM includes an anatomy-guided masking strategy and anatomical consistency loss to focus on vessel regions and improve representation discriminability.

Result: VasoMIM achieves state-of-the-art performance on three datasets.

Conclusion: VasoMIM shows promise for enhancing X-ray angiogram analysis by addressing the limitations of conventional MIM.

Abstract: Accurate vessel segmentation in X-ray angiograms is crucial for numerous clinical applications. However, the scarcity of annotated data presents a significant challenge, which has driven the adoption of self-supervised learning (SSL) methods such as masked image modeling (MIM) to leverage large-scale unlabeled data for learning transferable representations. Unfortunately, conventional MIM often fails to capture vascular anatomy because of the severe class imbalance between vessel and background pixels, leading to weak vascular representations. To address this, we introduce Vascular anatomy-aware Masked Image Modeling (VasoMIM), a novel MIM framework tailored for X-ray angiograms that explicitly integrates anatomical knowledge into the pre-training process. Specifically, it comprises two complementary components: anatomy-guided masking strategy and anatomical consistency loss. The former preferentially masks vessel-containing patches to focus the model on reconstructing vessel-relevant regions. The latter enforces consistency in vascular semantics between the original and reconstructed images, thereby improving the discriminability of vascular representations. Empirically, VasoMIM achieves state-of-the-art performance across three datasets. These findings highlight its potential to facilitate X-ray angiogram analysis.
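
The anatomy-guided masking strategy can be sketched as biased patch sampling, where patches containing more vessel pixels are masked more often. The mask ratio and probability floor below are illustrative assumptions:

```python
import numpy as np

def anatomy_guided_mask(vessel_map: np.ndarray, patch: int = 16,
                        mask_ratio: float = 0.75, rng=None) -> np.ndarray:
    """Sample a patch mask biased toward vessel-containing patches.

    vessel_map: (H, W) binary map of (pseudo-)vessel pixels. Returns a
    boolean (H//patch, W//patch) grid, True = masked. Sampling weight is
    each patch's vessel-pixel fraction plus a small floor so background
    patches remain eligible."""
    rng = rng or np.random.default_rng()
    gh, gw = vessel_map.shape[0] // patch, vessel_map.shape[1] // patch
    frac = (vessel_map[:gh * patch, :gw * patch]
            .reshape(gh, patch, gw, patch).mean(axis=(1, 3)).ravel())
    p = frac + 0.05
    p /= p.sum()
    idx = rng.choice(gh * gw, size=int(mask_ratio * gh * gw),
                     replace=False, p=p)
    mask = np.zeros(gh * gw, dtype=bool)
    mask[idx] = True
    return mask.reshape(gh, gw)
```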

[199] Object Fidelity Diffusion for Remote Sensing Image Generation

Ziqi Ye, Shuran Ma, Jie Yang, Xiaoyi Yang, Ziyang Gong, Xue Yang, Haipeng Wang

Main category: cs.CV

TL;DR: OF-Diff improves remote sensing image generation by using prior object shapes and a dual-branch diffusion model, enhancing fidelity and diversity without real images during sampling.

DetailsMotivation: Existing diffusion models struggle with low-fidelity images in remote sensing, impacting object detection reliability.

Method: Proposes OF-Diff with prior shape extraction, dual-branch diffusion, and DDPO fine-tuning for diversity.

Result: Outperforms state-of-the-art methods, with mAP increases of 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles.

Conclusion: OF-Diff significantly enhances remote sensing image generation fidelity and object detection performance.

Abstract: High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity images due to their inability to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff), which effectively improves the fidelity of generated objects. Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a dual-branch diffusion model with diffusion consistency loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the generated remote sensing images more diverse and semantically consistent. Comprehensive experiments demonstrate that OF-Diff outperforms state-of-the-art methods in remote sensing across key quality metrics. Notably, the performance of several polymorphic and small object classes shows significant improvement. For instance, the mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.

[200] Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops

Anand Kumar, Harminder Pal Monga, Tapasi Brahma, Satyam Kalra, Navas Sherif

Main category: cs.CV

TL;DR: A mobile-friendly solution for early detection of 101 plant diseases across 33 crops was developed using lightweight architectures, achieving 94.7% accuracy with EfficientNet-B1.

DetailsMotivation: Plant diseases threaten global food security, necessitating accurate early detection systems.

Method: Combined datasets (PlantDoc, PlantVillage, PlantWild) and evaluated lightweight architectures (MobileNetV2, MobileNetV3, EfficientNet-B0/B1) for mobile deployment.

Result: EfficientNet-B1 achieved the best performance with 94.7% classification accuracy.

Conclusion: EfficientNet-B1 is optimal for real-world mobile deployment due to its balance of accuracy and computational efficiency.

Abstract: Plant diseases are a major threat to food security globally. It is important to develop early detection systems that can identify them accurately. Advances in computer vision techniques have the potential to solve this challenge. We have developed a mobile-friendly solution which can accurately classify 101 plant diseases across 33 crops. We built a comprehensive dataset by combining three datasets created for the same purpose: PlantDoc, PlantVillage, and PlantWild. We evaluated performance across several lightweight architectures (MobileNetV2, MobileNetV3, MobileNetV3-Large, and EfficientNet-B0/B1), specifically chosen for their efficiency on resource-constrained devices. The results were promising, with EfficientNet-B1 delivering our best performance at 94.7% classification accuracy. This architecture struck an optimal balance between accuracy and computational efficiency, making it well-suited for real-world deployment on mobile devices.
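
The transfer-learning setup maps to a few lines of torchvision: load ImageNet weights and swap the classification head for the 101 disease classes. Training hyperparameters are not given in the summary, so none are assumed here:

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained EfficientNet-B1 and resize the head for the
# 101 disease classes; fine-tuning details are left unspecified.
model = models.efficientnet_b1(weights=models.EfficientNet_B1_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 101)
```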

[201] UI-Venus Technical Report: Building High-performance UI Agents with RFT

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, Weiqiang Wang

Main category: cs.CV

TL;DR: UI-Venus is a state-of-the-art UI agent using screenshots and a multimodal LLM, achieving top performance in UI grounding and navigation with minimal training data and innovative techniques like RFT and self-evolving frameworks.

DetailsMotivation: To advance UI interaction by creating an agent that relies solely on screenshots, outperforming existing models in grounding and navigation tasks.

Method: Uses reinforcement finetuning (RFT) based on Qwen2.5-VL, with reward functions, data cleaning, and self-evolving trajectory alignment for navigation.

Result: Achieves 94.1%/50.8% (7B) and 95.3%/61.9% (72B) on Screenspot-V2/Pro benchmarks, and 49.1%/65.9% success on AndroidWorld, surpassing prior models.

Conclusion: UI-Venus sets new benchmarks, introduces open-source tools, and encourages further research with its innovative framework and data protocols.

Abstract: We present UI-Venus, a native UI agent that takes only screenshots as input based on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement finetuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus’s summary and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
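
A grounding reward, in its simplest plausible form, checks whether the predicted click point lands inside the target element's bounding box; the paper's reward design is more elaborate, so treat this only as the basic signal RFT would optimize:

```python
def grounding_reward(pred_xy, gt_box) -> float:
    """Return 1.0 if the predicted click (x, y) lands inside the
    ground-truth element box (x1, y1, x2, y2), else 0.0."""
    x, y = pred_xy
    x1, y1, x2, y2 = gt_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0
```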

[202] Self-Supervised Stereo Matching with Multi-Baseline Contrastive Learning

Peng Xu, Zhiyu Xiang, Jingyun Fu, Tianyu Pu, Kai Wang, Chaojie Ji, Tingming Bai, Eryun Liu

Main category: cs.CV

TL;DR: BaCon-Stereo introduces a contrastive learning framework for self-supervised stereo matching, addressing occluded regions via a teacher-student paradigm and multi-baseline inputs.

DetailsMotivation: The photometric consistency assumption fails in occluded regions, leading to ill-posed correspondences. BaCon-Stereo aims to improve stereo matching in such regions.

Method: Uses a teacher-student framework with multi-baseline inputs, where the teacher’s predictions in occluded regions supervise the student. Includes an occlusion-aware attention map and a synthetic dataset (BaCon-20k).

Result: Improves predictions in occluded and non-occluded regions, shows strong generalization, and outperforms state-of-the-art methods on KITTI benchmarks.

Conclusion: BaCon-Stereo effectively addresses occlusion challenges in self-supervised stereo matching, demonstrating superior performance and robustness.

Abstract: Current self-supervised stereo matching relies on the photometric consistency assumption, which breaks down in occluded regions due to ill-posed correspondences. To address this issue, we propose BaCon-Stereo, a simple yet effective contrastive learning framework for self-supervised stereo network training in both non-occluded and occluded regions. We adopt a teacher-student paradigm with multi-baseline inputs, in which the stereo pairs fed into the teacher and student share the same reference view but differ in target views. Geometrically, regions occluded in the student’s target view are often visible in the teacher’s, making it easier for the teacher to predict in these regions. The teacher’s prediction is rescaled to match the student’s baseline and then used to supervise the student. We also introduce an occlusion-aware attention map to better guide the student in learning occlusion completion. To support training, we synthesize a multi-baseline dataset BaCon-20k. Extensive experiments demonstrate that BaCon-Stereo improves prediction in both occluded and non-occluded regions, achieves strong generalization and robustness, and outperforms state-of-the-art self-supervised methods on both KITTI 2015 and 2012 benchmarks. Our code and dataset will be released upon paper acceptance.
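
The baseline rescaling step follows directly from stereo geometry: with shared focal length $f$ and depth $Z$, disparity is $d = fB/Z$, so a teacher disparity map transfers to the student's baseline through a single ratio:

```python
def rescale_disparity(teacher_disp, baseline_teacher: float,
                      baseline_student: float):
    """Map the teacher's disparity map onto the student's baseline.

    Since d = f * B / Z for the same scene and focal length, disparity
    scales linearly with the baseline B."""
    return teacher_disp * (baseline_student / baseline_teacher)
```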

[203] Generalizable Federated Learning using Client Adaptive Focal Modulation

Tajamul Ashraf, Iqra Altaf Gillani

Main category: cs.CV

TL;DR: AdaptFED enhances TransFed by refining focal modulation in federated learning with task-aware embeddings, better theoretical bounds, and broader validation, while reducing communication overhead.

DetailsMotivation: To improve generalizability and scalability of transformer-based federated learning, addressing non-IID and cross-domain challenges.

Method: Introduces AdaptFED with refined adaptation strategies, theoretical performance bounds, and low-rank hypernetwork conditioning for efficiency.

Result: Outperforms state-of-the-art baselines in diverse datasets, especially in source-free and cross-task federated setups.

Conclusion: AdaptFED advances transformer-based FL, offering adaptive, scalable, and generalizable solutions.

Abstract: Federated learning (FL) has proven essential for privacy-preserving, collaborative training across distributed clients. Our prior work, TransFed, introduced a robust transformer-based FL framework that leverages a learn-to-adapt hypernetwork to generate personalized focal modulation layers per client, outperforming traditional methods in non-IID and cross-domain settings. In this extended version, we propose AdaptFED, where we deepen the investigation of focal modulation in generalizable FL by incorporating: (1) a refined adaptation strategy that integrates task-aware client embeddings to personalize modulation dynamics further, (2) enhanced theoretical bounds on adaptation performance, and (3) broader empirical validation across additional modalities, including time-series and multilingual data. We also introduce an efficient variant of TransFed that reduces server-client communication overhead via low-rank hypernetwork conditioning, enabling scalable deployment in resource-constrained environments. Extensive experiments on eight diverse datasets reaffirm the superiority of our method over state-of-the-art baselines, particularly in source-free and cross-task federated setups. Our findings not only extend the capabilities of focal modulation in FL but also pave the way for more adaptive, scalable, and generalizable transformer-based federated systems. The code is available at http://github.com/Tajamul21/TransFed
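
The low-rank hypernetwork conditioning can be sketched as a rank-bottlenecked map from a client embedding to modulation parameters, so only small factors travel between server and client. Dimensions and wiring are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class LowRankHyperNet(nn.Module):
    """Client embedding -> per-client modulation parameters through a
    rank-r bottleneck, so only two small factor matrices are learned
    and communicated."""

    def __init__(self, embed_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(embed_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)

    def forward(self, client_embed: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(client_embed))
```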

[204] Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation

Harold Haodong Chen, Haojian Huang, Qifeng Chen, Harry Yang, Ser-Nam Lim

Main category: cs.CV

TL;DR: PhysHPO introduces a hierarchical framework for fine-grained preference alignment in video generation, improving physical plausibility and realism by optimizing alignment at instance, state, motion, and semantic levels. It also automates data selection from existing datasets.

DetailsMotivation: Generating physically plausible videos is challenging but crucial for realism. Existing methods lack fine-grained alignment and rely on costly dataset construction.

Method: PhysHPO uses Hierarchical Cross-Modal Direct Preference Optimization to align videos at four levels (instance, state, motion, semantic) and automates data selection from existing datasets.

Result: PhysHPO significantly enhances physical plausibility and video quality, outperforming benchmarks without requiring new dataset construction.

Conclusion: PhysHPO pioneers fine-grained preference alignment and efficient data use, advancing realistic video generation.

Abstract: Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize “good data” from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.
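
The presumable building block at each granularity is a direct-preference term. The standard DPO loss on one preferred/rejected pair is shown below; PhysHPO's hierarchical weighting across the four levels is not:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Direct Preference Optimization loss on one preference pair.

    logp_*: policy log-likelihoods of the preferred (w) / rejected (l)
    sample; ref_logp_*: the same under a frozen reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```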

[205] TexVerse: A Universe of 3D Objects with High-Resolution Textures

Yibo Zhang, Li Zhang, Rui Ma, Nan Cao

Main category: cs.CV

TL;DR: TexVerse is a large-scale 3D dataset with high-resolution textures, addressing the gap in existing datasets for texture synthesis and PBR material development.

DetailsMotivation: Existing large-scale 3D datasets lack high-resolution textures, limiting research in texture synthesis and PBR material development. TexVerse aims to fill this gap.

Method: TexVerse curates over 858K unique high-resolution 3D models from Sketchfab, including subsets for rigged and animated models, with detailed annotations.

Result: The dataset includes 1.6M 3D instances, with specialized subsets (TexVerse-Skeleton and TexVerse-Animation) and detailed annotations.

Conclusion: TexVerse provides a high-quality resource for texture synthesis, PBR materials, animation, and 3D vision/graphics tasks.

Abstract: We introduce TexVerse, a large-scale 3D dataset featuring high-resolution textures. While recent advances in large-scale 3D datasets have enhanced high-resolution geometry generation, creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. TexVerse fills this gap with a curated collection of over 858K unique high-resolution 3D models sourced from Sketchfab, including more than 158K models with physically based rendering (PBR) materials. Each model encompasses all of its high-resolution variants, bringing the total to 1.6M 3D instances. TexVerse also includes specialized subsets: TexVerse-Skeleton, with 69K rigged models, and TexVerse-Animation, with 54K animated models, both preserving original skeleton and animation data uploaded by the user. We also provide detailed model annotations describing overall characteristics, structural components, and intricate features. TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks.

[206] STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan

Main category: cs.CV

TL;DR: STream3R is a Transformer-based 3D reconstruction method using causal attention for efficient streaming of image sequences, outperforming existing methods in static and dynamic scenes.

DetailsMotivation: Existing 3D reconstruction methods are either computationally expensive or scale poorly with sequence length, limiting their effectiveness in dynamic scenes.

Method: STream3R reformulates pointmap prediction as a decoder-only Transformer problem, leveraging causal attention and geometric priors from large-scale 3D datasets.

Result: The method outperforms prior work in static and dynamic scene benchmarks and is compatible with LLM-style training infrastructure.

Conclusion: STream3R demonstrates the potential of causal Transformer models for real-time 3D perception in streaming environments.

Abstract: We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
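
The causal-attention ingredient is the block-causal mask familiar from language models, extended to frames: tokens of frame $t$ may attend to all tokens of frames up to $t$ and never to future frames, which is what allows online processing with cached keys and values:

```python
import torch

def block_causal_mask(n_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask (True = blocked): each token attends to every
    token of its own and earlier frames, and to none from future frames."""
    frame_id = torch.arange(n_frames).repeat_interleave(tokens_per_frame)
    return frame_id.unsqueeze(0) > frame_id.unsqueeze(1)  # (N, N)
```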

[207] MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data

Antoine Labatie, Michael Vaccaro, Nina Lardiere, Anatol Garioud, Nicolas Gonthier

Main category: cs.CV

TL;DR: MAESTRO, a novel adaptation of the Masked Autoencoder, optimizes fusion strategies and target normalization for Earth observation data, achieving state-of-the-art results.

DetailsMotivation: Standard self-supervised methods need adaptation for Earth observation data's unique characteristics.

Method: Comprehensive benchmark of fusion strategies and reconstruction target normalization schemes, leading to MAESTRO’s design.

Result: MAESTRO sets a new state-of-the-art on multitemporal tasks and remains competitive on mono-temporal tasks.

Conclusion: MAESTRO effectively adapts self-supervised learning for Earth observation data, with reproducible code provided.

Abstract: Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder, featuring optimized fusion strategies and a tailored target normalization scheme that introduces a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
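
As an example of a reconstruction-target normalization scheme (the design axis the benchmark explores), MAE-style per-patch standardization is shown below; MAESTRO's tailored scheme additionally injects a spectral prior, which is omitted here:

```python
import torch

def normalize_targets(patches: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize each flattened patch (B, N, D) before computing the
    reconstruction loss, as in MAE's normalized-pixel targets."""
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps).sqrt()
```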

[208] ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning

Jongseo Lee, Kyungho Bae, Kyle Min, Gyeong-Moon Park, Jinwoo Choi

Main category: cs.CV

TL;DR: ESSENTIAL integrates episodic and semantic memory for efficient video class-incremental learning, balancing memory use and performance.

DetailsMotivation: Address the trade-off between memory-efficiency and performance in VCIL by combining sparse episodic memory with semantic prompts.

Method: Proposes ESSENTIAL, using episodic memory for sparse features and semantic memory for general knowledge, with a memory retrieval module for dense feature reconstruction.

Result: Achieves strong performance on multiple benchmarks (UCF-101, HMDB51, etc.) with reduced memory usage.

Conclusion: ESSENTIAL effectively balances memory efficiency and performance in VCIL.

Abstract: In this work, we tackle the problem of video class-incremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory-efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic memory and semantic prompts through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
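
At the core of the MR module is cross-attention between sparse episodic features and semantic prompts. A minimal sketch of that interaction follows; dimensions, residual wiring, and the dense-feature reconstruction head are assumptions:

```python
import torch
import torch.nn as nn

class MemoryRetrieval(nn.Module):
    """Cross-attention from sparse episodic features (queries) to
    semantic-memory prompts (keys/values)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, sparse_feats: torch.Tensor,
                prompts: torch.Tensor) -> torch.Tensor:
        # sparse_feats: (B, T_sparse, D) from episodic memory
        # prompts:      (B, P, D) learnable semantic prompts
        retrieved, _ = self.attn(sparse_feats, prompts, prompts)
        return sparse_feats + retrieved  # residual enrichment
```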

[209] Human-in-Context: Unified Cross-Domain 3D Human Motion Modeling via In-Context Learning

Mengyuan Liu, Xinshun Wang, Zhongbin Fang, Deheng Ye, Xia Li, Tao Tang, Songtao Wu, Xiangtai Li, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: The paper proposes a unified cross-domain model for 3D human motion, addressing limitations of existing methods by introducing Pose-in-Context (PiC) and its extension, Human-in-Context (HiC). HiC improves generalization, scalability, and performance across diverse domains.

DetailsMotivation: Existing cross-domain models for 3D human motion rely on domain-specific components and multi-stage training, limiting practicality and scalability. The goal is to create a unified model that handles multiple modalities, tasks, and datasets efficiently.

Method: The authors introduce Pose-in-Context (PiC) for pose-based tasks, then extend it to Human-in-Context (HiC), which integrates pose and mesh representations, expands task coverage, and uses a max-min similarity prompt sampling strategy. HiC also features a dual-branch context injection architecture.

Result: HiC outperforms PiC in generalization, data scale, and performance across diverse domains, demonstrating improved flexibility and scalability.

Conclusion: HiC shows promise as a unified cross-domain 3D human motion model, offering enhanced generalization and scalability. The source code and models are publicly available.

Abstract: This paper aims to model 3D human motion across domains, where a single model is expected to handle multiple modalities, tasks, and datasets. Existing cross-domain models often rely on domain-specific components and multi-stage training, which limits their practicality and scalability. To overcome these challenges, we propose a new setting to train a unified cross-domain model through a single process, eliminating the need for domain-specific components and multi-stage training. We first introduce Pose-in-Context (PiC), which leverages in-context learning to create a pose-centric cross-domain model. While PiC generalizes across multiple pose-based tasks and datasets, it encounters difficulties with modality diversity, prompting strategy, and contextual dependency handling. We thus propose Human-in-Context (HiC), an extension of PiC that broadens generalization across modalities, tasks, and datasets. HiC combines pose and mesh representations within a unified framework, expands task coverage, and incorporates larger-scale datasets. Additionally, HiC introduces a max-min similarity prompt sampling strategy to enhance generalization across diverse domains and a network architecture with dual-branch context injection for improved handling of contextual dependencies. Extensive experimental results show that HiC performs better than PiC in terms of generalization, data scale, and performance across a wide range of domains. These results demonstrate the potential of HiC for building a unified cross-domain 3D human motion model with improved flexibility and scalability. The source codes and models are available at https://github.com/BradleyWang0416/Human-in-Context.
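
One plausible reading of max-min similarity prompt sampling is greedy farthest-point selection over candidate prompt embeddings, so that chosen prompts cover diverse domains. This is an interpretation, not the paper's verified procedure:

```python
import torch

def max_min_prompt_sampling(candidates: torch.Tensor, k: int) -> torch.Tensor:
    """Greedily pick k rows of candidates (N, D): each pick maximizes its
    minimum distance to the prompts already chosen."""
    chosen = [0]
    dists = torch.cdist(candidates, candidates)  # (N, N) pairwise distances
    for _ in range(k - 1):
        min_d = dists[:, chosen].min(dim=1).values
        min_d[chosen] = -1.0                     # never re-pick a prompt
        chosen.append(int(min_d.argmax()))
    return candidates[chosen]
```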

[210] Puppeteer: Rig and Animate Your 3D Models

Chaoyue Song, Xiu Li, Fan Yang, Zhongcong Xu, Jiacheng Wei, Fayao Liu, Jiashi Feng, Guosheng Lin, Jianfeng Zhang

Main category: cs.CV

TL;DR: Puppeteer is a framework for automatic rigging and animation of 3D objects, outperforming state-of-the-art methods in skeletal prediction and skinning quality.

DetailsMotivation: To address the bottleneck of transforming static 3D models into animated assets, reducing reliance on expert intervention.

Method: Uses an auto-regressive transformer for skeletal prediction, attention-based skinning weight inference, and a differentiable optimization-based animation pipeline.

Result: Outperforms existing methods in accuracy and quality, handling diverse 3D content with stable, high-fidelity animations.

Conclusion: Puppeteer provides an efficient, high-quality solution for automating 3D rigging and animation.

Abstract: Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.

[211] CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

Main category: cs.CV

TL;DR: CAPTURe is a novel task evaluating vision-language models’ ability to count objects in occluded patterns, revealing their struggles with occlusion and counting.

DetailsMotivation: Occlusions hinder spatial comprehension, yet models lack robust reasoning for occluded objects. CAPTURe tests this ability.

Method: CAPTURe includes real and synthetic datasets, evaluating VLMs like GPT-4o on counting occluded and unoccluded patterns.

Result: VLMs perform poorly on occluded patterns, with GPT-4o failing notably. Humans excel, and auxiliary info improves model performance.

Conclusion: VLMs lack spatial reasoning for occlusions, highlighting a gap in their world modeling and counting capabilities.

Abstract: Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models’ ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs’ ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information of occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion as well as difficulty in counting in images. Code and data: https://github.com/atinpothiraj/CAPTURe

[212] Quantum Visual Fields with Neural Amplitude Encoding

Shuteng Wang, Christian Theobalt, Vladislav Golyanik

Main category: cs.CV

TL;DR: QVF introduces a novel Quantum Implicit Neural Representation for 2D/3D data, outperforming existing quantum and classical methods in accuracy and learning efficiency.

DetailsMotivation: Address challenges in QINRs like architecture design, training efficiency, and quantum-classical interplay by proposing a more effective quantum-based learning approach.

Method: Uses neural amplitude encoding and fully entangled parametrized quantum circuits in real Hilbert space for stable training and fast convergence.

Result: QVF surpasses existing quantum and classical baselines in visual representation accuracy and excels in high-frequency detail learning.

Conclusion: QVF demonstrates practical potential in 2D/3D field completion and shape interpolation, advancing quantum-based learning paradigms.

Abstract: Quantum Implicit Neural Representations (QINRs) include components for learning and execution on gate-based quantum computers. While QINRs recently emerged as a promising new paradigm, many challenges concerning their architecture and ansatz design, the utility of quantum-mechanical properties, training efficiency and the interplay with classical modules remain. This paper advances the field by introducing a new type of QINR for 2D image and 3D geometric field learning, which we collectively refer to as Quantum Visual Field (QVF). QVF encodes classical data into quantum statevectors using neural amplitude encoding grounded in a learnable energy manifold, ensuring meaningful Hilbert space embeddings. Our ansatz follows a fully entangled design of learnable parametrised quantum circuits, with quantum (unitary) operations performed in the real Hilbert space, resulting in numerically stable training with fast convergence. QVF does not rely on classical post-processing (in contrast to the previous QINR learning approach) and directly employs projective measurement to extract learned signals encoded in the ansatz. Experiments on a quantum hardware simulator demonstrate that QVF outperforms the existing quantum approach and widely used classical foundational baselines in terms of visual representation accuracy across various metrics and model characteristics, such as learning of high-frequency details. We also show applications of QVF in 2D and 3D field completion and 3D shape interpolation, highlighting its practical potential.
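
Plain amplitude encoding, the classical baseline that QVF's neural variant builds on, pads a vector to a power-of-two length and L2-normalizes it so the squared amplitudes form a valid statevector:

```python
import numpy as np

def amplitude_encode(x: np.ndarray) -> np.ndarray:
    """Embed a real vector as a valid n-qubit statevector: pad to the next
    power-of-two length and normalize so squared amplitudes sum to 1."""
    n_qubits = max(1, int(np.ceil(np.log2(len(x)))))
    state = np.zeros(2 ** n_qubits)
    state[:len(x)] = x
    norm = np.linalg.norm(state)
    return state / norm if norm > 0 else state
```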

[213] Understanding Transformer-based Vision Models through Inversion

Jan Rathjens, Shirin Reyhanian, David Kappel, Laurenz Wiskott

Main category: cs.CV

TL;DR: The paper introduces a modular feature inversion method for analyzing transformer-based vision models, revealing insights into their internal representations and robustness.

DetailsMotivation: To better understand the mechanisms of deep neural networks, particularly transformer-based vision models like Detection Transformer and Vision Transformer, through feature inversion.

Method: A novel, modular feature inversion technique is proposed and applied to reconstruct images from intermediate representations in transformer models.

Result: The method provides qualitative and quantitative insights into how these models encode image features, contextual shapes, and handle color perturbations.

Conclusion: The findings enhance understanding of transformer-based vision models, with code available for reproducibility.

Abstract: Understanding the mechanisms underlying deep neural networks remains a fundamental challenge in machine learning and computer vision. One promising, yet only preliminarily explored approach, is feature inversion, which attempts to reconstruct images from intermediate representations using trained inverse neural networks. In this study, we revisit feature inversion, introducing a novel, modular variation that enables significantly more efficient application of the technique. We demonstrate how our method can be systematically applied to the large-scale transformer-based vision models, Detection Transformer and Vision Transformer, and how reconstructed images can be qualitatively interpreted in a meaningful way. We further quantitatively evaluate our method, thereby uncovering underlying mechanisms of representing image features that emerge in the two transformer architectures. Our analysis reveals key insights into how these models encode contextual shape and image details, how their layers correlate, and their robustness against color perturbations. These findings contribute to a deeper understanding of transformer-based vision models and their internal representations. The code for reproducing our experiments is available at github.com/wiskott-lab/inverse-tvm.
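
The basic feature-inversion loop freezes the vision model and trains a decoder that maps an intermediate representation back to pixels; the paper's contribution is a modular variation of this recipe, which the sketch below does not capture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def inversion_step(decoder: nn.Module, feats: torch.Tensor,
                   images: torch.Tensor,
                   opt: torch.optim.Optimizer) -> float:
    """One training step of an inverse network. `decoder` is any module
    (hypothetical here) mapping the feature shape back to image shape;
    `feats` come from a frozen forward pass of the vision model."""
    opt.zero_grad()
    recon = decoder(feats)
    loss = F.mse_loss(recon, images)  # pixel-space reconstruction objective
    loss.backward()
    opt.step()
    return loss.item()
```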

[214] A Lightweight Transformer with Phase-Only Cross-Attention for Illumination-Invariant Biometric Authentication

Arun K. Sharma, Shubhobrata Bhattacharya, Motahar Reza, Bishakh Bhattacharya

Main category: cs.CV

TL;DR: A novel lightweight vision transformer (POC-ViT) using forehead and periocular biometric traits is proposed, achieving high accuracy (98.8%) even with face masks and no physical contact.

DetailsMotivation: Traditional biometric systems face challenges like face masks and hygiene concerns, prompting the need for a non-contact, robust alternative.

Method: POC-ViT uses phase-only cross-attention to capture structural patterns in dual biometric traits (forehead and periocular), ensuring robustness against variations.

Result: Achieved 98.8% classification accuracy on the FSVP-PBP database, outperforming state-of-the-art methods.

Conclusion: POC-ViT offers a promising, lightweight solution for biometric systems, suitable for edge devices and resilient to common challenges.

Abstract: Traditional biometric systems have encountered significant setbacks due to various unavoidable factors, for example, wearing of face masks in face recognition-based biometrics and hygiene concerns in fingerprint-based biometrics. This paper proposes a novel lightweight vision transformer with phase-only cross-attention (POC-ViT) using dual biometric traits of forehead and periocular portions of the face, capable of performing well even with face masks and without any physical touch, offering a promising alternative to traditional methods. The POC-ViT framework is designed to handle two biometric traits and to capture inter-dependencies in terms of relative structural patterns. Each channel consists of a Cross-Attention using phase-only correlation (POC) that captures both their individual and correlated structural patterns. The computation of cross-attention using POC extracts the phase correlation in the spatial features. Therefore, it is robust against variations in resolution and intensity, as well as illumination changes in the input images. The lightweight model is suitable for edge device deployment. The performance of the proposed framework was successfully demonstrated using the Forehead Subcutaneous Vein Pattern and Periocular Biometric Pattern (FSVP-PBP) database, comprising 350 subjects. The POC-ViT framework outperformed state-of-the-art methods with an outstanding classification accuracy of $98.8\%$ with the dual biometric traits.
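
Phase-only correlation itself is a classical technique: the cross-power spectrum is normalized to unit magnitude so only phase remains, which is what confers robustness to intensity and illumination changes:

```python
import numpy as np

def phase_only_correlation(a: np.ndarray, b: np.ndarray,
                           eps: float = 1e-9) -> np.ndarray:
    """Classical 2D phase-only correlation between two grayscale images.

    Keeping only the phase of the cross-power spectrum makes the
    correlation surface insensitive to intensity scaling; a sharp peak
    marks the relative shift between matching patterns."""
    fa, fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = fa * np.conj(fb)
    return np.real(np.fft.ifft2(cross / (np.abs(cross) + eps)))
```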

[215] Evaluation of Cultural Competence of Vision-Language Models

Srishti Yadav, Lauren Tilton, Maria Antoniak, Taylor Arnold, Jiaang Li, Siddhesh Milind Pawar, Antonia Karamolegkou, Stella Frank, Zhaochong An, Negar Rostamzadeh, Daniel Hershcovich, Serge Belongie, Ekaterina Shutova

Main category: cs.CV

TL;DR: The paper highlights the lack of cultural competency in vision-language models (VLMs) and proposes foundational methodologies from visual culture studies to systematically analyze cultural dimensions in images.

DetailsMotivation: VLMs often fail in cultural competency evaluations, necessitating a deeper understanding of how they encode cultural nuances.

Method: The paper reviews foundational methodologies from visual culture studies and proposes five frameworks for analyzing cultural dimensions in images.

Result: A set of five frameworks is introduced to enable a more comprehensive cultural analysis of VLMs.

Conclusion: Foundational methodologies from visual culture studies are essential for improving the cultural competency of VLMs.

Abstract: Modern vision-language models (VLMs) often fail at cultural competency evaluations and benchmarks. Given the diversity of applications built upon VLMs, there is renewed interest in understanding how they encode cultural nuances. While individual aspects of this problem have been studied, we still lack a comprehensive framework for systematically identifying and annotating the nuanced cultural dimensions present in images for VLMs. This position paper argues that foundational methodologies from visual culture studies (cultural studies, semiotics, and visual studies) are necessary for cultural analysis of images. Building upon this review, we propose a set of five frameworks, corresponding to cultural dimensions, that must be considered for a more complete analysis of the cultural competencies of VLMs.

[216] GC-MVSNet: Multi-View, Multi-Scale, Geometrically-Consistent Multi-View Stereo

Vibhas K. Vats, Sripad Joshi, David J. Crandall, Md. Alimoor Reza, Soon-heung Jung

Main category: cs.CV

TL;DR: GC-MVSNet introduces multi-view, multi-scale geometric consistency during learning, reducing training iterations and achieving state-of-the-art results.

DetailsMotivation: Traditional MVS methods rely on post-processing for geometric consistency, while newer methods lack explicit multi-view, multi-scale consistency during learning.

Method: GC-MVSNet enforces geometric consistency of reference view depth maps across multiple source views at different scales during learning, using a geometric consistency loss.

Result: The method reduces training iterations by nearly half and achieves state-of-the-art results on DTU and BlendedMVS datasets, with competitive performance on Tanks and Temples.

Conclusion: GC-MVSNet is the first to enforce multi-view, multi-scale geometric consistency during learning, significantly improving efficiency and accuracy.

Abstract: Traditional multi-view stereo (MVS) methods rely heavily on photometric and geometric consistency constraints, but newer machine learning-based MVS methods check geometric consistency across multiple source views only as a post-processing step. In this paper, we present a novel approach that explicitly encourages geometric consistency of reference view depth maps across multiple source views at different scales during learning (see Fig. 1). We find that adding this geometric consistency loss significantly accelerates learning by explicitly penalizing geometrically inconsistent pixels, reducing the training iteration requirements to nearly half that of other MVS methods. Our extensive experiments show that our approach achieves a new state-of-the-art on the DTU and BlendedMVS datasets, and competitive results on the Tanks and Temples benchmark. To the best of our knowledge, GC-MVSNet is the first attempt to enforce multi-view, multi-scale geometric consistency during learning.
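
A simplified single-view, single-scale sketch of the geometric consistency check the loss builds on (tensor names, the absolute-depth threshold, and the use of depth agreement instead of the paper's exact criterion are our assumptions; the paper aggregates this over multiple source views and scales):

```python
import torch
import torch.nn.functional as F

def geo_consistency_penalty(depth_ref, depth_src, K_ref, K_src, T_ref2src,
                            depth_thresh=0.01):
    """Lift reference pixels to 3D, transfer them into the source camera,
    sample the source depth there, and penalize pixels whose transferred
    depth disagrees (simplified sketch, not the authors' implementation)."""
    H, W = depth_ref.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)  # 3 x HW

    # Back-project reference pixels and move them into the source frame.
    pts = torch.linalg.inv(K_ref) @ pix * depth_ref.reshape(1, -1)
    pts_h = torch.cat([pts, torch.ones(1, H * W)], dim=0)            # 4 x HW
    pts_src = (T_ref2src @ pts_h)[:3]

    # Project into the source image and bilinearly sample its depth map.
    proj = K_src @ pts_src
    u = (proj[0] / proj[2]).reshape(H, W)
    v = (proj[1] / proj[2]).reshape(H, W)
    grid = torch.stack([2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1], dim=-1)
    d_src = F.grid_sample(depth_src[None, None], grid[None],
                          align_corners=True)[0, 0]

    # Penalize only geometrically inconsistent pixels, as in the GC loss.
    err = (d_src - pts_src[2].reshape(H, W)).abs()
    mask = (err > depth_thresh).float()
    return (mask * err).mean()
```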

[217] Video-based automatic lameness detection of dairy cows using pose estimation and multiple locomotion traits

Helena Russello, Rik van der Tol, Menno Holzhauer, Eldert J. van Henten, Gert Kootstra

Main category: cs.CV

TL;DR: An automated lameness detection system using deep-learning and pose estimation to analyze cow locomotion traits, improving classification accuracy with multiple traits.

DetailsMotivation: To develop a reliable, automated system for detecting lameness in cows by analyzing locomotion traits, addressing challenges like varying outdoor conditions.

Method: Uses the T-LEAP pose estimation model to extract keypoints from cow walking videos, then computes six locomotion traits for analysis.

Result: Classification accuracy improved from 76.6% (single trait) to 80.1% (all six traits), with key traits being back posture, head bobbing, and tracking distance.

Conclusion: Incorporating multiple locomotion traits enhances lameness detection accuracy, demonstrating the system’s effectiveness in real-world conditions.

Abstract: This study presents an automated lameness detection system that uses deep-learning image processing techniques to extract multiple locomotion traits associated with lameness. Using the T-LEAP pose estimation model, the motion of nine keypoints was extracted from videos of walking cows. The videos were recorded outdoors, with varying illumination conditions, and T-LEAP extracted 99.6% of correct keypoints. The trajectories of the keypoints were then used to compute six locomotion traits: back posture measurement, head bobbing, tracking distance, stride length, stance duration, and swing duration. The three most important traits were back posture measurement, head bobbing, and tracking distance. For the ground truth, we showed that a thoughtful merging of the scores of the observers could improve intra-observer reliability and agreement. We showed that including multiple locomotion traits improves the classification accuracy from 76.6% with only one trait to 79.9% with the three most important traits and to 80.1% with all six locomotion traits.
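
A toy NumPy sketch of deriving such traits from keypoint trajectories (the keypoint ordering and the trait formulas below are simplified assumptions, not the paper's definitions):

```python
import numpy as np

def locomotion_traits(kpts):
    """Toy traits from keypoint trajectories, shape (T, K, 2) in image coords.

    Assumed keypoint order (hypothetical): 0 head, 1 withers, 2 mid-back,
    3 tailhead, 4 front hoof, 5 rear hoof.
    """
    head, withers, midback, tail = kpts[:, 0], kpts[:, 1], kpts[:, 2], kpts[:, 3]
    front_hoof, rear_hoof = kpts[:, 4], kpts[:, 5]

    # Back posture: deviation of the mid-back from the withers-tailhead chord
    # (an arched back scores higher).
    chord = tail - withers
    chord_dir = chord / (np.linalg.norm(chord, axis=1, keepdims=True) + 1e-8)
    rel = midback - withers
    along = (rel * chord_dir).sum(axis=1, keepdims=True) * chord_dir
    back_posture = np.linalg.norm(rel - along, axis=1).mean()

    # Head bobbing: variability of the head's vertical position over time.
    head_bobbing = head[:, 1].std()

    # Tracking distance: how closely the rear hoof follows the front hoof.
    tracking_distance = np.linalg.norm(rear_hoof - front_hoof, axis=1).mean()

    return {"back_posture": back_posture,
            "head_bobbing": head_bobbing,
            "tracking_distance": tracking_distance}
```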

[218] Debiasing Multimodal Large Language Models via Penalization of Language Priors

YiFan Zhang, Yang Shi, Weichen Yu, Qingsong Wen, Xue Wang, Wenjing Yang, Zhang Zhang, Liang Wang, Rong Jin

Main category: cs.CV

TL;DR: The paper identifies bias in Multimodal Large Language Models (MLLMs) where generated content relies more on LLM priors than visual inputs. It proposes two training-free strategies to mitigate this bias and improve performance.

DetailsMotivation: To address the observed bias in MLLMs where generated content is overly influenced by LLM priors rather than visual inputs, leading to unreliable outputs.

Method: Two strategies are proposed: 1) ‘Post-Hoc Debias’ for classification/multi-choice tasks, adjusting output distribution via affine calibration; 2) ‘Visual Debias Decoding’ for open-ended tasks, contrasting token log-probabilities with correct vs. meaningless images.

Result: The strategies effectively mitigate bias, reduce hallucinations, and improve output precision. Performance surpasses prior results, highlighting evaluation fairness concerns.

Conclusion: The proposed training-free methods successfully redirect MLLM focus to visual inputs, enhancing reliability and accuracy in multimodal tasks.

Abstract: In the realms of computer vision and natural language processing, Multimodal Large Language Models (MLLMs) have become indispensable tools, proficient in generating textual responses based on visual inputs. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. Empirical experiments underscore the persistence of this bias, as MLLMs often provide confident answers even in the absence of relevant images or given incongruent visual inputs. To rectify these biases and redirect the model’s focus toward visual information, we propose two simple, training-free strategies. First, for tasks such as classification or multi-choice question answering, we introduce a “Post-Hoc Debias” method using an affine calibration step to adjust the output distribution. This approach ensures uniform answer scores when the image is absent, acting as an effective regularization technique to alleviate the influence of LLM priors. For more intricate open-ended generation tasks, we extend this method to “Visual Debias Decoding”, which mitigates bias by contrasting token log-probabilities conditioned on a correct image versus a meaningless one. Additionally, our investigation sheds light on the instability of MLLMs across various decoding configurations. Through systematic exploration of different settings, we achieve significant performance improvements–surpassing previously reported results–and raise concerns about the fairness of current evaluation practices. Comprehensive experiments substantiate the effectiveness of our proposed strategies in mitigating biases. These strategies not only prove beneficial in minimizing hallucinations but also contribute to the generation of more helpful and precise illustrations.
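
Minimal sketches of the two strategies, assuming access to the model's answer probabilities and per-token logits (the exact affine map and contrastive combination rule are assumptions patterned on standard calibration and contrastive decoding):

```python
import torch

def post_hoc_debias(answer_probs, content_free_probs):
    """Affine calibration for classification / multi-choice answers: rescale
    scores so an image-free prompt yields a uniform answer distribution
    (diagonal-calibration sketch; the paper's exact affine map may differ)."""
    w = 1.0 / (content_free_probs + 1e-8)
    scores = answer_probs * w
    return scores / scores.sum(dim=-1, keepdim=True)

def visual_debias_logits(logits_with_image, logits_with_noise, alpha=1.0):
    """Contrastive decoding step for open-ended generation: amplify what the
    correct image contributes beyond a meaningless one (the combination rule
    and alpha are assumptions)."""
    logp_img = torch.log_softmax(logits_with_image, dim=-1)
    logp_noise = torch.log_softmax(logits_with_noise, dim=-1)
    return (1 + alpha) * logp_img - alpha * logp_noise
```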

[219] VPOcc: Exploiting Vanishing Point for 3D Semantic Occupancy Prediction

Junsu Kim, Junhee Lee, Ukcheol Shin, Jean Oh, Kyungdon Joo

Main category: cs.CV

TL;DR: VPOcc is a novel framework using vanishing points to address 2D-3D discrepancies in semantic occupancy prediction, improving accuracy on benchmark datasets.

DetailsMotivation: Accurate 3D scene understanding is vital for robot and autonomous vehicle navigation, but camera-based methods struggle with perspective distortions.

Method: VPOcc employs VPZoomer for pixel-level image warping, VPCA for feature-level perspective-aware aggregation, and SVF for feature fusion.

Result: The framework enhances IoU and mIoU metrics on SemanticKITTI and SSCBench-KITTI360 datasets.

Conclusion: VPOcc effectively mitigates 2D-3D discrepancies, advancing 3D semantic occupancy prediction for robot vision.

Abstract: Understanding 3D scenes semantically and spatially is crucial for the safe navigation of robots and autonomous vehicles, aiding obstacle avoidance and accurate trajectory planning. Camera-based 3D semantic occupancy prediction, which infers complete voxel grids from 2D images, is gaining importance in robot vision for its resource efficiency compared to 3D sensors. However, this task inherently suffers from a 2D-3D discrepancy, where objects of the same size in 3D space appear at different scales in a 2D image depending on their distance from the camera due to perspective projection. To tackle this issue, we propose a novel framework called VPOcc that leverages a vanishing point (VP) to mitigate the 2D-3D discrepancy at both the pixel and feature levels. As a pixel-level solution, we introduce a VPZoomer module, which warps images by counteracting the perspective effect using a VP-based homography transformation. In addition, as a feature-level solution, we propose a VP-guided cross-attention (VPCA) module that performs perspective-aware feature aggregation, utilizing 2D image features that are more suitable for 3D space. Lastly, we integrate two feature volumes extracted from the original and warped images to compensate for each other through a spatial volume fusion (SVF) module. By effectively incorporating VP into the network, our framework achieves improvements in both IoU and mIoU metrics on SemanticKITTI and SSCBench-KITTI360 datasets. Additional details are available at https://vision3d-lab.github.io/vpocc/.
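
A rough OpenCV sketch of the VPZoomer idea of warping the image to counteract perspective around the vanishing point (the quadrilateral construction and zoom factor are our assumptions, not the paper's VP-based homography):

```python
import cv2
import numpy as np

def vp_zoom_warp(img, vp, zoom=0.4):
    """Warp an image so regions near the vanishing point are magnified.

    Picks a source quadrilateral shrunk toward the VP and stretches it to
    the full frame (illustrative construction only).
    """
    h, w = img.shape[:2]
    dst = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    vp = np.float32(vp)
    # Move each corner a fraction of the way toward the vanishing point.
    src = dst + zoom * (vp[None] - dst)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, H, (w, h))
```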

[220] MinD-3D++: Advancing fMRI-Based 3D Reconstruction with High-Quality Textured Mesh Generation and a Comprehensive Dataset

Jianxiong Gao, Yanwei Fu, Yuqian Fu, Yun Wang, Xuelin Qian, Jianfeng Feng

Main category: cs.CV

TL;DR: The paper introduces Recon3DMind, a method for reconstructing 3D visuals from fMRI data, and presents the fMRI-3D dataset. It also proposes MinD-3D++, a framework for decoding textured 3D visuals from fMRI signals, achieving high accuracy and new benchmarks.

DetailsMotivation: Advancing the reconstruction of 3D visuals from fMRI data is crucial for cognitive neuroscience and computer vision, enabling deeper understanding of brain processes.

Method: The paper introduces the fMRI-3D dataset and proposes MinD-3D++, a framework for decoding textured 3D visuals from fMRI signals, with new evaluation metrics.

Result: MinD-3D++ reconstructs 3D objects with high semantic and spatial accuracy and generates textured meshes, providing insights into brain processing of 3D visuals.

Conclusion: The work advances fMRI-based 3D reconstruction, offering a novel dataset and framework with significant potential for neuroscience and vision research.

Abstract: Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4,768 3D objects. The dataset consists of two components: fMRI-Shape, previously introduced and available at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape, and fMRI-Objaverse, proposed in this paper and available at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Objaverse. fMRI-Objaverse includes data from 5 subjects, 4 of whom are also part of the core set in fMRI-Shape. Each subject views 3,142 3D objects across 117 categories, all accompanied by text captions. This significantly enhances the diversity and potential applications of the dataset. Moreover, we propose MinD-3D++, a novel framework for decoding textured 3D visual information from fMRI signals. The framework evaluates the feasibility of not only reconstructing 3D objects from the human mind but also generating, for the first time, 3D textured meshes with detailed textures from fMRI data. We establish new benchmarks by designing metrics at the semantic, structural, and textured levels to evaluate model performance. Furthermore, we assess the model’s effectiveness in out-of-distribution settings and analyze the attribution of the proposed 3D paired fMRI dataset in visual regions of interest (ROIs) in fMRI signals. Our experiments demonstrate that MinD-3D++ not only reconstructs 3D objects with high semantic and spatial accuracy but also provides deeper insights into how the human brain processes 3D visual information. Project page: https://jianxgao.github.io/MinD-3D.

[221] CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models

Junho Kim, Hyungjin Chung, Byung-Hoon Kim

Main category: cs.CV

TL;DR: CapeLLM is a multimodal large language model (MLLM) for category-agnostic pose estimation (CAPE), using query images and text descriptions without relying on annotated support images. It introduces effective training, inference mechanisms, and achieves state-of-the-art results on the MP-100 benchmark.

DetailsMotivation: Traditional CAPE methods rely on annotated support images, which are cumbersome and may lack generalization. Text queries offer stability but are limited by reliance on support queries and underutilization of pre-trained language models.

Method: CapeLLM leverages MLLMs for CAPE, using query images and text descriptions. It includes training strategies, instruction design, and an inference mechanism for reasoning about unseen keypoints and modeling spatial distribution.

Result: CapeLLM achieves state-of-the-art performance on the MP-100 benchmark in 1-shot and 5-shot settings, demonstrating robustness and adaptability.

Conclusion: CapeLLM advances CAPE by eliminating the need for annotated support images, leveraging MLLMs, and setting new benchmarks, with potential for broader applications.

Abstract: Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have explored the use of text queries, leveraging their enhanced stability and generalization capabilities. However, existing approaches often remain constrained by their reliance on support queries, their failure to fully utilize the rich priors embedded in pre-trained large language models, and the limitations imposed by their parametric distribution assumptions. To address these challenges, we introduce CapeLLM, the first multimodal large language model (MLLM) designed for CAPE. Our method employs only a query image and detailed text descriptions as input to estimate category-agnostic keypoints. It encompasses effective training strategies and carefully designed instructions for applying the MLLM to CAPE. Moreover, we propose an inference mechanism that further enhances the reasoning process for unseen keypoints, while flexibly modeling their underlying spatial distribution and uncertainty, allowing for adaptive refinement based on contextual cues. We conducted extensive experiments to apply the MLLM to CAPE effectively, focusing not only on the model architecture and prompt design but also on ensuring robustness across input variations. Our approach sets a new state-of-the-art on the MP-100 benchmark in both the 1-shot and 5-shot settings, marking a significant advancement in the field of category-agnostic pose estimation. Code is available at https://github.com/Junhojuno/CapeLLM.

[222] Re:Verse – Can Your VLM Read a Manga?

Aaditya Baranwal, Madhav Kataria, Naitik Agrawal, Yogesh S Rawat, Shruti Vyas

Main category: cs.CV

TL;DR: The paper highlights a gap in Vision Language Models (VLMs) for deep narrative reasoning in sequential visual storytelling, introduces a novel evaluation framework, and reveals limitations in current models for long-form narrative understanding.

DetailsMotivation: To address the critical gap in VLMs' ability to perform deep narrative reasoning, particularly in sequential visual storytelling like manga, and to evaluate their limitations systematically.

Method: The study employs a novel framework combining fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment. It includes rigorous annotation, evaluation across reasoning paradigms, and cross-modal similarity analysis.

Result: Current VLMs fail at temporal causality and cross-panel cohesion, struggling with non-linear narratives, character consistency, and causal inference.

Conclusion: The work provides foundational insights and methodologies for evaluating narrative intelligence in VLMs, emphasizing the need for improved deep sequential understanding.

Abstract: Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation, and (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs’ joint representations. Applying this framework to Re:Zero manga across 11 chapters with 308 annotated panels, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning. Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and practical methodology for evaluating narrative intelligence, while providing actionable insights into the capability of deep sequential understanding of Discrete Visual Narratives beyond basic recognition in Multimodal Models. Project Page: https://re-verse.vercel.app

[223] Quantum-Brain: Quantum-Inspired Neural Network Approach to Vision-Brain Understanding

Hoang-Quan Nguyen, Xuan-Bac Nguyen, Hugh Churchill, Arabinda Kumar Choudhary, Pawan Sinha, Samee U. Khan, Khoa Luu

Main category: cs.CV

TL;DR: A quantum-inspired neural network, Quantum-Brain, is proposed to enhance vision-brain understanding by modeling brain signal connectivities using quantum computing principles, achieving high accuracy in benchmarks.

DetailsMotivation: Existing deep learning methods lack the ability to model brain region connectivities, while quantum computing's entanglement properties offer a promising paradigm for this challenge.

Method: The approach includes a Quantum-Inspired Voxel-Controlling module for connectivity computation, a Phase-Shifting module for signal calibration, and a Measurement-like Projection module for feature space representation.

Result: Achieved Top-1 accuracies of 95.1% (image retrieval) and 95.6% (brain retrieval), and an Inception score of 95.3% (fMRI-to-image reconstruction).

Conclusion: The Quantum-Brain approach effectively learns brain signal connectivities and enhances semantic information extraction, demonstrating the potential of quantum-inspired models for vision-brain problems.

Abstract: Vision-brain understanding aims to extract semantic information about brain signals from human perceptions. Existing deep learning methods for vision-brain understanding are usually introduced in a traditional learning paradigm that lacks the ability to learn the connectivities between brain regions. Meanwhile, quantum computing theory offers a new paradigm for designing deep learning models. Motivated by the connectivities in brain signals and the entanglement properties in quantum computing, we propose a novel Quantum-Brain approach, a quantum-inspired neural network, to tackle the vision-brain understanding problem. To compute the connectivity between areas in brain signals, we introduce a new Quantum-Inspired Voxel-Controlling module to learn the impact of a brain voxel on others represented in the Hilbert space. To effectively learn connectivity, a novel Phase-Shifting module is presented to calibrate the value of the brain signals. Finally, we introduce a new Measurement-like Projection module to project the connectivity information from the Hilbert space into the feature space. The proposed approach can learn to find the connectivities between fMRI voxels and enhance the semantic information obtained from human perceptions. Our experimental results on the Natural Scene Dataset benchmarks illustrate the effectiveness of the proposed method, with Top-1 accuracies of 95.1% and 95.6% on image and brain retrieval tasks and an Inception score of 95.3% on the fMRI-to-image reconstruction task. The proposed quantum-inspired network offers a promising paradigm for solving vision-brain problems via quantum computing theory.

[224] Continual Learning for Multiple Modalities

Hyundong Jin, Eunwoo Kim

Main category: cs.CV

TL;DR: A novel continual learning framework for multiple modalities (image, video, audio, depth, text) is proposed, mitigating forgetting by aligning modalities with text and consolidating intra- and inter-modal knowledge.

DetailsMotivation: Existing continual learning methods focus on single modalities, limiting their use in multi-modal scenarios. This work addresses the challenge of learning and retaining knowledge across diverse modalities.

Method: The framework aligns modalities with text, self-regulates representation shifts, and selectively integrates inter-modal knowledge. It also re-aligns modality embeddings to address biased alignment.

Result: Extensive experiments show the method outperforms existing approaches in various continual learning scenarios, regardless of modality identity.

Conclusion: The proposed framework effectively handles multi-modal continual learning, reducing forgetting and improving performance across diverse tasks.

Abstract: Continual learning aims to learn knowledge of tasks observed in sequential time steps while mitigating the forgetting of previously learned knowledge. Existing methods were designed to learn a single modality (e.g., image) over time, which limits their applicability in scenarios involving multiple modalities. In this work, we propose a novel continual learning framework that accommodates multiple modalities (image, video, audio, depth, and text). We train a model to align various modalities with text, leveraging its rich semantic information. However, this increases the risk of forgetting previously learned knowledge, exacerbated by the differing input traits across tasks. To alleviate the overwriting of previous knowledge of modalities, we propose a framework that consolidates intra-modal knowledge while incorporating relevant inter-modal information. This is achieved by self-regulating shifts in learned representations to gradually integrate novel knowledge into the information retained across modalities. Simultaneously, it mitigates inter-modal interference by selectively integrating knowledge from previously encountered modalities based on their mutual relevance. Furthermore, we introduce a strategy to re-align modality embeddings, effectively addressing biased alignment between modalities. We evaluate the proposed method in a wide range of continual learning scenarios using multiple datasets with different modalities. Extensive experiments demonstrate that our method outperforms existing approaches across these scenarios, regardless of whether the identity of the modality is given.

[225] MyTimeMachine: Personalized Facial Age Transformation

Luchao Qi, Jiaye Wu, Bang Gong, Annie N. Wang, David W. Jacobs, Roni Sengupta

Main category: cs.CV

TL;DR: MyTimeMachine (MyTM) combines global aging prior with personal photos to create personalized age transformations using an Adapter Network and StyleGAN2, outperforming existing methods.

DetailsMotivation: Existing aging techniques lack personalization, often failing to resemble actual appearances at target ages, despite access to personal photo collections.

Method: Proposes MyTM with an Adapter Network integrating global and personalized aging features, using StyleGAN2 and three novel loss functions for personalization.

Result: Achieves high-quality, identity-preserving, and temporally consistent aging effects, resembling actual appearances at target ages.

Conclusion: MyTM demonstrates superior performance over state-of-the-art methods, especially in video applications.

Abstract: Facial aging is a complex process, highly dependent on multiple factors like gender, ethnicity, lifestyle, etc., making it extremely challenging to learn a global aging prior that accurately predicts aging for any individual. Existing techniques often produce realistic and plausible aging results, but the re-aged images often do not resemble the person’s appearance at the target age and thus need personalization. In many practical applications of virtual aging, e.g. VFX in movies and TV shows, access to a personal photo collection of the user depicting aging in a small time interval (20$\sim$40 years) is often available. However, naive attempts to personalize global aging techniques on personal photo collections often fail. Thus, we propose MyTimeMachine (MyTM), which combines a global aging prior with a personal photo collection (using as few as 50 images) to learn a personalized age transformation. We introduce a novel Adapter Network that combines personalized aging features with global aging features and generates a re-aged image with StyleGAN2. We also introduce three loss functions to personalize the Adapter Network: a personalized aging loss, extrapolation regularization, and adaptive w-norm regularization. Our approach can also be extended to videos, achieving high-quality, identity-preserving, and temporally consistent aging effects that resemble actual appearances at target ages, demonstrating its superiority over state-of-the-art approaches.

[226] Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives

Alex Hanson, Allen Tu, Geng Lin, Vasu Singla, Matthias Zwicker, Tom Goldstein

Main category: cs.CV

TL;DR: Speedy-Splat optimizes 3D-GS by improving rendering speed and reducing model size through pipeline optimization and pruning.

DetailsMotivation: 3D-GS faces inefficiencies in rendering speed and model size, limiting its use in resource-constrained settings.

Method: Optimized rendering pipeline for precise Gaussian localization and introduced pruning to reduce model size and training time.

Result: Achieved a 6.71× average rendering speed improvement across multiple datasets.

Conclusion: Speedy-Splat effectively addresses key inefficiencies in 3D-GS, enhancing performance and practicality.

Abstract: 3D Gaussian Splatting (3D-GS) is a recent 3D scene reconstruction technique that enables real-time rendering of novel views by modeling scenes as parametric point clouds of differentiable 3D Gaussians. However, its rendering speed and model size still present bottlenecks, especially in resource-constrained settings. In this paper, we identify and address two key inefficiencies in 3D-GS to substantially improve rendering speed. These improvements also yield the ancillary benefits of reduced model size and training time. First, we optimize the rendering pipeline to precisely localize Gaussians in the scene, boosting rendering speed without altering visual fidelity. Second, we introduce a novel pruning technique and integrate it into the training pipeline, significantly reducing model size and training time while further raising rendering speed. Our Speedy-Splat approach combines these techniques to accelerate average rendering speed by a drastic $\mathit{6.71\times}$ across scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets.
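
An illustrative sketch of the pruning half of the pipeline: rank Gaussians by a cheap importance proxy and keep the top fraction (the scoring proxy and keep ratio are assumptions; the paper's criterion and its integration into training differ):

```python
import torch

def prune_gaussians(means, opacities, scales, keep_ratio=0.5):
    """Keep the top fraction of Gaussians by a cheap importance proxy
    (opacity times extent). Hypothetical scoring for illustration only."""
    score = opacities.squeeze(-1) * scales.prod(dim=-1)
    keep = score.topk(int(keep_ratio * score.numel())).indices
    return means[keep], opacities[keep], scales[keep]
```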

[227] Improving Viewpoint Consistency in 3D Generation via Structure Feature and CLIP Guidance

Qing Zhang, Jinguang Tong, Jing Zhang, Jie Hong, Xuesong Li

Main category: cs.CV

TL;DR: The paper proposes ACG, a tuning-free method to address the Janus Problem in text-to-3D generation by correcting viewpoint bias in diffusion models.

DetailsMotivation: Current text-to-3D methods suffer from geometric inconsistencies (Janus Problem) due to viewpoint generation bias in diffusion models.

Method: ACG adaptively controls cross-attention maps, uses CLIP-based filtering for viewpoints, and employs coarse-to-fine optimization with staged prompts.

Result: ACG significantly reduces the Janus Problem without affecting generation speed, proving effective as a plug-and-play solution.

Conclusion: ACG is an efficient and adaptable solution for improving text-to-3D generation by addressing viewpoint bias.

Abstract: Despite recent advances in text-to-3D generation techniques, current methods often suffer from geometric inconsistencies, commonly referred to as the Janus Problem. This paper identifies the root cause of the Janus Problem: viewpoint generation bias in diffusion models, which creates a significant gap between the actual generated viewpoint and the expected one required for optimizing the 3D model. To address this issue, we propose a tuning-free approach called the Attention and CLIP Guidance (ACG) mechanism. ACG enhances desired viewpoints by adaptively controlling cross-attention maps, employs CLIP-based view-text similarities to filter out erroneous viewpoints, and uses a coarse-to-fine optimization strategy with staged prompts to progressively refine 3D generation. Extensive experiments demonstrate that our method significantly reduces the Janus Problem without compromising generation speed, establishing ACG as an efficient, plug-and-play component for existing text-to-3D frameworks.
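
A minimal sketch of the CLIP-based viewpoint filter, using OpenAI's CLIP package and a hypothetical object prompt (the margin rule is our assumption):

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

views = ["front view", "side view", "back view"]
text = clip.tokenize([f"a {v} of a corgi" for v in views]).to(device)  # hypothetical prompt

def view_is_consistent(render: Image.Image, expected_idx: int, margin=0.0):
    """Keep a rendered view only if CLIP rates the expected view-text highest."""
    image = preprocess(render).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1)[0]
    return bool(probs[expected_idx] >= probs.max() - margin)
```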

[228] DGNS: Deformable Gaussian Splatting and Dynamic Neural Surface for Monocular Dynamic 3D Reconstruction

Xuesong Li, Jinguang Tong, Jie Hong, Vivien Rolland, Lars Petersson

Main category: cs.CV

TL;DR: DGNS combines Deformable Gaussian Splatting and Dynamic Neural Surfaces for dynamic scene reconstruction and novel-view synthesis, achieving state-of-the-art results.

DetailsMotivation: Dynamic scene reconstruction from monocular video is crucial for real-world applications, requiring simultaneous novel-view synthesis and 3D geometry reconstruction.

Method: DGNS integrates deformable Gaussian splatting for depth guidance and dynamic neural surfaces for geometry reconstruction, with a depth-filtering approach for refinement.

Result: DGNS achieves state-of-the-art 3D reconstruction and competitive novel-view synthesis on public datasets.

Conclusion: The hybrid framework effectively addresses dynamic scene reconstruction, demonstrating superior performance in both geometry and rendering quality.

Abstract: Dynamic scene reconstruction from monocular video is essential for real-world applications. We introduce DGNS, a hybrid framework integrating Deformable Gaussian Splatting and Dynamic Neural Surfaces, effectively addressing dynamic novel-view synthesis and 3D geometry reconstruction simultaneously. During training, depth maps generated by the deformable Gaussian splatting module guide the ray sampling for faster processing and provide depth supervision within the dynamic neural surface module to improve geometry reconstruction. Conversely, the dynamic neural surface directs the distribution of Gaussian primitives around the surface, enhancing rendering quality. In addition, we propose a depth-filtering approach to further refine depth supervision. Extensive experiments conducted on public datasets demonstrate that DGNS achieves state-of-the-art performance in 3D reconstruction, along with competitive results in novel-view synthesis.

[229] DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction

Ben Kaye, Tomas Jakab, Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi

Main category: cs.CV

TL;DR: The paper introduces Dual Point Maps (DualPM) for 3D reconstruction of deformable objects, extending viewpoint-invariant point maps to handle pose and shape, and demonstrates its effectiveness with synthetic data.

DetailsMotivation: The success of deep learning in geometric tasks depends on data representation. The paper aims to generalize the concept of viewpoint-invariant point maps to deformable objects.

Method: DualPM involves predicting a pair of point maps from an image: one for 3D object locations and another for a canonical rest pose. It also extends to amodal reconstruction for complete shape recovery.

Result: DualPMs enable 3D reconstruction and pose estimation, trained on synthetic data with few models per category, and generalize well to real images, outperforming previous methods.

Conclusion: DualPMs are a robust representation for deep networks, achieving significant improvements in 3D analysis and reconstruction of deformable objects.

Abstract: The choice of data representation is a key factor in the success of deep learning in geometric tasks. For instance, DUSt3R recently introduced the concept of viewpoint-invariant point maps, generalizing depth prediction and showing that all key problems in the 3D reconstruction of static scenes can be reduced to predicting such point maps. In this paper, we develop an analogous concept for a very different problem: the reconstruction of the 3D shape and pose of deformable objects. To this end, we introduce Dual Point Maps (DualPM), where a pair of point maps is extracted from the same image: one associating pixels with their 3D locations on the object, and the other with a canonical version of the object in its rest pose. We also extend point maps to amodal reconstruction to recover the complete shape of the object, even through self-occlusions. We show that 3D reconstruction and 3D pose estimation can be reduced to the prediction of DualPMs. Empirically, we demonstrate that this representation is a suitable target for deep networks to predict. Specifically, we focus on modeling quadrupeds, showing that DualPMs can be trained purely on synthetic 3D data, consisting of one or two models per category, while generalizing effectively to real images. With this approach, we achieve significant improvements over previous methods for the 3D analysis and reconstruction of such objects.
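
The output structure is easy to picture as a two-headed decoder over shared image features; a minimal PyTorch sketch (the architecture is an illustrative assumption, not the paper's network):

```python
import torch.nn as nn

class DualPointMapHead(nn.Module):
    """Predicts two 3-channel point maps from a shared feature map: per-pixel
    posed (camera-frame) 3D coordinates and per-pixel canonical rest-pose
    coordinates (hypothetical head for illustration)."""
    def __init__(self, in_ch=256):
        super().__init__()
        self.posed = nn.Conv2d(in_ch, 3, kernel_size=1)
        self.canonical = nn.Conv2d(in_ch, 3, kernel_size=1)

    def forward(self, feats):  # feats: (B, C, H, W)
        return self.posed(feats), self.canonical(feats)
```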

[230] Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

Qifan Yu, Zhebei Shen, Zhongqi Yue, Yang Wu, Bosheng Qin, Wenqiao Zhang, Yunfei Li, Juncheng Li, Siliang Tang, Yueting Zhuang

Main category: cs.CV

TL;DR: DataTailor is a framework for efficient data selection in instruction tuning of MLLMs, reducing redundancy and computational costs while maintaining performance.

DetailsMotivation: Addressing data redundancy and high computational costs in visual instruction datasets for MLLMs.

Method: Proposes DataTailor, a framework using informativeness, uniqueness, and representativeness principles for data selection, with adaptive scoring.

Result: Achieves 101.3% of full-data fine-tuning performance with only 15% of the data, substantially reducing computational costs.

Conclusion: Demonstrates the ‘Less is More’ philosophy, proving efficient data selection can maintain or improve performance.

Abstract: Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles (informativeness, uniqueness, and representativeness) for effective data selection. We argue that a valuable sample should be informative of the task, non-redundant, and represent the sample distribution (i.e., not an outlier). We further propose practical ways to score against each principle, which automatically adapt to a given dataset without tedious hyperparameter tuning. Comprehensive experiments on various benchmarks demonstrate that DataTailor achieves 101.3% of the performance of full-data fine-tuning with only 15% of the data, significantly reducing computational costs while maintaining superior results. This exemplifies the “Less is More” philosophy in MLLM development. The code and data are available at https://github.com/Yuqifan1117/DataTailor.
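
A toy greedy selector in the spirit of the informativeness/uniqueness trade-off (the representativeness term is omitted, and the greedy rule and weight lam are our assumptions, not DataTailor's adaptive scoring):

```python
import numpy as np

def greedy_select(importance, features, budget_frac=0.15, lam=0.5):
    """Pick samples that are informative yet non-redundant with what is
    already selected (toy stand-in, not the paper's method)."""
    importance = np.asarray(importance, dtype=float)
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    n = len(importance)
    picked, max_sim = [], np.zeros(n)
    for _ in range(int(budget_frac * n)):
        score = importance - lam * max_sim
        score[picked] = -np.inf          # never re-pick a selected sample
        j = int(score.argmax())
        picked.append(j)
        # Redundancy = highest cosine similarity to any selected sample.
        max_sim = np.maximum(max_sim, feats @ feats[j])
    return picked
```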

[231] Vision Transformers in Precision Agriculture: A Comprehensive Survey

Saber Mehdipour, Seyed Abolghasem Mirroshandel, Seyed Amirhossein Tabatabaei

Main category: cs.CV

TL;DR: The paper reviews the application of Vision Transformers (ViTs) in precision agriculture for plant disease detection, comparing them with traditional methods like CNNs and discussing challenges and future directions.

DetailsMotivation: To address limitations of traditional plant disease detection methods (manual inspection and CNNs) by leveraging ViTs for improved scalability and accuracy.

Method: Reviews ViT architecture, its transition from NLP to computer vision, and compares it with CNNs. Analyzes methodologies, datasets, performance metrics, and hybrid models.

Result: ViTs offer advantages like better long-range dependency handling and scalability, though challenges like data requirements and computational demands exist.

Conclusion: ViTs have transformative potential for precision agriculture, with future research needed to address technical challenges and enhance real-world integration.

Abstract: Detecting plant diseases is a crucial aspect of modern agriculture, as it plays a key role in maintaining crop health and increasing overall yield. Traditional approaches, though still valuable, often rely on manual inspection or conventional machine learning techniques, both of which face limitations in scalability and accuracy. Recently, Vision Transformers (ViTs) have emerged as a promising alternative, offering advantages such as improved handling of long-range dependencies and better scalability for visual tasks. This review explores the application of ViTs in precision agriculture, covering a range of tasks. We begin by introducing the foundational architecture of ViTs and discussing their transition from Natural Language Processing (NLP) to Computer Vision. The discussion includes the concept of inductive bias in traditional models like Convolutional Neural Networks (CNNs), and how ViTs mitigate these biases. We provide a comprehensive review of recent literature, focusing on key methodologies, datasets, and performance metrics. This study also includes a comparative analysis of CNNs and ViTs, along with a review of hybrid models and performance enhancements. Technical challenges such as data requirements, computational demands, and model interpretability are addressed, along with potential solutions. Finally, we outline future research directions and technological advancements that could further support the integration of ViTs in real-world agricultural settings. Our goal with this study is to offer practitioners and researchers a deeper understanding of how ViTs are poised to transform smart and precision agriculture.

[232] Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation

Yuxuan Wang, Xuanyu Yi, Haohan Weng, Qingshan Xu, Xiaokang Wei, Xianghui Yang, Chunchao Guo, Long Chen, Hanwang Zhang

Main category: cs.CV

TL;DR: Nautilus is a locality-aware autoencoder for generating high-fidelity, scalable triangle meshes, outperforming current methods with up to 5,000 faces.

DetailsMotivation: Current mesh generation methods produce suboptimal outputs due to reliance on intermediate representations or face count limitations.

Method: Nautilus uses a novel tokenization algorithm and Dual-stream Point Conditioner for local and global mesh fidelity.

Result: Nautilus achieves unprecedented scalability (5,000 faces) and outperforms state-of-the-art methods in fidelity.

Conclusion: Nautilus advances mesh generation by combining locality-awareness and multi-scale guidance for high-quality results.

Abstract: Triangle meshes are fundamental to 3D applications, enabling efficient modification and rasterization while maintaining compatibility with standard rendering pipelines. However, current automatic mesh generation methods typically rely on intermediate representations that lack the continuous surface quality inherent to meshes. Converting these representations into meshes produces dense, suboptimal outputs. Although recent autoregressive approaches demonstrate promise in directly modeling mesh vertices and faces, they are constrained by limitations in face count, scalability, and structural fidelity. To address these challenges, we propose Nautilus, a locality-aware autoencoder for artist-like mesh generation that leverages the local properties of manifold meshes to achieve structural fidelity and efficient representation. Our approach introduces a novel tokenization algorithm that preserves face proximity relationships and compresses sequence length through locally shared vertices and edges, enabling the generation of meshes with an unprecedented scale of up to 5,000 faces. Furthermore, we develop a Dual-stream Point Conditioner that provides multi-scale geometric guidance, ensuring global consistency and local structural fidelity by capturing fine-grained geometric features. Extensive experiments demonstrate that Nautilus significantly outperforms state-of-the-art methods in both fidelity and scalability. The project page is at https://nautilusmeshgen.github.io.

[233] NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning

Zhixi Cai, Fucai Ke, Simindokht Jahangard, Maria Garcia de la Banda, Reza Haffari, Peter J. Stuckey, Hamid Rezatofighi

Main category: cs.CV

TL;DR: NAVER, a compositional visual grounding method, integrates probabilistic logic reasoning and a self-correcting mechanism, achieving state-of-the-art performance in visual grounding tasks.

DetailsMotivation: To address limitations in current visual grounding methods, especially in complex reasoning tasks requiring human-like cognition, by leveraging explicit probabilistic logic reasoning.

Method: Proposes NAVER, a compositional method combining large language models (LLMs) and foundation models with probabilistic logic reasoning in a finite-state automaton, featuring a self-correcting mechanism.

Result: NAVER outperforms recent end-to-end and compositional baselines, achieving state-of-the-art performance.

Conclusion: NAVER enhances robustness and interpretability in visual grounding through explicit logic reasoning, offering a promising solution for complex reasoning tasks.

Abstract: Visual Grounding (VG) tasks, such as referring expression detection and segmentation, are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting challenges for methods on tasks that require human-like reasoning. Recent advances in large language models (LLMs) and vision-language models (VLMs) have improved abilities for visual comprehension, contextual understanding, and reasoning. These methods are mainly split into end-to-end and compositional methods, with the latter offering more flexibility. Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning with language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. This design improves robustness and interpretability in inference through explicit logic reasoning. Our results show that NAVER achieves SoTA performance compared to recent end-to-end and compositional baselines. The code is available at https://github.com/ControlNet/NAVER.

[234] MIDAS: Modeling Ground-Truth Distributions with Dark Knowledge for Domain Generalized Stereo Matching

Peng Xu, Zhiyu Xiang, Jingyun Fu, Tianyu Pu, Hanzhi Zhong, Eryun Liu

Main category: cs.CV

TL;DR: A method to improve domain generalization in stereo matching by extracting similarity and uncertainty information from pre-trained networks, using network ensemble and modeling knowledge as a mixture of Laplacians.

DetailsMotivation: Existing stereo matching methods struggle with domain-specific preferences when transferring from synthetic to real domains, limiting practical use in diverse scenarios.

Method: Extracts similarity and uncertainty (dark knowledge) from pre-trained networks, uses network ensemble to distinguish objective/biased knowledge, and models supervision as a mixture of Laplacians.

Result: Improves generalization of existing networks; PCWNet achieves state-of-the-art performance on KITTI datasets and outperforms others in real-world datasets.

Conclusion: The proposed method effectively enhances generalization in stereo matching, demonstrating superior performance across multiple datasets.

Abstract: Despite the significant advances in domain generalized stereo matching, existing methods still exhibit domain-specific preferences when transferring from synthetic to real domains, hindering their practical applications in complex and diverse scenarios. The probability distributions predicted by the stereo network naturally encode rich similarity and uncertainty information. Inspired by this observation, we propose to extract these two types of dark knowledge from the pre-trained network to model intuitive multi-modal ground-truth distributions for both edge and non-edge regions. To mitigate the inherent domain preferences of a single network, we adopt network ensemble and further distinguish between objective and biased knowledge in the Laplace parameter space. Finally, the objective knowledge and the original disparity labels are jointly modeled as a mixture of Laplacians to provide fine-grained supervision for the stereo network training. Extensive experiments demonstrate that: (1) Our method is generic and effectively improves the generalization of existing networks. (2) PCWNet with our method achieves the state-of-the-art generalization performance on both KITTI 2015 and 2012 datasets. (3) Our method outperforms existing methods in comprehensive ranking across four popular real-world datasets.
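
A simplified PyTorch sketch of supervising a disparity prediction with a per-pixel mixture of Laplacians (the tensor layout and the point-estimate NLL form are assumptions; the paper supervises the network's full predicted distribution):

```python
import torch

def mixture_laplacian_nll(pred_disp, modes, weights, scales):
    """NLL of a predicted disparity under a per-pixel mixture of Laplacians.

    Shapes (assumed): pred_disp (B, H, W); modes/weights/scales (B, M, H, W)
    for M mixture components.
    """
    # Laplace log-density per component: -|x - mu| / b - log(2b).
    log_comp = (-(pred_disp.unsqueeze(1) - modes).abs() / scales
                - torch.log(2 * scales))
    log_mix = torch.logsumexp(torch.log(weights + 1e-8) + log_comp, dim=1)
    return -log_mix.mean()
```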

[235] Just Functioning as a Hook for Two-Stage Referring Multi-Object Tracking

Weize Li, Yunhao Du, Qixiang Yin, Zhicheng Zhao, Fei Su, Daqi Liu

Main category: cs.CV

TL;DR: JustHook is a novel two-stage RBT framework for RMOT, improving subtask interaction and generalization by introducing a Hook module and Parallel Combined Decoder, achieving state-of-the-art results.

DetailsMotivation: Current RMOT frameworks lack sufficient modeling of subtask interactions and rely inflexibly on semantic alignment modules like CLIP.

Method: Proposes JustHook with a Hook module for feature-level grid sampling and a Parallel Combined Decoder for joint feature learning.

Result: Achieves +6.9% HOTA improvement on Refer-KITTI-V2 with superior efficiency.

Conclusion: JustHook enhances interpretability, modularity, and generalization, setting new benchmarks in RMOT.

Abstract: Referring Multi-Object Tracking (RMOT) aims to localize target trajectories in videos specified by natural language expressions. Despite recent progress, the intrinsic relationship between the two subtasks of tracking and referring in RMOT has not been fully studied. In this paper, we present a systematic analysis of their interdependence, revealing that current two-stage Referring-by-Tracking (RBT) frameworks remain fundamentally limited by insufficient modeling of subtask interactions and inflexible reliance on semantic alignment modules like CLIP. To this end, we propose JustHook, a novel two-stage RBT framework in which a Hook module is first designed to redefine the linkage between the subtasks. The Hook is built around feature-level grid sampling and performs context-aware target feature extraction. Moreover, we propose a Parallel Combined Decoder (PCD) that learns in a unified joint feature space rather than relying on pre-defined cross-modal embeddings. Our design not only enhances interpretability and modularity but also significantly improves generalization. Extensive experiments on Refer-KITTI, Refer-KITTI-V2, and Refer-Dance demonstrate that JustHook achieves state-of-the-art performance, improving HOTA by +6.9% on Refer-KITTI-V2 with superior efficiency. Code will be available soon.

[236] Scaling Open-Vocabulary Action Detection

Zhen Hao Sia, Yogesh Singh Rawat

Main category: cs.CV

TL;DR: The paper addresses scaling open-vocabulary action detection by introducing a lightweight encoder-only multimodal model and a weakly supervised training strategy, while proposing a new benchmark for evaluation.

DetailsMotivation: Existing action detection methods are limited to closed-set scenarios and rely on complex architectures, making open-vocabulary adaptation challenging due to dataset and overfitting issues.

Method: An encoder-only multimodal model reduces parameter-heavy additions, and a weakly supervised strategy leverages closed-set datasets for pretraining. A new benchmark is introduced for evaluation.

Result: The approach provides novel baselines for open-vocabulary action detection, avoiding reliance on base-to-novel benchmarks.

Conclusion: The proposed model and benchmark offer a scalable solution for open-vocabulary action detection, with potential for future research.

Abstract: In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) parameter-heavy adaptations to a pretrained vision-language contrastive model to convert it for detection, risking overfitting the additional non-pretrained parameters to base action classes. Firstly, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions for video action detection. Secondly, we introduce a simple weakly supervised training strategy to exploit an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior works in open-vocabulary action detection and devise a new benchmark to evaluate on existing closed-set action detection datasets without ever using them for training, showing novel results to serve as baselines for future work. Our code is available at https://siatheindochinese.github.io/sia_act_page/ .

[237] OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM

Jinhong Wang, Shuo Tong, Jian liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Hongxia Xu, Danny Chen, Jintai Chen, Jian Wu

Main category: cs.CV

TL;DR: OrderChain improves MLLMs’ ordinal regression performance via specificity and commonality modeling, achieving significant accuracy boosts on diverse datasets.

DetailsMotivation: Addressing the underperformance of MLLMs in ordinal regression tasks.

Method: Introduces OrderChain with task-aware prompts and RO-CoT for commonality modeling, plus CRD for automatic optimization.

Result: LLaVA with OrderChain achieves up to 93.2% accuracy on Adience and outperforms SOTA methods by 27%.

Conclusion: OrderChain effectively enhances MLLMs for OR tasks, demonstrating broad applicability and superior performance.

Abstract: Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. Specifically, our OrderChain consists of a set of task-aware prompts to facilitate the specificity modeling of diverse OR tasks and a new range optimization Chain-of-Thought (RO-CoT), which learns a common way of thinking about OR tasks by uniformly decomposing them into multiple small-range optimization subtasks. Further, we propose a category recursive division (CRD) method to generate instruction candidate category prompts to support RO-CoT automatic optimization. Comprehensive experiments show that the LLaVA model with our OrderChain improves baseline LLaVA significantly on diverse OR datasets, e.g., from 47.5% to 93.2% accuracy on the Adience dataset for age estimation, and from 30.0% to 85.7% accuracy on the Diabetic Retinopathy dataset. Notably, LLaVA with our OrderChain also remarkably outperforms state-of-the-art methods by 27% on accuracy and 0.24 on MAE on the Adience dataset. To the best of our knowledge, OrderChain is the first work that augments MLLMs for OR tasks, and its effectiveness is demonstrated across a spectrum of OR datasets. Project Page: https://order-chain.github.io/.

[238] CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting

Lei Tian, Xiaomin Li, Liqian Ma, Hao Yin, Zirui Zheng, Hefei Huang, Taiqing Li, Huchuan Lu, Xu Jia

Main category: cs.CV

TL;DR: CCL-LGS is a novel framework addressing cross-view semantic inconsistencies in 3D reconstruction by integrating multi-view semantic cues, outperforming prior methods.

DetailsMotivation: Cross-view semantic inconsistencies in 3D reconstruction, caused by occlusion and view-dependent variations, degrade 3D semantic understanding.

Method: Uses a zero-shot tracker for mask alignment, CLIP for semantic encoding, and a Contrastive Codebook Learning (CCL) module to resolve conflicts.

Result: CCL-LGS outperforms state-of-the-art methods by ensuring view-consistent semantic supervision.

Conclusion: The framework effectively mitigates semantic inconsistencies while preserving discriminability, advancing 3D semantic understanding.

Abstract: Recent advances in 3D reconstruction techniques and vision-language models have fueled significant progress in 3D semantic understanding, a capability critical to robotics, autonomous driving, and virtual/augmented reality. However, methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies induced by occlusion, image blur, and view-dependent variations. These inconsistencies, when propagated via projection supervision, deteriorate the quality of 3D Gaussian semantic fields and introduce artifacts in the rendered outputs. To mitigate this limitation, we propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Specifically, our approach first employs a zero-shot tracker to align a set of SAM-generated 2D masks and reliably identify their corresponding categories. Next, we utilize CLIP to extract robust semantic encodings across views. Finally, our Contrastive Codebook Learning (CCL) module distills discriminative semantic features by enforcing intra-class compactness and inter-class distinctiveness. In contrast to previous methods that directly apply CLIP to imperfect masks, our framework explicitly resolves semantic conflicts while preserving category discriminability. Extensive experiments demonstrate that CCL-LGS outperforms previous state-of-the-art methods. Our project page is available at https://epsilontl.github.io/CCL-LGS/.
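
An InfoNCE-style sketch of the codebook objective enforcing intra-class compactness and inter-class distinctiveness (the temperature and exact form are assumptions):

```python
import torch.nn.functional as F

def contrastive_codebook_loss(feats, labels, codebook, tau=0.07):
    """Pull each semantic feature toward its own class code and away from
    the other codes. feats (N, D), labels (N,), codebook (C, D)."""
    feats = F.normalize(feats, dim=-1)
    codes = F.normalize(codebook, dim=-1)
    logits = feats @ codes.t() / tau   # (N, C) similarity to every code
    return F.cross_entropy(logits, labels)
```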

[239] Reinforcement Learning meets Masked Video Modeling : Trajectory-Guided Adaptive Token Selection

Ayush K. Rai, Kyle Min, Tarun Krishna, Feiyan Hu, Alan F. Smeaton, Noel E. O’Connor

Main category: cs.CV

TL;DR: The paper introduces TATS, a motion-aware token sampler for masked video modeling, enhancing MAE with adaptive masking and joint training via PPO, achieving efficient pre-training and strong performance on action recognition tasks.

DetailsMotivation: Existing masking strategies in MVM are either predefined or rely on external priors, lacking adaptability. The goal is to improve masking by dynamically selecting motion-centric tokens.

Method: Proposes TATS to model token motion dynamics, integrates it with MAE, and uses PPO for joint optimization. Evaluated on four benchmarks.

Result: Achieves aggressive masking without performance loss, demonstrating efficiency and effectiveness in action recognition.

Conclusion: TATS is a versatile, efficient solution for MVM, outperforming state-of-the-art methods in generalization and transferability.

Abstract: Masked video modeling (MVM) has emerged as a highly effective pre-training strategy for visual foundation models, whereby the model reconstructs masked spatiotemporal tokens using information from visible tokens. However, a key challenge in such approaches lies in selecting an appropriate masking strategy. Previous studies have explored predefined masking techniques, including random and tube-based masking, as well as approaches that leverage key motion priors such as optical flow and semantic cues from externally pre-trained models. In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO). We show that our model allows for aggressive masking without compromising performance on the downstream task of action recognition while also ensuring that the pre-training remains memory efficient. Extensive experiments with the proposed approach across four benchmarks, including Something-Something v2, Kinetics-400, UCF101, and HMDB51, demonstrate the effectiveness, transferability, generalization, and efficiency of our work compared to other state-of-the-art methods.
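
In the paper, TATS is a learned sampler optimized jointly with the MAE via PPO. As a hedged, much simpler stand-in for what "motion-centric token selection" means, the sketch below scores tokens by frame-difference magnitude and keeps only the most dynamic ones visible.

```python
import torch

# Toy motion-centric token selection: approximate "motion" with the
# norm of frame-to-frame token differences (the paper learns this
# instead of using a hand-crafted score).

def select_visible_tokens(video_tokens, keep_ratio=0.1):
    """video_tokens: (T, N, D) per-frame token embeddings; returns indices."""
    motion = (video_tokens[1:] - video_tokens[:-1]).norm(dim=-1)  # (T-1, N)
    score = motion.mean(dim=0)                  # per-token motion magnitude
    k = max(1, int(keep_ratio * score.numel()))
    return score.topk(k).indices                # the rest get masked
```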

[240] Data Pruning by Information Maximization

Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

Main category: cs.CV

TL;DR: InfoMax is a data pruning method that maximizes information content and minimizes redundancy in selected samples, formalized as a discrete quadratic programming task with an efficient solver.

DetailsMotivation: To enhance the informativeness of coresets by selecting samples with high importance and low redundancy, improving model learning efficiency.

Method: Uses importance scores for sample influence and pairwise similarities for redundancy, formalized as a DQP task with gradient-based solver and sparsification techniques.

Result: Demonstrates superior performance in tasks like image classification, vision-language pre-training, and LLM instruction tuning.

Conclusion: InfoMax effectively scales to large datasets and outperforms existing methods in data pruning tasks.

Abstract: In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models. Code is available at https://github.com/hrtan/InfoMax.
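
The abstract's objective, total importance minus pairwise redundancy within the coreset, can be made concrete with a toy solver. The paper uses an efficient gradient-based DQP solver with a sparsified similarity matrix; the greedy loop below is only our simplified illustration of the same trade-off.

```python
import numpy as np

# Greedy illustration of the InfoMax trade-off: repeatedly pick the
# sample whose importance most outweighs its redundancy with the
# samples already selected.

def greedy_infomax(importance, similarity, budget, lam=1.0):
    """importance: (N,) scores; similarity: (N, N) symmetric matrix."""
    gain = importance.astype(float)
    selected = []
    for _ in range(budget):
        i = int(np.argmax(gain))
        selected.append(i)
        gain[i] = -np.inf                  # never re-select
        gain = gain - lam * similarity[i]  # penalize items similar to i
    return selected
```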

[241] PiT: Progressive Diffusion Transformer

Jiafu Wu, Yabiao Wang, Jian Li, Jinlong Peng, Yun Cao, Chengjie Wang, Jiangning Zhang

Main category: cs.CV

TL;DR: The paper introduces Pseudo Shifted Window Attention (PSWA) and Progressive Coverage Channel Allocation (PCCA) to improve Diffusion Transformers (DiTs), reducing redundancy and computational costs while enhancing performance.

DetailsMotivation: DiTs face high computational costs and redundancy in global attention, and conventional attention mechanisms suffer from inefficiency.

Method: Proposes PSWA for balanced global-local modeling and PCCA for high-order attention. Introduces Pseudo Progressive Diffusion Transformer (PiT).

Result: PiT-L achieves 54% FID improvement over DiT-XL/2 with less computation.

Conclusion: The proposed innovations significantly enhance DiTs’ efficiency and performance.

Abstract: Diffusion Transformers (DiTs) achieve remarkable performance within image generation via the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic global modeling transformers, which face significant quadratic computational cost. However, through empirical analysis, we find that DiTs do not rely as heavily on global information as previously believed. In fact, most layers exhibit significant redundancy in global computation. Additionally, conventional attention mechanisms suffer from low-frequency inertia, limiting their efficiency. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global attention redundancy. PSWA achieves a moderate balance of global and local information through window attention. It further utilizes a high-frequency bridging branch to simulate shifted window operations, which both enriches the high-frequency information and strengthens inter-window connections. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy that captures high-order attention without additional computational cost. Based on these innovations, we propose a series of Pseudo Progressive Diffusion Transformers (PiT). Our extensive experiments show their superior performance; for example, our proposed PiT-L achieves a 54% FID improvement over DiT-XL/2 while using less computation.
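
PSWA builds on local window attention. The helper below shows the standard window-partitioning step such attention starts from (a common formulation, not the paper's code); the high-frequency bridging branch and PCCA are not reproduced here.

```python
import torch

# Standard window partitioning: reshape a feature map so attention can
# be computed independently within each ws x ws window.

def window_partition(x, ws):
    """x: (B, H, W, C) -> (num_windows * B, ws*ws, C) for local attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
```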

[242] Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

Xingguang Wei, Haomin Wang, Shenglong Ye, Ruifeng Luo, Yanting Zhang, Lixin Gu, Jifeng Dai, Yu Qiao, Wenhai Wang, Hongjie Zhang

Main category: cs.CV

TL;DR: VecFormer introduces a line-based representation for panoptic symbol spotting in CAD drawings, improving accuracy and computational efficiency, and achieves state-of-the-art results.

DetailsMotivation: Existing methods for panoptic symbol spotting in CAD drawings suffer from high computational costs, limited generality, and loss of geometric information.

Method: VecFormer uses line-based representation of primitives and a Branch Fusion Refinement module to integrate instance and semantic predictions.

Result: Achieves 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results.

Conclusion: Line-based representation shows strong potential for vector graphic understanding, offering accuracy and efficiency.

Abstract: We study the task of panoptic symbol spotting, which involves identifying both individual instances of countable things and the semantic regions of uncountable stuff in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation, but these approaches often suffer from high computational costs, limited generality, and loss of geometric structural information. In this paper, we propose VecFormer, a novel method that addresses these challenges through line-based representation of primitives. This design preserves the geometric continuity of the original primitive, enabling more accurate shape representation while maintaining a computation-friendly structure, making it well-suited for vector graphic understanding tasks. To further enhance prediction reliability, we introduce a Branch Fusion Refinement module that effectively integrates instance and semantic predictions, resolving their inconsistencies for more coherent panoptic outputs. Extensive experiments demonstrate that our method establishes a new state-of-the-art, achieving 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results under settings with and without prior information, respectively, highlighting the strong potential of line-based representation as a foundation for vector graphic understanding.

[243] Quantitative Comparison of Fine-Tuning Techniques for Pretrained Latent Diffusion Models in the Generation of Unseen SAR Images

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Olivier Lévêque, Elise Colin

Main category: cs.CV

TL;DR: A framework adapts a pretrained latent diffusion model for high-resolution SAR image generation, enabling controllable synthesis of rare scenes. It uses LoRA for parameter-efficient tuning and evaluates performance via statistical, textural, and semantic metrics.

DetailsMotivation: To leverage pretrained models for SAR image generation, avoiding the need for task-specific small models and enabling rare scene synthesis.

Method: Adapts a text-to-image foundation model to SAR using LoRA for tuning. Evaluates via statistical distances, GLCM descriptors, and SAR-specialized CLIP.

Result: Hybrid tuning (full UNet with LoRA on text encoders) best preserves SAR geometry and texture while maintaining prompt fidelity.

Conclusion: The framework supports text-based control and multimodal conditioning, useful for SAR data augmentation and scenario simulation.

Abstract: We present a framework for adapting a large pretrained latent diffusion model to high-resolution Synthetic Aperture Radar (SAR) image generation. The approach enables controllable synthesis and the creation of rare or out-of-distribution scenes beyond the training set. Rather than training a task-specific small model from scratch, we adapt an open-source text-to-image foundation model to the SAR modality, using its semantic prior to align prompts with SAR imaging physics (side-looking geometry, slant-range projection, and coherent speckle with heavy-tailed statistics). Using a 100k-image SAR dataset, we compare full fine-tuning and parameter-efficient Low-Rank Adaptation (LoRA) across the UNet diffusion backbone, the Variational Autoencoder (VAE), and the text encoders. Evaluation combines (i) statistical distances to real SAR amplitude distributions, (ii) textural similarity via Gray-Level Co-occurrence Matrix (GLCM) descriptors, and (iii) semantic alignment using a SAR-specialized CLIP model. Our results show that a hybrid strategy (full UNet tuning with LoRA on the text encoders and a learned token embedding) best preserves SAR geometry and texture while maintaining prompt fidelity. The framework supports text-based control and multimodal conditioning (e.g., segmentation maps, TerraSAR-X, or optical guidance), opening new paths for large-scale SAR scene data augmentation and unseen scenario simulation in Earth observation.
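
For readers unfamiliar with the parameter-efficient tuning compared here, the sketch below shows the generic LoRA formulation: a frozen pretrained linear layer augmented with a trainable low-rank update scaled by alpha / r. This is the standard recipe, not code from the paper.

```python
import torch
import torch.nn as nn

# Generic LoRA adapter: y = W x + (alpha / r) * B A x, with W frozen
# and only the low-rank factors A, B trained.

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False             # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale
```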

[244] Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation

Chao Yin, Hao Li, Kequan Yang, Jide Li, Pinpin Zhu, Xiaoqiang Li

Main category: cs.CV

TL;DR: RDVP-MSD is a training-free framework combining RDVP and MSD-CoT to improve camouflaged object segmentation by addressing semantic ambiguity and discrepancy.

DetailsMotivation: Current promptable segmentation methods struggle with semantic ambiguity and spatial separation in COS, requiring manual prompts for each object.

Method: Proposes RDVP-MSD, integrating Region-constrained Dual-stream Visual Prompting (RDVP) and Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT) to refine prompts.

Result: Achieves state-of-the-art segmentation on COS benchmarks with faster inference and no training needed.

Conclusion: RDVP-MSD effectively addresses key issues in COS, offering improved accuracy and efficiency without supervision.

Abstract: While promptable segmentation (e.g., SAM) has shown promise for various segmentation tasks, it still requires manual visual prompts for each object to be segmented. In contrast, task-generic promptable segmentation aims to reduce the need for such detailed prompts by employing only a task-generic prompt to guide segmentation across all test samples. However, when applied to Camouflaged Object Segmentation (COS), current methods still face two critical issues: 1) semantic ambiguity in getting instance-specific text prompts, which arises from insufficient discriminative cues in holistic captions, leading to foreground-background confusion; 2) semantic discrepancy combined with spatial separation in getting instance-specific visual prompts, which results from global background sampling far from object boundaries with low feature correlation, causing SAM to segment irrelevant regions. To address the issues above, we propose RDVP-MSD, a novel training-free test-time adaptation framework that synergizes Region-constrained Dual-stream Visual Prompting (RDVP) via a Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT). MSD-CoT progressively disentangles image captions to eliminate semantic ambiguity, while RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy and spatial separation. Without requiring any training or supervision, RDVP-MSD achieves state-of-the-art segmentation results on multiple COS benchmarks and delivers a faster inference speed than previous methods, demonstrating significantly improved accuracy and efficiency. The code will be available at https://github.com/ycyinchao/RDVP-MSD

[245] Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability

Jongoh Jeong, Hunmin Yang, Jaeseok Jeong, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: The paper introduces a semantic structure-aware attack framework using Mean Teacher to enhance adversarial transferability by leveraging under-exploited semantic features in generative models.

DetailsMotivation: Existing generative adversarial attacks underutilize semantic information in intermediate activations, limiting perturbation alignment with object-salient regions critical for transferability.

Method: Proposes a framework based on Mean Teacher for semantic consistency via feature distillation, anchoring perturbation synthesis to semantically salient early intermediate blocks.

Result: Demonstrates consistent improvements over state-of-the-art generative attacks across diverse models, domains, and tasks.

Conclusion: The method effectively enhances adversarial transferability by better utilizing semantic features, validated by conventional metrics and a new Accidental Correction Rate (ACR).

Abstract: Generative adversarial attacks train a perturbation generator on a white-box surrogate model and subsequently apply the crafted perturbations to unseen black-box victim models. In contrast to iterative attacks, these methods deliver superior inference-time efficiency, scalability, and transferability; however, up until now, existing studies have not fully exploited the representational capacity of generative models to preserve and harness semantic information. Specifically, the intermediate activations of the generator encode rich semantic features, such as object boundaries and coarse shapes, that remain under-exploited, thereby limiting the alignment of perturbations with object-salient regions which are critical for adversarial transferability. To remedy this, we introduce a semantic structure-aware attack framework based on the Mean Teacher, which serves as a temporally smoothed feature reference. With this smoothed reference, we further direct semantic consistency between the early-layer activations in the student and those of the semantically rich teacher by feature distillation. By anchoring perturbation synthesis to the semantically salient early intermediate blocks within the generator based on empirical findings, our method guides progressive adversarial perturbation on regions that substantially enhance adversarial transferability. We conduct extensive experiments over diverse models, domains and tasks to demonstrate consistent improvements relative to state-of-the-art generative attacks, comprehensively evaluated using conventional metrics and our newly proposed Accidental Correction Rate (ACR).
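
The "temporally smoothed feature reference" comes from the standard Mean Teacher construction: the teacher's weights are an exponential moving average of the student's. A minimal sketch of that update (the general recipe; the decay value is our choice), typically called after each optimizer step:

```python
import torch

# Standard Mean Teacher EMA update: teacher parameters track a slow
# moving average of student parameters.

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)
```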

[246] TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading

Byung Hoon Lee, Wooseok Shin, Sung Won Han

Main category: cs.CV

TL;DR: The paper proposes TD3Net, a backend architecture for word-level lipreading that combines dense skip connections and multi-dilated temporal convolutions to address blind spots in the receptive field, achieving performance comparable to the state of the art with fewer parameters and lower computational cost.

DetailsMotivation: Current temporal convolutional networks (TCNs) in lipreading suffer from blind spots in the receptive field, leading to information loss about continuous lip movements.

Method: TD3Net integrates dense skip connections and multi-dilated temporal convolutions to ensure a wide, dense receptive field without blind spots.

Result: TD3Net outperforms existing TCN-based methods in accuracy while using fewer parameters and lower computational resources, as demonstrated on LRW and LRW-1000 datasets.

Conclusion: TD3Net effectively models complex temporal representations in lipreading, preserving temporal continuity and offering practical advantages for real-world applications.

Abstract: The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems. The code is available at our GitHub repository (https://github.com/Leebh-kor/TD3Net).
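
The key architectural idea, parallel temporal convolutions with different dilation factors so the combined receptive field has no blind spots, can be sketched as below. The channel sizes, dilation set, and dense-connection details are our assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Sketch of a multi-dilated temporal block in the spirit of TD3Net:
# parallel dilated 1D convolutions whose receptive fields interleave,
# covering the sequence densely.

class MultiDilatedTemporalBlock(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        ])

    def forward(self, x):                  # x: (B, C, T)
        # Concatenating branch outputs mimics dense feature reuse; the
        # result has C * len(dilations) channels.
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```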

[247] GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yanzi Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang

Main category: cs.CV

TL;DR: GLM-4.1V-Thinking and GLM-4.5V are advanced vision-language models (VLMs) achieving state-of-the-art performance on diverse tasks through a reasoning-centric training framework and Reinforcement Learning with Curriculum Sampling (RLCS).

DetailsMotivation: To advance general-purpose multimodal understanding and reasoning by developing highly capable vision-language models.

Method: Large-scale pre-training followed by RLCS to enhance capabilities across tasks like STEM problem solving, video understanding, and coding.

Result: GLM-4.5V outperforms open-source models and competes with closed-source models like Gemini-2.5-Flash. GLM-4.1V-9B-Thinking surpasses larger models like Qwen2.5-VL-72B on 29 benchmarks.

Conclusion: The models demonstrate superior performance and are open-sourced, contributing to the field of multimodal AI.

Abstract: We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving results superior to those of the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. Code, models and more information are released at https://github.com/zai-org/GLM-V.

[248] VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab

Main category: cs.CV

TL;DR: A novel memory module, Surfel-Indexed View Memory (VMem), improves video generation by efficiently retrieving relevant past views, ensuring long-term scene coherence and reducing computational costs.

DetailsMotivation: Address limitations of previous methods, such as error accumulation in 3D reconstruction and short context windows in video generators, to achieve consistent scene exploration.

Method: Introduces VMem, which indexes past views geometrically using 3D surface elements (surfels) for efficient retrieval during new view generation.

Result: Outperforms existing methods in maintaining scene coherence and camera control on long-term scene synthesis benchmarks.

Conclusion: VMem offers a computationally efficient solution for consistent and interactive environment exploration in video generation.

Abstract: We propose a novel memory module for building video generators capable of interactively exploring environments. Previous approaches have achieved similar results either by out-painting 2D views of a scene while incrementally reconstructing its 3D geometry, which quickly accumulates errors, or by using video generators with a short context window, which struggle to maintain scene coherence over the long term. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), a memory module that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables efficient retrieval of the most relevant past views when generating new ones. By focusing only on these relevant views, our method produces consistent explorations of imagined environments at a fraction of the computational cost required to use all past views as context. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
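
A toy version of surfel-indexed retrieval clarifies the mechanism: each past view records which surfels it observed, and new-view generation conditions on the past views sharing the most surfels with the novel camera. The data structures below are our assumptions for illustration.

```python
# `view_to_surfels` maps each past view id to the set of surfel ids it
# observed; `visible_surfels` is the set of surfel ids visible from the
# new camera pose.

def retrieve_views(visible_surfels, view_to_surfels, top_k=4):
    overlap = {
        view: len(visible_surfels & surfels)
        for view, surfels in view_to_surfels.items()
    }
    ranked = sorted(overlap, key=overlap.get, reverse=True)
    return ranked[:top_k]                  # most relevant past views
```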

[249] EXAONE Path 2.0: Pathology Foundation Model with End-to-End Supervision

Myeongjang Pyeon, Janghyeon Lee, Minsoo Lee, Juseung Yun, Hwanil Choi, Jonghyun Kim, Jiwon Kim, Yi Hu, Jongseong Jang, Soonyoung Lee

Main category: cs.CV

TL;DR: EXAONE Path 2.0 improves biomarker prediction in digital pathology by using slide-level supervision, achieving state-of-the-art results with high data efficiency.

DetailsMotivation: Patch-level self-supervised learning (SSL) overlooks domain-specific features and is less data-efficient for biomarker prediction in WSIs.

Method: EXAONE Path 2.0 learns patch-level representations under direct slide-level supervision, using only 37k WSIs.

Result: Achieves state-of-the-art performance across 10 biomarker prediction tasks with remarkable data efficiency.

Conclusion: Slide-level supervision enhances feature learning and data efficiency in digital pathology.

Abstract: In digital pathology, whole-slide images (WSIs) are often difficult to handle due to their gigapixel scale, so most approaches train patch encoders via self-supervised learning (SSL) and then aggregate the patch-level embeddings via multiple instance learning (MIL) or slide encoders for downstream tasks. However, patch-level SSL may overlook complex domain-specific features that are essential for biomarker prediction, such as mutation status and molecular characteristics, as SSL methods rely only on basic augmentations selected for natural-image domains and applied to small patch-level areas. Moreover, SSL methods remain less data efficient than fully supervised approaches, requiring extensive computational resources and datasets to achieve competitive performance. To address these limitations, we present EXAONE Path 2.0, a pathology foundation model that learns patch-level representations under direct slide-level supervision. Using only 37k WSIs for training, EXAONE Path 2.0 achieves state-of-the-art average performance across 10 biomarker prediction tasks, demonstrating remarkable data efficiency.

[250] Deblurring in the Wild: A Real-World Dataset from Smartphone High-Speed Videos

Mahdi Mohd Hossain Noki, Syed Mumtahin Mahmud, Prothito Shovon Majumder, Abdul Mohaimen Al Radi, Sudipto Das Sukanto, Afia Lubaina, Md. Mosaddek Khan

Main category: cs.CV

TL;DR: A new large-scale dataset for image deblurring is introduced, created from smartphone slow-motion videos, with 42,000 high-resolution blur-sharp pairs. It challenges existing SOTA models, showing performance drops due to its complexity and diversity.

DetailsMotivation: To address the lack of large, diverse, and realistic datasets for image deblurring, which limits the development of robust models.

Method: Constructed a dataset by averaging frames from slow-motion videos to simulate blur, using the centered frame as a sharp reference.

Result: The dataset is 10x larger and 8x more diverse than existing ones, causing significant performance degradation in SOTA models.

Conclusion: The dataset provides a challenging benchmark to advance the development of generalizable deblurring models.

Abstract: We introduce the largest real-world image deblurring dataset constructed from smartphone slow-motion videos. Using 240 frames captured over one second, we simulate realistic long-exposure blur by averaging frames to produce blurry images, while using the temporally centered frame as the sharp reference. Our dataset contains over 42,000 high-resolution blur-sharp image pairs, making it approximately 10 times larger than widely used datasets and giving it 8 times as many distinct scenes, spanning indoor and outdoor environments with varying object and camera motions. We benchmark multiple state-of-the-art (SOTA) deblurring models on our dataset and observe significant performance degradation, highlighting the complexity and diversity of our benchmark. Our dataset serves as a challenging new benchmark to facilitate the development of robust and generalizable deblurring models.
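
The dataset construction recipe in the abstract is simple enough to sketch directly: average a one-second window of slow-motion frames to synthesize long-exposure blur, and keep the temporally centered frame as the sharp ground truth. A minimal sketch, assuming frames arrive as a uint8 array:

```python
import numpy as np

# Blur synthesis as described in the abstract: frame averaging for the
# blurry image, centered frame for the sharp reference.

def make_blur_sharp_pair(frames):
    """frames: (T, H, W, 3) uint8 array from a slow-motion clip."""
    blurry = frames.astype(np.float32).mean(axis=0).astype(np.uint8)
    sharp = frames[len(frames) // 2]       # temporally centered frame as GT
    return blurry, sharp
```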

[251] MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models

Jian Chen, Wenye Ma, Penghang Liu, Wei Wang, Tengwei Song, Ming Li, Chenguang Wang, Jiayu Qin, Ruiyi Zhang, Changyou Chen

Main category: cs.CV

TL;DR: The paper introduces MusiXQA, a dataset for evaluating MLLMs in music sheet understanding, and presents Phi-3-MusiX, a fine-tuned model outperforming GPT-based methods.

DetailsMotivation: Current MLLMs lack exploration in interpreting music sheets, prompting the need for a dedicated dataset and model.

Method: Created MusiXQA with synthetic music sheets and structured annotations, then fine-tuned Phi-3-MusiX on this dataset.

Result: Revealed limitations of existing MLLMs in music sheet understanding, with Phi-3-MusiX showing superior performance.

Conclusion: MusiXQA and Phi-3-MusiX provide a foundation for advancing MLLMs in music sheet interpretation.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities in natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we developed Phi-3-MusiX, an MLLM fine-tuned on our dataset, achieving significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.

[252] Warehouse Spatial Question Answering with LLM Agent

Hsiang-Wei Huang, Jen-Hao Cheng, Kuang-Ming Chen, Cheng-Yen Yang, Bahaa Alattar, Yi-Ru Lin, Pyongkun Kim, Sangwon Kim, Kwangju Kim, Chung-I Huang, Jenq-Neng Hwang

Main category: cs.CV

TL;DR: A data-efficient LLM agent system enhances spatial reasoning for complex indoor warehouse tasks, outperforming previous methods.

DetailsMotivation: Existing MLLMs struggle with spatial understanding; this work aims to improve efficiency and accuracy in spatial reasoning.

Method: Proposes an LLM agent system with advanced spatial reasoning and API tool integration for complex spatial tasks.

Result: Achieves high accuracy and efficiency in object retrieval, counting, and distance estimation on the AI City Challenge dataset.

Conclusion: The system demonstrates superior performance in spatial reasoning tasks, offering a practical solution for warehouse scenarios.

Abstract: Spatial understanding has been a challenging task for existing Multi-modal Large Language Models~(MLLMs). Previous methods leverage large-scale MLLM finetuning to enhance MLLM’s spatial understanding ability. In this paper, we present a data-efficient approach. We propose a LLM agent system with strong and advanced spatial reasoning ability, which can be used to solve the challenging spatial question answering task in complex indoor warehouse scenarios. Our system integrates multiple tools that allow the LLM agent to conduct spatial reasoning and API tools interaction to answer the given complicated spatial question. Extensive evaluations on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset demonstrate that our system achieves high accuracy and efficiency in tasks such as object retrieval, counting, and distance estimation. The code is available at: https://github.com/hsiangwei0903/SpatialAgent

[253] Common Data Properties Limit Object-Attribute Binding in CLIP

Bijay Gurung, David T. Hoffmann, Thomas Brox

Main category: cs.CV

TL;DR: The paper investigates why CLIP models struggle with binding (e.g., distinguishing object-attribute pairs) and identifies data properties like low attribute density and saliency bias as key issues. Scaling batch size or adding hard negatives doesn’t help; only data with specific properties enables reliable binding.

DetailsMotivation: CLIP models fail at binding tasks (e.g., distinguishing object-attribute pairs), and previous fixes like hard negatives or architectural changes are insufficient. The paper explores whether data properties are the root cause.

Method: The authors use a synthetic dataset to rigorously analyze how data properties (e.g., attribute density, caption completeness, saliency bias) affect CLIP’s binding performance.

Result: Common natural data properties hinder binding. Neither scaling batch size nor adding hard negatives improves performance. Only data with specific properties enables reliable binding.

Conclusion: Data properties are critical for CLIP’s binding ability. Future work should focus on curating or generating data with these properties to improve binding performance.

Abstract: Contrastive vision-language models like CLIP are used for a large variety of applications, such as zero-shot classification or as vision encoder for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image is of "a yellow submarine and a blue bus" or "a blue submarine and a yellow bus". Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights to solve the binding problem for CLIP are hidden in arguably the most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP’s ability to learn binding using a synthetic dataset. We find that common properties of natural data such as low attribute density, incomplete captions, and the saliency bias, a tendency of human captioners to describe the object that is "most salient" to them, have a detrimental effect on binding performance. In contrast to common belief, we find that neither scaling the batch size, i.e., implicitly adding more hard negatives, nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified data properties does CLIP learn almost perfect binding.
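
The binding failure is easy to probe with any public CLIP checkpoint: a bag-of-words representation would score an image nearly equally against a caption and its attribute-swapped variant. A sketch using the Hugging Face transformers API (the checkpoint name is the standard public release, not necessarily the one studied in the paper):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def caption_scores(image, captions):
    """Return a probability over captions for one PIL image."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.softmax(dim=-1)

# caption_scores(img, ["a yellow submarine and a blue bus",
#                      "a blue submarine and a yellow bus"])
```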

[254] M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation

Kui Jiang, Shiyu Liu, Junjun Jiang, Hongxun Yao, Xiaopeng Fan

Main category: cs.CV

TL;DR: M2DAO-Talker introduces a unified framework for audio-driven talking head generation, improving motion modeling and rendering quality via multi-granular motion decoupling and alternating optimization.

DetailsMotivation: Existing 3D methods for talking head generation suffer from rendering artifacts like motion blur and local penetration due to unstable motion field representation.

Method: The framework involves video preprocessing, motion representation, and rendering reconstruction, with a focus on multi-granular motion decoupling and alternating optimization.

Result: M2DAO-Talker achieves state-of-the-art performance, with significant improvements in PSNR (2.43 dB) and user-evaluated realness (0.64 gain) over TalkingGaussian, while maintaining 150 FPS inference speed.

Conclusion: The proposed method effectively addresses rendering artifacts and enhances realism in talking head generation, demonstrating superior performance and efficiency.

Abstract: Audio-driven talking head generation holds significant potential for film production. While existing 3D methods have advanced motion modeling and content synthesis, they often produce rendering artifacts, such as motion blur, temporal jitter, and local penetration, due to limitations in representing stable, fine-grained motion fields. Through systematic analysis, we reformulate talking head generation into a unified framework comprising three steps: video preprocessing, motion representation, and rendering reconstruction. This framework underpins our proposed M2DAO-Talker, which addresses current limitations via multi-granular motion decoupling and alternating optimization. Specifically, we devise a novel 2D portrait preprocessing pipeline to extract frame-wise deformation control conditions (motion region segmentation masks and camera parameters) to facilitate motion representation. To ameliorate motion modeling, we elaborate a multi-granular motion decoupling strategy, which independently models non-rigid (oral and facial) and rigid (head) motions for improved reconstruction accuracy. Meanwhile, a motion consistency constraint is developed to ensure head-torso kinematic consistency, thereby mitigating penetration artifacts caused by motion aliasing. In addition, an alternating optimization strategy is designed to iteratively refine facial and oral motion parameters, enabling more realistic video generation. Experiments across multiple datasets show that M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness versus TalkingGaussian, while running at 150 FPS inference speed. Our project homepage is https://m2dao-talker.github.io/M2DAO-Talk.github.io.

[255] Exploring the Application of Visual Question Answering (VQA) for Classroom Activity Monitoring

Sinh Trong Vu, Hieu Trung Pham, Dung Manh Nguyen, Hieu Minh Hoang, Nhu Hoang Le, Thu Ha Pham, Tai Tan Mai

Main category: cs.CV

TL;DR: The paper explores using state-of-the-art VQA models (LLaMA2, LLaMA3, QWEN3, NVILA) for classroom behavior analysis, introducing the BAV-Classroom-VQA dataset. Results show promising performance for automated classroom monitoring.

DetailsMotivation: Classroom behavior monitoring is crucial for student engagement and learning outcomes, and VQA models offer automated tools for analyzing classroom interactions.

Method: The study evaluates open-source VQA models on the BAV-Classroom-VQA dataset, detailing data collection, annotation, and benchmarking.

Result: All four VQA models perform well in answering behavior-related visual questions, indicating potential for classroom analytics.

Conclusion: VQA models are promising for automating classroom behavior analysis, with implications for future educational interventions.

Abstract: Classroom behavior monitoring is a critical aspect of educational research, with significant implications for student engagement and learning outcomes. Recent advancements in Visual Question Answering (VQA) models offer promising tools for automatically analyzing complex classroom interactions from video recordings. In this paper, we investigate the applicability of several state-of-the-art open-source VQA models, including LLaMA2, LLaMA3, QWEN3, and NVILA, in the context of classroom behavior analysis. To facilitate rigorous evaluation, we introduce our BAV-Classroom-VQA dataset derived from real-world classroom video recordings at the Banking Academy of Vietnam. We present the methodology for data collection, annotation, and benchmark the performance of the selected VQA models on this dataset. Our initial experimental results demonstrate that all four models achieve promising performance levels in answering behavior-related visual questions, showcasing their potential in future classroom analytics and intervention systems.

[256] Hierarchical Cross-modal Prompt Learning for Vision-Language Models

Hao Zheng, Shunzhi Yang, Zhuoxin He, Jinfeng Yang, Zhenhua Huang

Main category: cs.CV

TL;DR: HiCroPL, a hierarchical cross-modal prompt learning framework, addresses modality isolation and semantic decay in VLMs by enabling bidirectional knowledge flow between text and vision modalities, achieving state-of-the-art results.

DetailsMotivation: Adapting large-scale VLMs like CLIP to downstream tasks without losing generalization is challenging due to modality isolation and hierarchical semantic decay.

Method: HiCroPL establishes bidirectional knowledge flow using a hierarchical knowledge mapper and lightweight layer-specific proxies, refining semantics between text and vision modalities.

Result: HiCroPL outperforms existing methods, achieving state-of-the-art results on 11 benchmarks across four tasks.

Conclusion: HiCroPL effectively enhances generalization in VLMs by addressing key bottlenecks, demonstrating superior performance in cross-modal tasks.

Abstract: Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multi-scales to be fused, ensuring that deeper representations retain transferable shallow semantics thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL’s superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Code is available at: https://github.com/zzeoZheng/HiCroPL.

[257] Motion Matters: Motion-guided Modulation Network for Skeleton-based Micro-Action Recognition

Jihao Gu, Kun Li, Fei Wang, Yanyan Wei, Zhiliang Wu, Hehe Fan, Meng Wang

Main category: cs.CV

TL;DR: The paper introduces a Motion-guided Modulation Network (MMN) to improve Micro-Action Recognition by capturing subtle motion cues, achieving state-of-the-art results.

DetailsMotivation: Existing methods overlook subtle changes in Micro-Actions (MAs), limiting recognition accuracy.

Method: MMN includes Motion-guided Skeletal Modulation (MSM) for skeletal-level motion cues and Motion-guided Temporal Modulation (MTM) for frame-level motion patterns, with a motion consistency learning strategy.

Result: MMN outperforms on Micro-Action 52 and iMiGUE datasets, proving the value of modeling subtle motion cues.

Conclusion: Explicitly modeling subtle motion cues enhances Micro-Action Recognition, as demonstrated by MMN’s performance.

Abstract: Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at https://github.com/momiji-bit/MMN.
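
The abstract describes motion cues acting as a control signal that modulates spatial features. One common realization of such modulation is FiLM-style scale-and-shift, sketched below; the paper's exact modulation form, the shapes, and where the motion features come from (e.g., frame differences of joint embeddings) are assumptions here, not its specification.

```python
import torch.nn as nn

# FiLM-style modulation: motion features produce per-channel scale and
# shift terms that condition the spatial representation.

class MotionModulation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, feat, motion):       # feat, motion: (B, T, J, D)
        return feat * (1 + self.to_scale(motion)) + self.to_shift(motion)
```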

[258] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang

Main category: cs.CV

TL;DR: SIFThinker is a spatially-aware framework for MLLMs that improves visual tasks by mimicking human perception, using attention correction and spatial cues.

DetailsMotivation: Current MLLMs struggle with complex visual tasks like spatial understanding and fine-grained perception, lacking iterative refinement of focus on relevant regions.

Method: SIFThinker integrates depth-enhanced bounding boxes and natural language for attention correction. It uses a reverse-expansion-forward-inference strategy and GRPO-SIF, a reinforced training paradigm.

Result: SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained perception while maintaining general capabilities.

Conclusion: The framework effectively addresses MLLM challenges in visual tasks, demonstrating superior performance and generalizability.

Abstract: Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning, however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware “think-with-images” framework that mimics human visual perception. Specifically, SIFThinker enables attention correcting and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Besides, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method. Code: https://github.com/zhangquanchen/SIFThinker.

[259] A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks

Hang Su, Yunlong Feng, Daniel Gehrig, Panfeng Jiang, Ling Gao, Xavier Lagorce, Laurent Kneip

Main category: cs.CV

TL;DR: A unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, applicable to various camera types.

DetailsMotivation: Existing algorithms like the 5-point or 8-point methods are limited to synchronized views, failing for asynchronous data from rolling shutter or event cameras.

Method: Formulates the problem using first-order dynamics and a constant velocity motion model, deriving a linear point incidence relation for efficient recovery of velocity and 3D points.

Result: Validated on simulated and real-world data, showing consistent improvement over recent approaches across all camera modalities.

Conclusion: The work enables efficient structure and motion estimation from asynchronous data, with potential applications in diverse sensing modalities.

Abstract: Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the familiar 5-point or 8-point algorithm. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views each one representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared to recent approaches. We believe our work opens the door to efficient structure and motion estimation from asynchronous data. Code can be found at https://github.com/suhang99/AsyncTrack-Motion-Solver.
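
To see why a constant-velocity model yields a linear incidence relation, consider the simplest instructive case, stated under our own simplifying assumptions (a known, here identity, rotation and purely translational motion); this is an illustration of the flavor of constraint, not necessarily the paper's exact parameterization:

```latex
% Camera center at time t: c(t) = t v. A 3D point X observed as the
% bearing x_i at timestamp t_i with unknown depth lambda_i satisfies
\lambda_i \mathbf{x}_i = \mathbf{X} - t_i \mathbf{v}
% Eliminating lambda_i via the cross product gives, per observation,
% two independent equations linear in the six unknowns (X, v):
[\mathbf{x}_i]_\times \, (\mathbf{X} - t_i \mathbf{v}) = \mathbf{0}
```

Stacking these constraints over observations with arbitrary timestamps gives an overdetermined linear system solvable in closed form, which is the kind of structure the abstract's linear point incidence relation exploits.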

[260] iSafetyBench: A video-language benchmark for safety in industrial environment

Raiyaan Abdullah, Yogesh Singh Rawat, Shruti Vyas

Main category: cs.CV

TL;DR: The paper introduces iSafetyBench, a benchmark for evaluating vision-language models in industrial settings, focusing on routine and hazardous actions. It reveals gaps in model performance, especially for safety-critical scenarios.

DetailsMotivation: To address the underexplored capabilities of VLMs in high-stakes industrial domains, where recognizing both routine and hazardous actions is crucial.

Method: The authors created iSafetyBench, a dataset of 1,100 real-world industrial video clips annotated with multi-label action tags (98 routine and 67 hazardous categories) and paired with multiple-choice questions. They evaluated eight state-of-the-art VLMs under zero-shot conditions.

Result: Despite strong performance on existing benchmarks, the models struggled with iSafetyBench, particularly in hazardous activity recognition and multi-label scenarios, revealing significant performance gaps.

Conclusion: The study highlights the need for more robust, safety-aware multimodal models in industrial applications and introduces iSafetyBench as a pioneering testbed for future research.

Abstract: Recent advances in vision-language models (VLMs) have enabled impressive generalization across diverse video understanding tasks under zero-shot settings. However, their capabilities in high-stakes industrial domains, where recognizing both routine operations and safety-critical anomalies is essential, remain largely underexplored. To address this gap, we introduce iSafetyBench, a new video-language benchmark specifically designed to evaluate model performance in industrial environments across both normal and hazardous scenarios. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings, annotated with open-vocabulary, multi-label action tags spanning 98 routine and 67 hazardous action categories. Each clip is paired with multiple-choice questions for both single-label and multi-label evaluation, enabling fine-grained assessment of VLMs in both standard and safety-critical contexts. We evaluate eight state-of-the-art video-language models under zero-shot conditions. Despite their strong performance on existing video benchmarks, these models struggle with iSafetyBench, particularly in recognizing hazardous activities and in multi-label scenarios. Our results reveal significant performance gaps, underscoring the need for more robust, safety-aware multimodal models for industrial applications. iSafetyBench provides a first-of-its-kind testbed to drive progress in this direction. The dataset is available at: https://github.com/iSafetyBench/data.

[261] PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation

Zonglei Jing, Xiao Yang, Xiaoqian Li, Siyuan Liang, Aishan Liu, Mingchuan Zhang, Xianglong Liu

Main category: cs.CV

TL;DR: PromptSafe is a gated prompt tuning framework for text-to-image models to prevent NSFW content, combining lightweight text-only training with adaptive inference-time control.

DetailsMotivation: Current moderation methods for T2I models are costly, degrade benign image quality, and lack adaptability to nuanced safety needs.

Method: Uses an LLM to rewrite unsafe prompts into safe alternatives, trains a universal soft prompt, and employs a gated control network for adaptive defense.

Result: Achieves a 2.36% unsafe generation rate while preserving benign fidelity, with strong generalization and robustness.

Conclusion: PromptSafe offers a practical, scalable solution for safe T2I deployment.

Abstract: Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. However, this results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.
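
A hedged sketch of the gated defense idea: a universal soft prompt is scaled by an estimated prompt-risk gate before being prepended to the text embeddings, so benign prompts receive a weak defense and risky prompts a strong one. The module shapes, the mean-pooling, and the gate design below are our assumptions.

```python
import torch
import torch.nn as nn

# Gated soft prompt: defensive embeddings whose strength adapts to an
# estimated toxicity/risk score of the input prompt.

class GatedSoftPrompt(nn.Module):
    def __init__(self, dim, n_tokens=8):
        super().__init__()
        self.soft = nn.Parameter(torch.zeros(n_tokens, dim))
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, text_emb):                     # text_emb: (B, L, D)
        risk = self.gate(text_emb.mean(dim=1))       # (B, 1) risk estimate
        prompt = self.soft.unsqueeze(0) * risk.unsqueeze(-1)  # (B, N, D)
        return torch.cat([prompt, text_emb], dim=1)  # defense scales w/ risk
```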

[262] Yan: Foundational Interactive Video Generation

Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, Wei Yang, Wenkai Lv, Yangbin Yu, Yewen Wang, Yonghang Guan, Zhihao Hu, Zhongbin Fang, Zhongqian Sun

Main category: cs.CV

TL;DR: Yan is a framework for interactive video generation, combining simulation, multi-modal generation, and multi-granularity editing into a unified pipeline.

DetailsMotivation: To advance interactive video generation by integrating simulation, generation, and editing into a cohesive system for creative applications.

Method: Uses a 3D-VAE for simulation, hierarchical autoregressive captioning for multi-modal generation, and a hybrid model for disentangling mechanics and rendering for editing.

Result: Achieves real-time 1080P/60FPS simulation, flexible domain blending, and text-based video editing during interaction.

Conclusion: Yan unifies interactive video generation capabilities, enabling a comprehensive AI-driven creation paradigm for future creative tools and media.

Abstract: We present Yan, a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. Specifically, Yan comprises three core modules. AAA-level Simulation: We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. Multi-Modal Generation: We introduce a hierarchical autoregressive caption method that injects game-specific knowledge into open-domain multi-modal video diffusion models (VDMs) and then transforms the VDM into a frame-wise, action-controllable, real-time infinite interactive video generator. Notably, when the textual and visual prompts are sourced from different domains, the model demonstrates strong generalization, allowing it to blend and compose the style and mechanics across domains flexibly according to user prompts. Multi-Granularity Editing: We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text. Collectively, Yan offers an integration of these modules, pushing interactive video generation beyond isolated capabilities toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment. The project page is: https://greatx3.github.io/Yan/.

[263] MedVLThinker: Simple Baselines for Multimodal Medical Reasoning

Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou

Main category: cs.CV

TL;DR: MedVLThinker introduces open, reproducible methods for building reasoning-centric medical LMMs, outperforming SFT with RLVR and achieving SOTA results.

DetailsMotivation: The lack of open recipes for medical reasoning models hinders research and comparison, prompting the need for a transparent framework.

Method: Systematic data curation and two training paradigms: SFT on reasoning traces and RLVR based on answer correctness.

Result: RLVR outperforms SFT, and text-only data boosts performance more than multimodal data. The 7B model achieves SOTA, and scaling to 32B matches GPT-4o.

Conclusion: MedVLThinker provides a strong, open foundation for future medical reasoning research, with released data, models, and code.

Abstract: Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to “think before responding” via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
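The RLVR paradigm hinges on a reward that is verifiable from the final answer alone. Below is a minimal sketch of such a reward for multiple-choice QA, assuming the model terminates its reasoning with an "Answer: <letter>" line (the extraction convention is an assumption, not the paper's code):

```python
def rlvr_reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the extracted final answer matches gold.

    Assumes the model ends its chain of thought with 'Answer: <letter>'.
    """
    tail = completion.rsplit("Answer:", 1)[-1].strip().upper()
    return 1.0 if tail[:1] == gold.strip().upper() else 0.0
```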

[264] Unifying Locality of KANs and Feature Drift Compensation Projection for Data-free Replay based Continual Face Forgery Detection

Tianshuo Zhang, Siran Peng, Li Gao, Haoyuan Zhang, Xiangyu Zhu, Zhen Lei

Main category: cs.CV

TL;DR: The paper proposes a KAN-based Continual Face Forgery Detection (KAN-CFD) framework to address catastrophic forgetting in face forgery detection, using Domain-Group KAN Detector (DG-KD) and Feature Separation strategy (FS-KDCP).

DetailsMotivation: Face forgery detectors degrade performance on older forgery types when learning new ones (catastrophic forgetting). Kolmogorov-Arnold Networks (KANs) offer local plasticity but struggle with high-dimensional images and overlapping domains.

Method: Introduces KAN-CFD with DG-KD for high-dimensional image compatibility and FS-KDCP to prevent input space overlap without prior task data.

Result: The method achieves superior performance and reduces forgetting in experiments.

Conclusion: KAN-CFD effectively addresses catastrophic forgetting in continual face forgery detection, improving adaptability and performance.

Abstract: The rapid advancements in face forgery techniques necessitate that detectors continuously adapt to new forgery methods, thus situating face forgery detection within a continual learning paradigm. However, when detectors learn new forgery types, their performance on previous types often degrades rapidly, a phenomenon known as catastrophic forgetting. Kolmogorov-Arnold Networks (KANs) utilize locally plastic splines as their activation functions, enabling them to learn new tasks by modifying only local regions of the functions while leaving other areas unaffected. Therefore, they are naturally suitable for addressing catastrophic forgetting. However, KANs have two significant limitations: 1) the splines are ineffective for modeling high-dimensional images, while alternative activation functions that are suitable for images lack the essential property of locality; 2) in continual learning, when features from different domains overlap, the mapping of different domains to distinct curve regions always collapses due to repeated modifications of the same regions. In this paper, we propose a KAN-based Continual Face Forgery Detection (KAN-CFD) framework, which includes a Domain-Group KAN Detector (DG-KD) and a data-free replay Feature Separation strategy via KAN Drift Compensation Projection (FS-KDCP). DG-KD enables KANs to fit high-dimensional image inputs while preserving locality and local plasticity. FS-KDCP avoids the overlap of the KAN input spaces without using data from prior tasks. Experimental results demonstrate that the proposed method achieves superior performance while notably reducing forgetting.
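The local-plasticity argument is easiest to see in one dimension: a KAN activation is a spline whose basis functions have local support, so learning a new task moves the curve only near the inputs that task occupies. A toy illustration (not the paper's DG-KD module) using a piecewise-linear spline:

```python
import numpy as np

grid = np.linspace(-3.0, 3.0, 11)   # knot locations
coef = np.zeros_like(grid)          # learnable values at the knots

def phi(x):
    """Spline activation: np.interp realizes a sum of local 'hat' bases."""
    return np.interp(x, grid, coef)

probes = np.array([-2.5, 0.0, 2.5])
before = phi(probes)
coef[9] += 0.5                      # 'learn' something near x = 2.4
after = phi(probes)
print(before, after)                # only the response near the edited knot moves
```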

[265] IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection

Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, Chao Huang

Main category: cs.CV

TL;DR: IAD-R1 is a universal post-training framework for Vision-Language Models (VLMs) that enhances industrial anomaly detection through a two-stage training strategy, achieving significant improvements over baselines and outperforming commercial models.

DetailsMotivation: Traditional anomaly detection methods are limited by the scarcity of defective samples, and VLMs, despite their generalization capabilities, underperform in industrial settings.

Method: IAD-R1 uses a two-stage approach: PA-SFT for anomaly perception training with a Chain-of-Thought dataset (Expert-AD) and SC-GRPO for optimizing anomaly interpretation with reward functions.

Result: IAD-R1 improves accuracy by up to 43.3% on the DAGM dataset and outperforms models like GPT-4.1 and Claude-Sonnet-4 in zero-shot settings.

Conclusion: IAD-R1 is effective and superior for industrial anomaly detection, with publicly available resources for further use.

Abstract: Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from “Anomaly Perception” to “Anomaly Interpretation”. Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs; the largest gain is on the DAGM dataset, where average accuracy is 43.3% higher than the 0.5B baseline. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at https://github.com/Yanhui-Lee/IAD-R1.

[266] EditGarment: An Instruction-Based Garment Editing Dataset Constructed with Automated MLLM Synthesis and Semantic-Aware Evaluation

Deqiang Yin, Junyi Guo, Huanda Lu, Fangyu Wu, Dongming Lu

Main category: cs.CV

TL;DR: The paper introduces an automated pipeline to create a garment editing dataset (EditGarment) using natural language instructions, addressing the lack of high-quality data in this domain.

DetailsMotivation: The scarcity of high-quality instruction-image pairs for garment editing limits progress, as manual annotation is costly and hard to scale. Existing methods lack precision and fashion-specific supervision.

Method: The authors define six editing instruction categories and introduce Fashion Edit Score, a semantic-aware metric, to generate and evaluate balanced, diverse instruction-image triplets.

Result: They construct 52,257 candidate triplets, retaining 20,596 high-quality ones to build EditGarment, the first dataset for standalone garment editing.

Conclusion: The proposed pipeline and dataset (EditGarment) address key challenges in garment editing, enabling better understanding of garment-specific semantics and attribute dependencies.

Abstract: Instruction-based garment editing enables precise image modifications via natural language, with broad applications in fashion design and customization. Unlike general editing tasks, it requires understanding garment-specific semantics and attribute dependencies. However, progress is limited by the scarcity of high-quality instruction-image pairs, as manual annotation is costly and hard to scale. While MLLMs have shown promise in automated data synthesis, their application to garment editing is constrained by imprecise instruction modeling and a lack of fashion-specific supervisory signals. To address these challenges, we present an automated pipeline for constructing a garment editing dataset. First, we define six editing instruction categories aligned with real-world fashion workflows to guide the generation of balanced and diverse instruction-image triplets. Second, we introduce Fashion Edit Score, a semantic-aware evaluation metric that captures semantic dependencies between garment attributes and provides reliable supervision during construction. Using this pipeline, we construct a total of 52,257 candidate triplets and retain 20,596 high-quality triplets to build EditGarment, the first instruction-based dataset tailored to standalone garment editing. The project page is https://yindq99.github.io/EditGarment-project/.

[267] A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality

Rongqian Chen, Allison Andreyev, Yanming Xiu, Mahdi Imani, Bin Li, Maria Gorlatova, Gang Tan, Tian Lan

Main category: cs.CV

TL;DR: CADAR is a neurosymbolic approach for detecting cognitive attacks in AR, combining neural VLMs with symbolic reasoning for improved accuracy and interpretability.

DetailsMotivation: Existing methods for detecting cognitive attacks in AR lack semantic reasoning or rely on black-box models, limiting effectiveness and interpretability.

Method: CADAR fuses vision-language inputs into a symbolic perception-graph, using particle-filter based statistical reasoning for attack detection.

Result: Experiments show CADAR improves accuracy by up to 10.7% over baselines in challenging AR attack scenarios.

Conclusion: CADAR demonstrates the potential of neurosymbolic methods for effective and interpretable cognitive attack detection in AR.

Abstract: Augmented Reality (AR) enriches perception by overlaying virtual elements on the physical world. Due to its growing popularity, cognitive attacks that alter AR content to manipulate users’ semantic perception have received increasing attention. Existing detection methods often focus on visual changes, which are restricted to pixel- or image-level processing and lack semantic reasoning capabilities, or they rely on pre-trained vision-language models (VLMs), which function as black-box approaches with limited interpretability. In this paper, we present CADAR, a novel neurosymbolic approach for cognitive attack detection in AR. It fuses multimodal vision-language inputs using neural VLMs to obtain a symbolic perception-graph representation, incorporating prior knowledge, salience weighting, and temporal correlations. The model then enables particle-filter based statistical reasoning (a sequential Monte Carlo method) to detect cognitive attacks. Thus, CADAR inherits the adaptability of pre-trained VLMs and the interpretability and reasoning rigor of particle filtering. Experiments on an extended AR cognitive attack dataset show accuracy improvements of up to 10.7% over strong baselines on challenging AR attack scenarios, underscoring the promise of neurosymbolic methods for effective and interpretable cognitive attack detection.
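The abstract does not specify the filter's state space, so the following is a hedged sketch: a bootstrap particle filter over a binary "attack present" state, with the perception graph collapsed into a per-frame anomaly score (the transition and likelihood models are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles: np.ndarray, obs: float, p_flip: float = 0.05) -> np.ndarray:
    """One bootstrap-filter step over a binary 'attack present' state.

    particles: (N,) array of 0/1 latent states
    obs:       per-frame anomaly score in [0, 1] from the perception graph
    """
    # Transition: each particle flips state with small probability.
    flip = rng.random(particles.shape) < p_flip
    particles = np.where(flip, 1 - particles, particles)
    # Likelihood: attacked states explain high anomaly scores better.
    w = np.where(particles == 1, obs, 1.0 - obs) + 1e-9
    w /= w.sum()
    # Multinomial resampling.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

particles = rng.integers(0, 2, size=500)
for obs in [0.2, 0.3, 0.9, 0.95]:   # anomaly scores rising over time
    particles = pf_step(particles, obs)
print("P(attack) ~", particles.mean())
```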

[268] Leveraging AI to Accelerate Medical Data Cleaning: A Comparative Study of AI-Assisted vs. Traditional Methods

Matthew Purri, Amit Patel, Erik Deurrell

Main category: cs.CV

TL;DR: Octozi, an AI-assisted platform, improves clinical trial data cleaning with a 6.03-fold throughput gain and a 6.44-fold error reduction, saving an estimated $5.1M in a Phase III trial.

DetailsMotivation: Manual data cleaning in clinical trials is inefficient and costly, necessitating AI solutions to handle increasing data complexity.

Method: Octozi combines large language models with domain-specific heuristics for medical data review, tested in a controlled study with 10 reviewers.

Result: AI assistance increased throughput by 6.03-fold, reduced errors from 54.67% to 8.48%, and saved $5.1M in a Phase III trial.

Conclusion: AI-assisted approaches can transform clinical trial operations, accelerating timelines and reducing costs while maintaining compliance.

Abstract: Clinical trial data cleaning represents a critical bottleneck in drug development, with manual review processes struggling to manage exponentially increasing data volumes and complexity. This paper presents Octozi, an artificial intelligence-assisted platform that combines large language models with domain-specific heuristics to transform medical data review. In a controlled experimental study with experienced medical reviewers (n=10), we demonstrate that AI assistance increased data cleaning throughput by 6.03-fold while simultaneously decreasing cleaning errors from 54.67% to 8.48% (a 6.44-fold improvement). Crucially, the system reduced false positive queries by 15.48-fold, minimizing unnecessary site burden. Economic analysis of a representative Phase III oncology trial reveals potential cost savings of $5.1 million, primarily driven by accelerated database lock timelines (5-day reduction saving $4.4M), improved medical review efficiency ($420K savings), and reduced query management burden ($288K savings). These improvements were consistent across reviewers regardless of experience level, suggesting broad applicability. Our findings indicate that AI-assisted approaches can address fundamental inefficiencies in clinical trial operations, potentially accelerating key drug development milestones, such as database lock, by 33% while maintaining regulatory compliance and significantly reducing operational costs. This work establishes a framework for integrating AI into safety-critical clinical workflows and demonstrates the transformative potential of human-AI collaboration in pharmaceutical clinical trials.

[269] Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method

Masoumeh Sharafi, Soufiane Belharbi, Houssem Ben Salem, Ali Etemad, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger

Main category: cs.CV

TL;DR: The paper proposes Personalized Feature Translation (PFT) for source-free domain adaptation (SFDA) in facial expression recognition (FER), addressing challenges like subtle expressions and single-class target data.

DetailsMotivation: Deep FER models struggle with subtle expressions and inter-subject variability. SFDA methods are needed to adapt models without source data, but current methods fail with single-class target data.

Method: PFT operates in latent space, pre-training a translator on source data to transform style features while preserving expression. It adapts using neutral target data without image synthesis.

Result: PFT avoids image generation complexity, reduces computational overhead, and produces discriminative embeddings for classification.

Conclusion: PFT is efficient, lightweight, and effective for SFDA in FER, eliminating the need for image synthesis and adapting only part of the model.

Abstract: Facial expression recognition (FER) models are employed in many video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting their performance in real-world applications. To improve their performance, source-free domain adaptation (SFDA) methods have been proposed to personalize a pretrained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission constraints. This paper addresses a challenging scenario where source data is unavailable for adaptation, and only unlabeled target data consisting solely of neutral expressions is available. SFDA methods are not typically designed to adapt using target data from only a single class. Further, using models to generate facial images with non-neutral expressions can be unstable and computationally intensive. In this paper, personalized feature translation (PFT) is proposed for SFDA. Unlike current image translation methods for SFDA, our lightweight method operates in the latent space. We first pre-train the translator on the source domain data to transform the subject-specific style features from one source subject into another. Expression information is preserved by optimizing a combination of expression consistency and style-aware objectives. Then, the translator is adapted on neutral target data, without using source data or image synthesis. By translating in the latent space, PFT avoids the complexity and noise of face expression generation, producing discriminative embeddings optimized for classification. Using PFT eliminates the need for image synthesis, reduces computational overhead (using a lightweight translator), and only adapts part of the model, making the method efficient compared to image-based translation.
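The translator is trained with a combination of expression-consistency and style-aware objectives; one plausible instantiation is sketched below (the loss forms and weighting are assumptions, not the paper's equations):

```python
import torch
import torch.nn.functional as F

def pft_translator_loss(z_out: torch.Tensor,
                        z_style_target: torch.Tensor,
                        expr_logits_out: torch.Tensor,
                        expr_logits_src: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """Style alignment + expression consistency for a latent-space translator.

    z_out:             translated latent features
    z_style_target:    style features of the target subject
    expr_logits_out:   expression-classifier logits after translation
    expr_logits_src:   expression-classifier logits of the source input
    """
    style_loss = F.mse_loss(z_out, z_style_target)
    # Expression consistency: translated features must keep the same
    # expression distribution as the source (KL to the source posterior).
    expr_loss = F.kl_div(expr_logits_out.log_softmax(dim=-1),
                         expr_logits_src.softmax(dim=-1),
                         reduction="batchmean")
    return style_loss + alpha * expr_loss
```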

[270] S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision

Huihui Xu, Jin Ye, Hongqiu Wang, Changkai Ji, Jiashi Lin, Ming Hu, Ziyan Huang, Ying Chen, Chenglong Ma, Tianbin Li, Lihao Liu, Junjun He, Lei Zhu

Main category: cs.CV

TL;DR: The paper introduces Fast Universal Agglomerative Pooling (UniAP) and S2-UniSeg, a scalable self-supervised universal segmentation model, addressing inefficiencies in pseudo-mask generation and achieving superior performance on benchmarks.

DetailsMotivation: Current self-supervised segmentation models suffer from time-consuming pseudo-mask generation and discontinuous optimization, limiting scalability and performance.

Method: Proposes UniAP for fast pseudo-mask generation and S2-UniSeg with QuerySD for continuous pretraining, leveraging a student-teacher architecture.

Result: Outperforms UnSAM with significant improvements on COCO, UVO, COCOStuff-27, and Cityscapes, and scales well to larger datasets.

Conclusion: S2-UniSeg offers an efficient, scalable solution for self-supervised segmentation, achieving state-of-the-art results.

Abstract: Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between training epochs. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these issues, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for a single image. Based on the fast UniAP, we propose the Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, and RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at https://github.com/bio-mlhui/S2-UniSeg
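The student and momentum teacher follow the usual exponential-moving-average pattern; a minimal sketch of the teacher update (the momentum value is an assumption):

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.996):
    """Momentum teacher: teacher <- m * teacher + (1 - m) * student."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)
```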

[271] EventRR: Event Referential Reasoning for Referring Video Object Segmentation

Huihui Xu, Jiashi Lin, Haoyu Chen, Junjun He, Lei Zhu

Main category: cs.CV

TL;DR: EventRR framework improves RVOS by decoupling it into object summarization and referent reasoning, leveraging semantic event structures for better performance.

DetailsMotivation: Current RVOS methods overlook semantic structure in referring expressions, especially for videos, which include event attributes and temporal relations.

Method: EventRR decouples RVOS into object summarization (using bottleneck tokens and video-level aggregation) and referent reasoning (via Referential Event Graph and Temporal Concept-Role Reasoning).

Result: EventRR outperforms state-of-the-art RVOS methods on four benchmark datasets.

Conclusion: EventRR effectively addresses the complexity of video-referring expressions by structured reasoning, setting a new benchmark for RVOS.

Abstract: Referring Video Object Segmentation (RVOS) aims to segment out the object in a video referred to by an expression. Current RVOS methods view referring expressions as unstructured sequences, neglecting the semantic structure that is essential for referent reasoning. Besides, in contrast to image-referring expressions, whose semantics focus only on object attributes and object-object relations, video-referring expressions also encompass event attributes and event-event temporal relations. This complexity challenges structured reasoning approaches designed for images. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into an object summarization part and a referent reasoning part. The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video-level summarization step to exchange the global cross-modal temporal context. For the reasoning part, EventRR extracts the semantic eventful structure of a video-referring expression into a highly expressive Referential Event Graph (REG), which is a single-rooted directed acyclic graph. Guided by topological traversal of the REG, we propose Temporal Concept-Role Reasoning (TCRR) to accumulate the referring score of each temporal query from REG leaf nodes to the root node. Each reasoning step can be interpreted as a question-answer pair derived from the concept-role relations in the REG. Extensive experiments across four widely recognized benchmark datasets show that EventRR quantitatively and qualitatively outperforms state-of-the-art RVOS methods. Code is available at https://github.com/bio-mlhui/EventRR
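TCRR accumulates referring scores from REG leaves up to the root. A minimal sketch of that bottom-up pass over a single-rooted DAG (the combine rule and the toy graph are assumptions for illustration):

```python
def referent_score(node, children, leaf_score, combine=min):
    """Bottom-up score accumulation over a single-rooted DAG.

    children:   dict node -> list of child nodes (absent/empty for leaves)
    leaf_score: dict leaf -> concept match score in [0, 1]
    combine:    fusion of child scores (assumed min: all sub-conditions hold)
    """
    kids = children.get(node, [])
    if not kids:
        return leaf_score[node]          # leaf: concept-level match score
    return combine(referent_score(c, children, leaf_score, combine) for c in kids)

# "the dog that runs after the ball": the root fuses actor and event evidence
children = {"referent": ["dog", "runs-after"], "runs-after": ["ball"]}
scores = {"dog": 0.9, "ball": 0.7}
print(referent_score("referent", children, scores))  # 0.7
```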

[272] TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning

Junzhe Xu, Yuyang Yin, Xi Chen

Main category: cs.CV

TL;DR: TBAC-UniImage integrates a Diffusion Model with a Multimodal Large Language Model (MLLM) using multiple layer representations for deeper unification of understanding and generation.

DetailsMotivation: Overcome limitations of shallow connections in diffusion-based unified models and high computational costs of training from scratch.

Method: Uses representations from multiple MLLM layers as generative conditions for the diffusion model.

Result: Achieves deeper and more fine-grained unification of understanding and generation.

Conclusion: TBAC-UniImage presents a novel, efficient paradigm for multimodal understanding and generation.

Abstract: This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM’s final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM’s intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM’s understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.
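The "ladder" idea, drawing generative conditions from several MLLM depths rather than only the final hidden state, can be sketched as follows (the layer indices and per-layer projections are assumptions):

```python
import torch
import torch.nn as nn

class LadderConditions(nn.Module):
    """Project hidden states from several MLLM layers into diffusion conditions."""

    def __init__(self, mllm_dim: int, cond_dim: int, layer_ids=(8, 16, 24)):
        super().__init__()
        self.layer_ids = layer_ids
        self.projs = nn.ModuleList(nn.Linear(mllm_dim, cond_dim) for _ in layer_ids)

    def forward(self, hidden_states):
        # hidden_states: sequence of (batch, seq, mllm_dim) tensors, one per layer
        return [proj(hidden_states[i])
                for i, proj in zip(self.layer_ids, self.projs)]
```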

[273] Preacher: Paper-to-Video Agentic System

Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang

Main category: cs.CV

TL;DR: Preacher is a paper-to-video system that decomposes, summarizes, and reformulates research papers into structured video abstracts, overcoming limitations of current video generation models.

DetailsMotivation: To address the constraints of existing video generation models (limited context, rigid duration, lack of diversity, and domain-specific knowledge representation) in converting research papers into video abstracts.

Method: Preacher uses a top-down approach for decomposition and summarization, followed by bottom-up video generation with Progressive Chain of Thought (P-CoT) for iterative planning and cross-modal alignment.

Result: Preacher generates high-quality video abstracts across five research fields, outperforming current video generation models.

Conclusion: Preacher demonstrates expertise in paper-to-video conversion, offering a scalable solution for creating accessible video abstracts.

Abstract: The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/GenVerse/Paper2Video

[274] Reinforcement Learning in Vision: A Survey

Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou

Main category: cs.CV

TL;DR: A survey synthesizing recent advances in visual reinforcement learning (RL), covering problem formalization, policy-optimization evolution, thematic pillars, and evaluation protocols, while highlighting open challenges.

DetailsMotivation: To provide a coherent overview of the rapidly expanding field of visual RL, organizing works into thematic pillars and identifying future research directions.

Method: The survey formalizes visual RL problems, traces policy-optimization evolution, categorizes 200+ works into four pillars, and reviews evaluation protocols.

Result: Organized trends like curriculum-driven training and unified reward modeling, alongside open challenges such as sample efficiency and safe deployment.

Conclusion: The survey maps the visual RL landscape, offering researchers a structured overview and highlighting promising future directions.

Abstract: Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar, we examine algorithmic design, reward engineering, and benchmark progress, and distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.

[275] From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts

Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Chengming Xu, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma

Main category: cs.CV

TL;DR: The paper introduces a Mixture of Facial Experts (MoFE) and a data processing pipeline to improve identity preservation in video generation, especially for large facial angles. It also curates a new dataset (LFA) to address dataset limitations.

DetailsMotivation: Current video generation models fail to preserve identity under large facial angles due to ineffective feature integration and lack of suitable datasets.

Method: Proposes MoFE, combining three experts for facial attributes, and a data pipeline (Face Constraints and Identity Consistency) to create the LFA dataset.

Result: Outperforms SOTA methods in face similarity, FID, and CLIP alignment on the LFA benchmark.

Conclusion: The MoFE and LFA dataset effectively address identity preservation challenges in video generation for large facial angles.

Abstract: Current video generation models struggle with identity preservation under large facial angles, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a Mixture of Facial Experts (MoFE) that dynamically combines complementary cues from three specialized experts, each designed to capture distinct but mutually reinforcing aspects of facial attributes. The identity expert captures cross-pose identity-sensitive features, the semantic expert extracts high-level visual semantics, and the detail expert preserves pixel-level features (e.g., skin texture, color gradients). Furthermore, to mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. Face Constraints ensure facial angle diversity and a high proportion of facial regions, while Identity Consistency preserves coherent person-specific features across temporal sequences, collectively addressing the scarcity of large facial angles and identity-stable training data in existing datasets. Leveraging this pipeline, we have curated and refined a Large Face Angles (LFA) Dataset from existing open-source human video datasets, comprising 460K video clips with annotated facial angles. Experimental results on the LFA benchmark demonstrate that our method, empowered by the LFA dataset, significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment. The code and dataset will be made publicly available at https://github.com/rain152/LFA-Video-Generation.
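The abstract describes three specialist experts whose cues are combined dynamically; one simple realization is a learned softmax gate over the expert features, sketched below (the gate design is an assumption, not the paper's architecture):

```python
import torch
import torch.nn as nn

class MoFEFusion(nn.Module):
    """Gate-weighted fusion of identity, semantic, and detail expert features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, f_id, f_sem, f_detail):
        feats = torch.stack([f_id, f_sem, f_detail], dim=1)       # (B, 3, D)
        w = torch.softmax(self.gate(feats.flatten(1)), dim=-1)    # (B, 3)
        return (w.unsqueeze(-1) * feats).sum(dim=1)               # (B, D)
```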

[276] SOI is the Root of All Evil: Quantifying and Breaking Similar Object Interference in Single Object Tracking

Yipei Wang, Shiyu Hu, Shukun Jia, Panxi Xu, Hongfei Ma, Yiping Ma, Jing Zhang, Xiaobo Lu, Xin Zhao

Main category: cs.CV

TL;DR: The paper investigates Similar Object Interference (SOI) in Single Object Tracking (SOT), introduces SOIBench for semantic cognitive guidance, and proposes a VLM-based paradigm for improved tracking performance.

DetailsMotivation: To address the overlooked issue of SOI in SOT and explore the potential of external cognitive guidance, particularly natural language, to enhance tracking robustness.

Method: Conducts OIM experiments to quantify SOI, creates SOIBench for semantic guidance, and integrates large-scale VLMs into RGB trackers.

Result: Eliminating SOI improves tracker performance (AUC gains of up to 4.35). The proposed VLM-based paradigm outperforms existing VLT methods (AUC gains of up to 0.93).

Conclusion: SOIBench and VLM integration offer significant advancements in semantic cognitive tracking, setting a new standard for future research.

Abstract: In this paper, we present the first systematic investigation and quantification of Similar Object Interference (SOI), a long-overlooked yet critical bottleneck in Single Object Tracking (SOT). Through controlled Online Interference Masking (OIM) experiments, we quantitatively demonstrate that eliminating interference sources leads to substantial performance improvements (AUC gains of up to 4.35) across all SOTA trackers, directly validating SOI as a primary constraint on robust tracking and highlighting the feasibility of external cognitive guidance. Building upon these insights, we adopt natural language as a practical form of external guidance, and construct SOIBench, the first semantic cognitive guidance benchmark specifically targeting SOI challenges. It automatically mines SOI frames through multi-tracker collective judgment and introduces a multi-level annotation protocol to generate precise semantic guidance texts. Systematic evaluation on SOIBench reveals a striking finding: existing vision-language tracking (VLT) methods fail to effectively exploit semantic cognitive guidance, achieving only marginal improvements or even performance degradation (AUC changes of -0.26 to +0.71). In contrast, we propose a novel paradigm employing large-scale vision-language models (VLMs) as external cognitive engines that can be seamlessly integrated into arbitrary RGB trackers. This approach demonstrates substantial improvements under semantic cognitive guidance (AUC gains of up to 0.93), representing a significant advancement over existing VLT methods. We hope SOIBench will serve as a standardized evaluation platform to advance semantic cognitive tracking research and contribute new insights to the tracking research community.

[277] Iterative Volume Fusion for Asymmetric Stereo Matching

Yuanting Gao, Linghao Shen

Main category: cs.CV

TL;DR: The paper addresses stereo matching challenges in asymmetric multi-camera systems by proposing a two-phase Iterative Volume Fusion network (IVF-AStereo) to enhance matching accuracy.

DetailsMotivation: Traditional stereo matching assumes symmetric visual properties, but asymmetric systems (e.g., tele-wide cameras) disrupt this, complicating cost volume computation.

Method: The proposed IVF-AStereo method refines correlation volume via aggregated concatenation and fuses volumes to improve fine details.

Result: The method performs robustly in asymmetric scenarios and outperforms benchmarks in handling resolution and color degradation.

Conclusion: IVF-AStereo effectively addresses asymmetric stereo matching challenges, validated by experiments and ablation studies.

Abstract: Stereo matching is vital in 3D computer vision, with most algorithms assuming symmetric visual properties between binocular visions. However, the rise of asymmetric multi-camera systems (e.g., tele-wide cameras) challenges this assumption and complicates stereo matching. Visual asymmetry disrupts stereo matching by affecting the crucial cost volume computation. To address this, we explore the matching cost distribution of two established cost volume construction methods in asymmetric stereo. We find that each cost volume experiences distinct information distortion, indicating that both should be comprehensively utilized to solve the issue. Based on this, we propose the two-phase Iterative Volume Fusion network for Asymmetric Stereo matching (IVF-AStereo). Initially, the aggregated concatenation volume refines the correlation volume. Subsequently, both volumes are fused to enhance fine details. Our method excels in asymmetric scenarios and shows robust performance against significant visual asymmetry. Extensive comparative experiments on benchmark datasets, along with ablation studies, confirm the effectiveness of our approach in asymmetric stereo with resolution and color degradation.
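Of the two cost volumes the method fuses, the correlation volume admits a compact standard construction, shown below as a sketch (the paper's exact variant is not specified in the abstract):

```python
import torch

def correlation_volume(feat_l: torch.Tensor, feat_r: torch.Tensor,
                       max_disp: int) -> torch.Tensor:
    """Left/right features (B, C, H, W) -> correlation cost volume (B, D, H, W).

    Entry (d, y, x) correlates the left feature at x with the right
    feature shifted d pixels to the left (the candidate disparity).
    """
    B, C, H, W = feat_l.shape
    vol = feat_l.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        vol[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :W - d]).mean(dim=1)
    return vol
```

A concatenation volume instead stacks the two feature maps channel-wise per disparity, keeping more information at higher memory cost; the paper's point is that under visual asymmetry the two volumes are distorted differently and so are fused rather than chosen between.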

[278] WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization

Jiahao Wen, Hang Yu, Zhedong Zheng

Main category: cs.CV

TL;DR: WeatherPrompt improves drone geo-localization under diverse weather by fusing image and text embeddings, achieving significant recall improvements.

DetailsMotivation: Existing methods struggle with weather perturbations due to limited weather categories and poor feature disentanglement.

Method: Uses a training-free weather reasoning mechanism and a dynamic gating framework with cross-modal objectives.

Result: Boosts Recall@1 by +13.37% (night) and +18.69% (fog/snow) over state-of-the-art methods.

Conclusion: WeatherPrompt effectively addresses weather-invariant geo-localization for drones.

Abstract: Visual geo-localization for drones faces critical degradation under weather perturbations, e.g., rain and fog, where existing methods struggle with two inherent limitations: 1) heavy reliance on limited weather categories that constrain generalization, and 2) suboptimal disentanglement of entangled scene-weather features through pseudo weather categories. We present WeatherPrompt, a multi-modality learning paradigm that establishes weather-invariant representations by fusing the image embedding with the text context. Our framework introduces two key contributions: First, a Training-free Weather Reasoning mechanism that employs off-the-shelf large multi-modality models to synthesize multi-weather textual descriptions through human-like reasoning. It improves scalability to unseen or complex weather and can reflect different weather strengths. Second, to better disentangle the scene and weather features, we propose a multi-modality framework with a dynamic gating mechanism driven by the text embedding to adaptively reweight and fuse visual features across modalities. The framework is further optimized by cross-modal objectives, including image-text contrastive learning and image-text matching, which map the same scene under different weather conditions closer together in the representation space. Extensive experiments validate that, under diverse weather conditions, our method achieves competitive recall rates compared to state-of-the-art drone geo-localization methods. Notably, it improves Recall@1 by +13.37% under night conditions and by +18.69% under fog and snow conditions.
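Among the cross-modal objectives, image-text contrastive learning typically takes the symmetric InfoNCE form sketched below (the temperature and normalization are assumptions, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def image_text_contrastive(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss; matched image-text pairs sit on the diagonal."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```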

[279] SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs

Bei Yan, Zhiyuan Chen, Yuecong Min, Jie Zhang, Jiahao Wang, Xiaozhen Wang, Shiguang Shan

Main category: cs.CV

TL;DR: The paper introduces SHALE, a scalable benchmark for evaluating hallucinations in Large Vision-Language Models (LVLMs), addressing both faithfulness and factuality issues with fine-grained analysis.

DetailsMotivation: Current LVLMs suffer from hallucinations (inconsistent outputs), but existing benchmarks lack fine-grained analysis and scalability.

Method: Proposes an automated data pipeline and hierarchical hallucination induction framework to create SHALE, a benchmark with 30K+ image-instruction pairs.

Result: Experiments show significant factuality hallucinations and sensitivity to perturbations in LVLMs.

Conclusion: SHALE provides a scalable, fine-grained solution for evaluating LVLM hallucinations, highlighting their limitations.

Abstract: Despite rapid advances, Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a rather coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks often rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.

[280] Semantic-aware DropSplat: Adaptive Pruning of Redundant Gaussians for 3D Aerial-View Segmentation

Xu Tang, Junan Jia, Yijing Wang, Jingjing Ma, Xiangrong Zhang

Main category: cs.CV

TL;DR: SAD-Splat is a novel 3D-AVS-SS method that improves segmentation accuracy by addressing semantic ambiguity with a Gaussian point drop module and pseudo-label generation.

DetailsMotivation: Traditional methods struggle with semantic ambiguity due to scale variations and occlusions in aerial images, limiting accuracy.

Method: Introduces a Gaussian point drop module with semantic confidence estimation and a pseudo-label generation pipeline using 2D foundation models.

Result: Achieves a balance between segmentation accuracy and compactness, validated on the new 3D-AS benchmark.

Conclusion: SAD-Splat provides an efficient, scalable solution for 3D aerial scene understanding.

Abstract: In the task of 3D Aerial-view Scene Semantic Segmentation (3D-AVS-SS), traditional methods struggle to address semantic ambiguity caused by scale variations and structural occlusions in aerial images. This limits their segmentation accuracy and consistency. To tackle these challenges, we propose a novel 3D-AVS-SS approach named SAD-Splat. Our method introduces a Gaussian point drop module, which integrates semantic confidence estimation with a learnable sparsity mechanism based on the Hard Concrete distribution. This module effectively eliminates redundant and semantically ambiguous Gaussian points, enhancing both segmentation performance and representation compactness. Furthermore, SAD-Splat incorporates a high-confidence pseudo-label generation pipeline. It leverages 2D foundation models to enhance supervision when ground-truth labels are limited, thereby further improving segmentation accuracy. To advance research in this domain, we introduce a challenging benchmark dataset: 3D Aerial Semantic (3D-AS), which encompasses diverse real-world aerial scenes with sparse annotations. Experimental results demonstrate that SAD-Splat achieves an excellent balance between segmentation accuracy and representation compactness. It offers an efficient and scalable solution for 3D aerial scene understanding.
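The learnable sparsity mechanism is based on the Hard Concrete distribution; the sketch below follows the standard Hard Concrete gate of Louizos et al. (2018), with the usual stretch-and-clip parameters (which the paper may tune differently):

```python
import torch

def hard_concrete_gate(log_alpha: torch.Tensor, beta: float = 2 / 3,
                       gamma: float = -0.1, zeta: float = 1.1,
                       training: bool = True) -> torch.Tensor:
    """Per-Gaussian keep gate in [0, 1]; exact zeros prune redundant points."""
    if training:
        u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
        # Reparameterized Concrete sample with temperature beta.
        s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / beta)
    else:
        s = torch.sigmoid(log_alpha)
    s = s * (zeta - gamma) + gamma        # stretch to (gamma, zeta)
    return s.clamp(0.0, 1.0)              # hard clip yields exact 0s and 1s
```

Because the stretched interval extends past [0, 1], the clip produces exact zeros with nonzero probability, so Gaussians whose gates collapse to zero can be dropped outright rather than merely down-weighted.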

cs.AI

[281] A Survey of Optimization Modeling Meets LLMs: Progress and Future Directions

Ziyang Xiao, Jingrong Xie, Lilin Xu, Shisi Guan, Jingyan Zhu, Xiongwei Han, Xiaojin Fu, WingYin Yu, Han Wu, Wei Shi, Qingcan Kang, Jiahui Duan, Tao Zhong, Mingxuan Yuan, Jia Zeng, Yuan Wang, Gang Chen, Dongxiang Zhang

Main category: cs.AI

TL;DR: The paper reviews advancements in using large language models (LLMs) to automate optimization modeling, covering data synthesis, fine-tuning, inference frameworks, benchmarks, and evaluation. It highlights dataset quality issues, introduces a cleaned dataset leaderboard, and provides an online portal for resources. Future research directions are also outlined.

DetailsMotivation: To address the expertise gap in optimization modeling by leveraging LLMs for automation, improving dataset quality, and fostering community collaboration.

Method: Survey of recent advancements, dataset cleaning, construction of a new leaderboard, and development of an online portal.

Result: Identified high error rates in benchmark datasets, created a cleaned dataset leaderboard, and built an online resource portal.

Conclusion: Current methodologies have limitations, but the work provides a foundation for future research and community benefits.

Abstract: By virtue of its great utility in solving real-world problems, optimization modeling has been widely employed for optimal decision-making across various sectors, but it requires substantial expertise from operations research professionals. With the advent of large language models (LLMs), new opportunities have emerged to automate the procedure of mathematical modeling. This survey presents a comprehensive and timely review of recent advancements that cover the entire technical stack, including data synthesis and fine-tuning for the base model, inference frameworks, benchmark datasets, and performance evaluation. In addition, we conducted an in-depth analysis on the quality of benchmark datasets, which was found to have a surprisingly high error rate. We cleaned the datasets and constructed a new leaderboard with fair performance evaluation in terms of base LLM model and datasets. We also build an online portal that integrates resources of cleaned datasets, code and paper repository to benefit the community. Finally, we identify limitations in current methodologies and outline future research opportunities.

[282] Amazon Nova AI Challenge – Trusted AI: Advancing secure, AI-assisted software development

Sattvik Sahai, Prasoon Goyal, Michael Johnston, Anna Gottardi, Yao Lu, Lucy Hu, Luke Dai, Shaohua Liu, Samyuth Sagi, Hangjie Shi, Desheng Zhang, Lavina Vaz, Leslie Ball, Maureen Murray, Rahul Gupta, Shankar Ananthakrishna

Main category: cs.AI

TL;DR: The paper discusses the Amazon Nova AI Challenge, where university teams competed to advance secure AI in software development through red-teaming and safety alignment methods.

DetailsMotivation: To address safety challenges in AI systems for software development by fostering innovation through a global competition.

Method: Teams developed automated red-teaming bots and safe AI assistants, evaluated through adversarial tournaments and iterative improvements using annotated data.

Result: State-of-the-art techniques were introduced, including reasoning-based safety alignment, robust model guardrails, and efficient probing of LLMs.

Conclusion: The collaborative effort raised the bar for AI safety, showcasing advancements in secure AI development.

Abstract: AI systems for software development are rapidly gaining prominence, yet significant challenges remain in ensuring their safety. To address this, Amazon launched the Trusted AI track of the Amazon Nova AI Challenge, a global competition among 10 university teams to drive advances in secure AI. In the challenge, five teams focus on developing automated red-teaming bots, while the other five create safe AI assistants. This challenge provides teams with a unique platform to evaluate automated red-teaming and safety alignment methods through head-to-head adversarial tournaments where red teams have multi-turn conversations with the competing AI coding assistants to test their safety alignment. Along with this, the challenge provides teams with a feed of high-quality annotated data to fuel iterative improvement. Throughout the challenge, teams developed state-of-the-art techniques, introducing novel approaches in reasoning-based safety alignment, robust model guardrails, multi-turn jail-breaking, and efficient probing of large language models (LLMs). To support these efforts, the Amazon Nova AI Challenge team made substantial scientific and engineering investments, including building a custom baseline coding specialist model for the challenge from scratch, developing a tournament orchestration service, and creating an evaluation harness. This paper outlines the advancements made by university teams and the Amazon Nova AI Challenge team in addressing the safety challenges of AI for software development, highlighting this collaborative effort to raise the bar for AI safety.

[283] MCP-Orchestrated Multi-Agent System for Automated Disinformation Detection

Alexandru-Andrei Avram, Adrian Groza, Alexandru Lecu

Main category: cs.AI

TL;DR: A multi-agent AI system detects disinformation in news articles with high accuracy using relation extraction and ensemble methods.

DetailsMotivation: Address the challenge of disinformation spread by developing an automated, scalable detection system.

Method: Combines four agents (ML, Wikipedia knowledge check, coherence detection, web-scraped data analyzer) orchestrated via Model Context Protocol (MCP).

Result: Achieves 95.3% accuracy and F1 score of 0.964, outperforming individual agents and traditional methods.

Conclusion: The modular, scalable system effectively detects disinformation while maintaining decision transparency.

Abstract: The large spread of disinformation across digital platforms creates significant challenges to information integrity. This paper presents a multi-agent system that uses relation extraction to detect disinformation in news articles, focusing on titles and short text snippets. The proposed Agentic AI system combines four agents: (i) a machine learning agent (logistic regression), (ii) a Wikipedia knowledge check agent (which relies on named entity recognition), (iii) a coherence detection agent (using LLM prompt engineering), and (iv) a web-scraped data analyzer that extracts relational triplets for fact checking. The system is orchestrated via the Model Context Protocol (MCP), offering shared context and live learning across components. Results demonstrate that the multi-agent ensemble achieves 95.3% accuracy with an F1 score of 0.964, significantly outperforming individual agents and traditional approaches. The weighted aggregation method, mathematically derived from individual agent misclassification rates, proves superior to algorithmic threshold optimization. The modular architecture makes the system easily scalable, while also maintaining details of the decision processes.
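The weighted aggregation derives agent weights from measured misclassification rates; log-odds weighting is one standard way to do this, sketched below (the paper's exact derivation may differ):

```python
import math

def ensemble_verdict(votes: dict, error_rates: dict) -> int:
    """Weighted vote: agents with lower error rates get larger log-odds weights.

    votes:       agent name -> 1 (disinformation) or 0 (legitimate)
    error_rates: agent name -> measured misclassification rate in (0, 0.5)
    """
    score = 0.0
    for agent, vote in votes.items():
        e = error_rates[agent]
        w = math.log((1 - e) / e)        # more reliable agents weigh more
        score += w if vote == 1 else -w
    return 1 if score > 0 else 0

votes = {"ml": 1, "wiki": 0, "coherence": 1, "web": 1}
errors = {"ml": 0.12, "wiki": 0.20, "coherence": 0.25, "web": 0.15}
print(ensemble_verdict(votes, errors))   # 1 (flagged as disinformation)
```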

[284] Agentic Design Review System

Sayan Nag, K J Joseph, Koustava Goswami, Vlad I Morariu, Balaji Vasan Srinivasan

Main category: cs.AI

TL;DR: AgenticDRS is a system where multiple agents collaborate to evaluate graphic designs, using graph matching and prompt expansion for design awareness, outperforming baselines on the DRS-BENCH benchmark.

DetailsMotivation: The need for a holistic evaluation of graphic designs by aggregating expert feedback motivates the development of AgenticDRS.

Method: Proposes AgenticDRS with multi-agent collaboration, graph matching for exemplar selection, and prompt expansion for design awareness.

Result: AgenticDRS outperforms state-of-the-art baselines on DRS-BENCH, validated by ablation experiments.

Conclusion: AgenticDRS effectively evaluates designs and generates actionable feedback, highlighting an under-explored research direction.

Abstract: Evaluating graphic designs involves assessing them from multiple facets such as alignment, composition, aesthetics, and color choices. Evaluating designs holistically involves aggregating feedback from individual expert reviewers. Towards this, we propose an Agentic Design Review System (AgenticDRS), where multiple agents collaboratively analyze a design, orchestrated by a meta-agent. A novel in-context exemplar selection approach based on graph matching and a unique prompt expansion method play a central role in making each agent design-aware. To evaluate this framework, we propose the DRS-BENCH benchmark. Thorough experimental evaluation against state-of-the-art baselines adapted to the problem setup, backed up by critical ablation experiments, brings out the efficacy of AgenticDRS in evaluating graphic designs and generating actionable feedback. We hope that this work will attract attention to this pragmatic yet under-explored research direction.

[285] Modeling Human Responses to Multimodal AI Content

Zhiqi Shen, Shaojing Fan, Danni Xu, Terence Sim, Mohan Kankanhalli

Main category: cs.AI

TL;DR: The paper introduces the MhAIM Dataset and T-Lens, an LLM-based system, to study human responses to AI-generated content, proposing new metrics and strategies to mitigate misinformation risks.

DetailsMotivation: To understand how AI-generated content influences human perception and behavior, especially in high-stakes domains like trading, where human reaction is critical.

Method: The authors created the MhAIM Dataset (154,552 posts, 111,153 AI-generated) and conducted a human study. They introduced metrics (trustworthiness, impact, openness) and developed T-Lens, an LLM-based system with HR-MCP for human-aware responses.

Result: People identify AI content better with multimodal posts (text + visuals). T-Lens, leveraging HR-MCP, improves interpretability and interaction by predicting human responses.

Conclusion: The work offers empirical insights and tools to enhance LLMs’ human-awareness, suggesting strategies to counter AI-driven misinformation risks.

Abstract: As AI-generated content becomes widespread, so does the risk of misinformation. While prior research has primarily focused on identifying whether content is authentic, much less is known about how such content influences human perception and behavior. In domains like trading or the stock market, predicting how people react (e.g., whether a news post will go viral), can be more critical than verifying its factual accuracy. To address this, we take a human-centered approach and introduce the MhAIM Dataset, which contains 154,552 online posts (111,153 of them AI-generated), enabling large-scale analysis of how people respond to AI-generated content. Our human study reveals that people are better at identifying AI content when posts include both text and visuals, particularly when inconsistencies exist between the two. We propose three new metrics: trustworthiness, impact, and openness, to quantify how users judge and engage with online content. We present T-Lens, an LLM-based agent system designed to answer user queries by incorporating predicted human responses to multimodal information. At its core is HR-MCP (Human Response Model Context Protocol), built on the standardized Model Context Protocol (MCP), enabling seamless integration with any LLM. This integration allows T-Lens to better align with human reactions, enhancing both interpretability and interaction capabilities. Our work provides empirical insights and practical tools to equip LLMs with human-awareness capabilities. By highlighting the complex interplay among AI, human cognition, and information reception, our findings suggest actionable strategies for mitigating the risks of AI-driven misinformation.
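
Because HR-MCP builds on the standardized Model Context Protocol, any MCP-capable LLM client could call a human-response predictor as a tool. A hedged sketch using the official Python MCP SDK follows; the tool name, input fields, and the trivial scoring stub are our illustrations, not the paper's implementation.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("human-response-model")

@mcp.tool()
def predict_human_response(text: str, has_image: bool) -> dict:
    """Return hypothetical trustworthiness / impact / openness scores
    (the three metrics named in the abstract) for a multimodal post."""
    base = 0.5 if has_image else 0.4  # stub: the paper uses a learned model
    return {"trustworthiness": base, "impact": 0.5, "openness": 0.5}

if __name__ == "__main__":
    mcp.run()  # serve over stdio so an LLM client can invoke the tool
```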

[286] Agentic AI Frameworks: Architectures, Protocols, and Design Challenges

Hana Derouiche, Zaki Brahmi, Haithem Mazeni

Main category: cs.AI

TL;DR: A systematic review of Agentic AI frameworks, analyzing their architectures, communication protocols, and challenges, with proposed future research directions.

DetailsMotivation: To evaluate and compare leading Agentic AI frameworks, identify their limitations, and address emerging challenges in autonomous AI systems.

Method: Comparative analysis of frameworks like CrewAI, LangGraph, and others, along with in-depth study of communication protocols such as CNP and A2A.

Result: Established a foundational taxonomy for Agentic AI systems and highlighted key limitations and trends.

Conclusion: The paper provides a comprehensive reference for advancing scalable, robust, and interoperable autonomous AI systems.

Abstract: The emergence of Large Language Models (LLMs) has ushered in a transformative paradigm in artificial intelligence, Agentic AI, where intelligent agents exhibit goal-directed autonomy, contextual reasoning, and dynamic multi-agent coordination. This paper provides a systematic review and comparative analysis of leading Agentic AI frameworks, including CrewAI, LangGraph, AutoGen, Semantic Kernel, Agno, Google ADK, and MetaGPT, evaluating their architectural principles, communication mechanisms, memory management, safety guardrails, and alignment with service-oriented computing paradigms. Furthermore, we identify key limitations, emerging trends, and open challenges in the field. To address the issue of agent communication, we conduct an in-depth analysis of protocols such as the Contract Net Protocol (CNP), Agent-to-Agent (A2A), Agent Network Protocol (ANP), and Agora. Our findings not only establish a foundational taxonomy for Agentic AI systems but also propose future research directions to enhance scalability, robustness, and interoperability. This work serves as a comprehensive reference for researchers and practitioners working to advance the next generation of autonomous AI systems.

[287] Improving and Evaluating Open Deep Research Agents

Doaa Allabadi, Kyle Bradbury, Jordan M. Malof

Main category: cs.AI

TL;DR: The paper evaluates Deep Research Agents (DRAs) using the BrowseComp benchmark, introduces a smaller subset (BC-Small), and improves an open-source DRA (ODR) to achieve a 10% success rate.

DetailsMotivation: To address the lack of open-source DRAs and compare them to proprietary systems, the authors adapt the BrowseComp benchmark for academic use.

Method: The authors benchmark ODR and two proprietary systems on BC-Small, then enhance ODR with three strategic improvements to create ODR+.

Result: ODR+ achieves a 10% success rate on BC-Small, outperforming proprietary systems (0% accuracy).

Conclusion: The improvements to ODR demonstrate the potential of open-source DRAs, though challenges remain in achieving higher accuracy.

Abstract: We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user, and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we found only one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.

[288] Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

Bin Hong, Jiayu Liu, Zhenya Huang, Kai Zhang, Mengdi Zhang

Main category: cs.AI

TL;DR: The paper proposes Length Controlled Preference Optimization (LCPO) to reduce the output length of Large Reasoning Models (LRMs) by 50% while maintaining reasoning performance, addressing efficiency challenges.

DetailsMotivation: Current LRMs generate lengthy outputs, increasing computational costs and risking overthinking, without efficient methods to balance reasoning quality and efficiency.

Method: Analyzes generation path distributions, filters trajectories via difficulty estimation, and proposes LCPO under a Bradley-Terry loss framework to balance length and performance.

Result: LCPO reduces average output length by over 50% across benchmarks without compromising reasoning performance.

Conclusion: LCPO demonstrates the feasibility of computationally efficient approaches for guiding LRMs toward efficient reasoning.

Abstract: Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current methods for efficient reasoning often compromise reasoning quality or require extensive resources. This paper investigates efficient methods to reduce the generation length of LRMs. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence behaviors of the objectives of various preference optimization methods under a Bradley-Terry loss based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss. LCPO can effectively learn length preference with limited data and training. Extensive experiments demonstrate that our approach significantly reduces the average output length by over 50% across multiple benchmarks while maintaining the reasoning performance. Our work highlights the potential for computationally efficient approaches in guiding LRMs toward efficient reasoning.
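
The abstract ties LCPO to a Bradley-Terry loss over implicit rewards. A minimal sketch of that idea follows, assuming DPO-style implicit rewards (beta times the policy-minus-reference log-probability) plus a hypothetical length-difference margin; the weight alpha and the margin form are ours, not the paper's derivation.

```python
import torch
import torch.nn.functional as F

def lcpo_like_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   len_w, len_l, beta=0.1, alpha=0.01):
    """Bradley-Terry preference loss with an explicit length margin (sketch)."""
    r_w = beta * (logp_w - ref_logp_w)   # implicit reward, preferred response
    r_l = beta * (logp_l - ref_logp_l)   # implicit reward, dispreferred one
    margin = alpha * (len_l - len_w)     # favor the shorter of the two
    return -F.logsigmoid(r_w - r_l + margin).mean()

# Toy usage with scalar tensors standing in for summed token log-probs.
loss = lcpo_like_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                      torch.tensor([-11.0]), torch.tensor([-11.5]),
                      len_w=120, len_l=480)
print(loss.item())
```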

[289] KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems

Stepan Kulibaba, Artem Dzhalilov, Roman Pakhomov, Oleg Svidchenko, Alexander Gasnikov, Aleksei Shpilman

Main category: cs.AI

TL;DR: KompeteAI is a novel AutoML framework addressing exploration and execution bottlenecks in LLM-based AutoML systems by introducing dynamic solution space exploration, merging top candidates, and integrating RAG for diverse ideas. It accelerates pipeline evaluation and outperforms leading methods.

DetailsMotivation: Existing LLM-based AutoML systems face limitations like constrained exploration strategies and execution bottlenecks, hindering iterative refinement and diversity in solutions.

Method: KompeteAI introduces dynamic exploration with a merging stage for top candidates, integrates RAG for diverse ideas, and uses predictive scoring and accelerated debugging to address execution bottlenecks.

Result: KompeteAI accelerates pipeline evaluation 6.9 times and outperforms leading methods by 3% on MLE-Bench. It also achieves state-of-the-art results on Kompete-bench.

Conclusion: KompeteAI effectively overcomes exploration and execution challenges in AutoML, demonstrating superior performance and efficiency.

Abstract: Recent Large Language Model (LLM)-based AutoML systems demonstrate impressive capabilities but face significant limitations such as constrained exploration strategies and a severe execution bottleneck. Exploration is hindered by one-shot methods lacking diversity and Monte Carlo Tree Search (MCTS) approaches that fail to recombine strong partial solutions. The execution bottleneck arises from lengthy code validation cycles that stifle iterative refinement. To overcome these challenges, we introduce KompeteAI, a novel AutoML framework with dynamic solution space exploration. Unlike previous MCTS methods that treat ideas in isolation, KompeteAI introduces a merging stage that composes top candidates. We further expand the hypothesis space by integrating Retrieval-Augmented Generation (RAG), sourcing ideas from Kaggle notebooks and arXiv papers to incorporate real-world strategies. KompeteAI also addresses the execution bottleneck via a predictive scoring model and an accelerated debugging method, assessing solution potential using early-stage metrics to avoid costly full-code execution. This approach accelerates pipeline evaluation 6.9 times. KompeteAI outperforms leading methods (e.g., RD-agent, AIDE, and Ml-Master) by an average of 3% on the primary AutoML benchmark, MLE-Bench. Additionally, we propose Kompete-bench to address limitations in MLE-Bench, where KompeteAI also achieves state-of-the-art results.
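
A minimal sketch of the predictive-scoring idea as we read it: assess a candidate pipeline from cheap early-stage signals and skip the costly full-code execution when the predicted potential is low. The feature names, threshold, and the toy scorer are placeholders, not the paper's learned model.

```python
def worth_full_execution(early_metrics, scorer, threshold=0.6):
    """Decide whether a candidate pipeline merits full execution based on
    early-stage metrics (a sketch of KompeteAI's predictive scoring)."""
    features = [early_metrics["partial_cv_score"],   # cheap partial CV score
                early_metrics["train_time_sec"],     # observed early runtime
                early_metrics["lint_errors"]]        # static-analysis signal
    return scorer(features) >= threshold

# Toy scorer standing in for the learned scoring model.
scorer = lambda f: 0.8 if f[0] > 0.7 and f[2] == 0 else 0.3
print(worth_full_execution({"partial_cv_score": 0.75,
                            "train_time_sec": 42,
                            "lint_errors": 0}, scorer))  # True
```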

[290] Extending the Entropic Potential of Events for Uncertainty Quantification and Decision-Making in Artificial Intelligence

Mark Zilberman

Main category: cs.AI

TL;DR: The paper introduces the entropic potential of events, a concept from physics adapted for AI, to improve uncertainty quantification, decision-making, and interpretability. It formalizes definitions for both original and AI-adjusted entropic potential, with applications in policy evaluation, reward design, explainable AI, and anomaly detection.

DetailsMotivation: To enhance AI's uncertainty modeling by leveraging the entropic potential of events, bridging thermodynamics, information theory, and machine learning.

Method: Adapts the entropic potential framework for AI, introducing an event-centric measure and conditional expectations for counterfactual scenarios.

Result: Demonstrates the framework’s versatility in reinforcement learning, Bayesian inference, and anomaly detection, with practical computational considerations.

Conclusion: The entropic potential framework provides a theoretically grounded, interpretable, and unified approach to managing uncertainty in AI.

Abstract: This work demonstrates how the concept of the entropic potential of events – a parameter quantifying the influence of discrete events on the expected future entropy of a system – can enhance uncertainty quantification, decision-making, and interpretability in artificial intelligence (AI). Building on its original formulation in physics, the framework is adapted for AI by introducing an event-centric measure that captures how actions, observations, or other discrete occurrences impact uncertainty at future time horizons. Both the original and AI-adjusted definitions of entropic potential are formalized, with the latter emphasizing conditional expectations to account for counterfactual scenarios. Applications are explored in policy evaluation, intrinsic reward design, explainable AI, and anomaly detection, highlighting the metric’s potential to unify and strengthen uncertainty modeling in intelligent systems. Conceptual examples illustrate its use in reinforcement learning, Bayesian inference, and anomaly detection, while practical considerations for computation in complex AI models are discussed. The entropic potential framework offers a theoretically grounded, interpretable, and versatile approach to managing uncertainty in AI, bridging principles from thermodynamics, information theory, and machine learning.
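
As a rough illustration of the event-centric measure described above, one plausible formalization (ours, not quoted from the paper) contrasts the expected future entropy of the system state with and without the event:

```latex
% Plausible formalization (ours): the entropic potential of event e at
% horizon T compares expected future entropy of system state S given the
% event against the counterfactual where it does not occur.
\Phi_T(e) = \mathbb{E}\left[ H(S_{t+T}) \mid e \right]
          - \mathbb{E}\left[ H(S_{t+T}) \mid \neg e \right]
```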

[291] Why Cannot Large Language Models Ever Make True Correct Reasoning?

Jingde Cheng

Main category: cs.AI

TL;DR: The paper argues that the perceived ‘understanding’ and ‘reasoning’ abilities of LLMs like ChatGPT are illusions due to their inherent limitations, not true capabilities.

DetailsMotivation: To clarify misconceptions about LLMs' abilities and highlight their fundamental limitations in achieving true reasoning.

Method: The author critiques the working principles of LLMs, emphasizing their inability to achieve genuine understanding or reasoning.

Result: LLMs lack true reasoning ability due to their design constraints.

Conclusion: The paper concludes that LLMs cannot possess true reasoning or understanding, as their mechanisms are inherently limited.

Abstract: Recently, with the application progress of AIGC tools based on large language models (LLMs), led by ChatGPT, many AI experts and even more non-professionals are trumpeting the “understanding ability” and “reasoning ability” of LLMs. The present author considers that the so-called “understanding ability” and “reasoning ability” of LLMs are just illusions held by people with vague concepts. In fact, LLMs can never have true understanding ability or true reasoning ability. This paper intends to explain that, because of the essential limitations of their working principle, LLMs can never have the ability of true correct reasoning.

[292] Promoting Efficient Reasoning with Verifiable Stepwise Reward

Chuhuai Yue, Chengqi Dong, Yinan Gao, Hang He, Jiajun Chai, Guojun Yin, Wei Lin

Main category: cs.AI

TL;DR: The paper addresses overthinking in large reasoning models (LRMs) by proposing a verifiable stepwise reward mechanism (VSRM) to balance efficiency and accuracy.

DetailsMotivation: LRMs suffer from inefficiency due to overthinking, where they expend excessive computation on simple problems. Existing methods lack flexibility and reliability.

Method: The authors introduce VSRM, a rule-based reward mechanism that evaluates intermediate reasoning steps, integrated with PPO and Reinforce++.

Result: Experiments on AIME24 and AIME25 benchmarks show reduced output length without compromising performance, effectively suppressing overthinking.

Conclusion: VSRM successfully mitigates overthinking by penalizing ineffective steps and rewarding effective ones, improving LRM efficiency.

Abstract: Large reasoning models (LRMs) have recently achieved significant progress in complex reasoning tasks, aided by reinforcement learning with verifiable rewards. However, LRMs often suffer from overthinking, expending excessive computation on simple problems and reducing efficiency. Existing efficient reasoning methods typically require accurate task assessment to preset token budgets or select reasoning modes, which limits their flexibility and reliability. In this work, we revisit the essence of overthinking and identify that encouraging effective steps while penalizing ineffective ones is key to its solution. To this end, we propose a novel rule-based verifiable stepwise reward mechanism (VSRM), which assigns rewards based on the performance of intermediate states in the reasoning trajectory. This approach is intuitive and naturally fits the step-by-step nature of reasoning tasks. We conduct extensive experiments on standard mathematical reasoning benchmarks, including AIME24 and AIME25, by integrating VSRM with PPO and Reinforce++. Results show that our method achieves substantial output length reduction while maintaining original reasoning performance, striking an optimal balance between efficiency and accuracy. Further analysis of overthinking frequency and pass@k score before and after training demonstrates that our approach indeed effectively suppresses ineffective steps and encourages effective reasoning, fundamentally alleviating the overthinking problem. All code will be released upon acceptance.
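
The paper's exact reward rules are not given in the abstract; the sketch below only illustrates the general shape of a verifiable stepwise reward, rewarding the step that first makes a reasoning prefix verifiably sufficient and penalizing steps added after that point. The verify() oracle and reward values are hypothetical stand-ins for the rule-based checker.

```python
def stepwise_rewards(steps, verify, r_effective=1.0, r_overthink=-0.5):
    """Assign verifiable stepwise rewards over a reasoning trajectory."""
    rewards, solved = [], False
    for t in range(1, len(steps) + 1):
        ok = verify(steps[:t])
        if ok and not solved:
            rewards.append(r_effective)   # the step that cracks the problem
        elif solved:
            rewards.append(r_overthink)   # extra tokens after it is solved
        else:
            rewards.append(0.0)           # neutral intermediate work
        solved = solved or ok
    return rewards

# Toy run: the third of four steps is where verification first succeeds.
print(stepwise_rewards(["s1", "s2", "s3", "s4"],
                       verify=lambda p: len(p) >= 3))  # [0.0, 0.0, 1.0, -0.5]
```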

[293] A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering

Chenliang Zhang, Lin Wang, Yuanyuan Lu, Yusheng Qi, Kexin Wang, Peixu Hou, Wenshi Chen

Main category: cs.AI

TL;DR: The paper presents solutions by the Dianping-Trust-Safety team for the META CRAG-MM challenge, focusing on multi-modal, multi-turn question answering. Their approach integrates vision large language models, GPT-4.1 knowledge distillation, curriculum learning, and web search APIs, achieving top results in Tasks 1 and 3.

DetailsMotivation: The challenge required building a retrieval-augmented generation system for multi-modal, multi-turn question answering, addressing tasks like structured data retrieval, information synthesis, and context-aware conversations.

Method: For Task 1, a vision large language model was fine-tuned with GPT-4.1 knowledge and curriculum-guided reinforcement learning. Tasks 2 and 3 incorporated web search APIs for external knowledge.

Result: The team secured 1st place in Task 1 (52.38% lead) and 3rd in Task 3, showcasing the efficacy of their integrated training pipeline.

Conclusion: The integration of curriculum learning with reinforcement learning and external knowledge sources proved effective for complex multi-modal question answering.

Abstract: This paper describes the solutions of the Dianping-Trust-Safety team for the META CRAG-MM challenge. The challenge requires building a comprehensive retrieval-augmented generation system capable of multi-modal, multi-turn question answering. The competition consists of three tasks: (1) answering questions using structured data retrieved from an image-based mock knowledge graph, (2) synthesizing information from both knowledge graphs and web search results, and (3) handling multi-turn conversations that require context understanding and information aggregation from multiple sources. For Task 1, our solution is based on a vision large language model, enhanced by supervised fine-tuning with knowledge distilled from GPT-4.1. We further applied curriculum learning strategies to guide reinforcement learning, resulting in improved answer accuracy and reduced hallucination. For Task 2 and Task 3, we additionally leveraged web search APIs to incorporate external knowledge, enabling the system to better handle complex queries and multi-turn conversations. Our approach achieved 1st place in Task 1 with a significant lead of 52.38%, and 3rd place in Task 3, demonstrating the effectiveness of integrating curriculum learning with reinforcement learning in our training pipeline.

[294] Multi-Agent Trust Region Policy Optimisation: A Joint Constraint Approach

Chak Lam Shek, Guangyao Shi, Pratap Tokekar

Main category: cs.AI

TL;DR: HATRPO-W and HATRPO-G improve HATRPO by dynamically allocating KL divergence thresholds, enhancing performance and stability in MARL.

DetailsMotivation: Heterogeneous-agent settings in MARL require flexible KL threshold allocation to avoid slow or suboptimal updates.

Method: Proposes HATRPO-W (KKT-based) for global optimization and HATRPO-G (greedy) for prioritized threshold assignment.

Result: Both methods boost HATRPO performance by over 22.5%, with HATRPO-W showing lower variance.

Conclusion: Dynamic KL threshold allocation improves MARL training, with HATRPO-W offering stability and HATRPO-G flexibility.

Abstract: Multi-agent reinforcement learning (MARL) requires coordinated and stable policy updates among interacting agents. Heterogeneous-Agent Trust Region Policy Optimization (HATRPO) enforces per-agent trust region constraints using Kullback-Leibler (KL) divergence to stabilize training. However, assigning each agent the same KL threshold can lead to slow and locally optimal updates, especially in heterogeneous settings. To address this limitation, we propose two approaches for allocating the KL divergence threshold across agents: HATRPO-W, a Karush-Kuhn-Tucker-based (KKT-based) method that optimizes threshold assignment under global KL constraints, and HATRPO-G, a greedy algorithm that prioritizes agents based on improvement-to-divergence ratio. By connecting sequential policy optimization with constrained threshold scheduling, our approach enables more flexible and effective learning in heterogeneous-agent settings. Experimental results demonstrate that our methods significantly boost the performance of HATRPO, achieving faster convergence and higher final rewards across diverse MARL benchmarks. Specifically, HATRPO-W and HATRPO-G achieve comparable improvements in final performance, each exceeding 22.5%. Notably, HATRPO-W also demonstrates more stable learning dynamics, as reflected by its lower variance.
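
A minimal sketch of the greedy variant's core idea: rank agents by improvement-to-divergence ratio and spend a global KL budget in that order. Variable names and the per-agent cap at its estimated divergence are our assumptions; the paper's scheduler interacts with sequential policy updates.

```python
def greedy_kl_allocation(improvement, divergence, kl_budget):
    """Greedily assign per-agent KL thresholds (a sketch of HATRPO-G)."""
    order = sorted(improvement,
                   key=lambda a: improvement[a] / divergence[a],
                   reverse=True)
    thresholds, remaining = {}, kl_budget
    for agent in order:
        thresholds[agent] = min(divergence[agent], remaining)  # capped grant
        remaining -= thresholds[agent]
    return thresholds

# Toy run: agent a1 promises more improvement per unit of divergence.
print(greedy_kl_allocation(improvement={"a1": 0.4, "a2": 0.1},
                           divergence={"a1": 0.02, "a2": 0.02},
                           kl_budget=0.03))  # {'a1': 0.02, 'a2': 0.01}
```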

[295] What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles

Mengtao Zhou, Sifan Wu, Huan Zhang, Qi Sima, Bang Liu

Main category: cs.AI

TL;DR: The paper introduces a framework for evaluating LLMs’ imaginative reasoning using the ‘Turtle Soup’ game, including a benchmark (TurtleSoup-Bench), an agent (Mosaic-Agent), and a multi-dimensional evaluation protocol. Results show LLMs’ limitations compared to humans.

DetailsMotivation: Existing benchmarks for imaginative reasoning are static or socially focused, lacking dynamic exploration. This work aims to fill that gap.

Method: Developed TurtleSoup-Bench (800 puzzles), Mosaic-Agent, and a multi-dimensional evaluation protocol (logical consistency, detail completion, conclusion alignment).

Result: Experiments reveal LLMs’ capability limits, failure patterns, and a performance gap versus humans.

Conclusion: The study provides insights into LLMs’ imaginative reasoning and sets a foundation for future research on exploratory agent behavior.

Abstract: We investigate the capacity of Large Language Models (LLMs) for imaginative reasoning–the proactive construction, testing, and revision of hypotheses in information-sparse environments. Existing benchmarks, often static or focused on social deduction, fail to capture the dynamic, exploratory nature of this reasoning process. To address this gap, we introduce a comprehensive research framework based on the classic “Turtle Soup” game, integrating a benchmark, an agent, and an evaluation protocol. We present TurtleSoup-Bench, the first large-scale, bilingual, interactive benchmark for imaginative reasoning, comprising 800 turtle soup puzzles sourced from both the Internet and expert authors. We also propose Mosaic-Agent, a novel agent designed to assess LLMs’ performance in this setting. To evaluate reasoning quality, we develop a multi-dimensional protocol measuring logical consistency, detail completion, and conclusion alignment. Experiments with leading LLMs reveal clear capability limits, common failure patterns, and a significant performance gap compared to humans. Our work offers new insights into LLMs’ imaginative reasoning and establishes a foundation for future research on exploratory agent behavior.

[296] LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval

Yaoze Zhang, Rong Wu, Pinlong Cai, Xiaoman Wang, Guohang Yan, Song Mao, Ding Wang, Botian Shi

Main category: cs.AI

TL;DR: LeanRAG improves retrieval-augmented generation by addressing disconnected semantic islands and inefficient retrieval in hierarchical knowledge graphs, outperforming existing methods.

DetailsMotivation: Current knowledge graph-based RAG methods suffer from disconnected high-level summaries and inefficient retrieval, limiting their effectiveness.

Method: LeanRAG introduces semantic aggregation to form entity clusters and explicit relations, followed by a structure-guided retrieval strategy.

Result: LeanRAG outperforms existing methods on QA benchmarks, reducing retrieval redundancy by 46%.

Conclusion: LeanRAG enhances retrieval efficiency and response quality in RAG systems by leveraging structured knowledge aggregation and retrieval.

Abstract: Retrieval-Augmented Generation (RAG) plays a crucial role in grounding Large Language Models by leveraging external knowledge, though its effectiveness is often compromised by the retrieval of contextually flawed or incomplete information. To address this, knowledge graph-based RAG methods have evolved towards hierarchical structures, organizing knowledge into multi-level summaries. However, these approaches still suffer from two critical, unaddressed challenges: high-level conceptual summaries exist as disconnected “semantic islands”, lacking the explicit relations needed for cross-community reasoning; and the retrieval process itself remains structurally unaware, often degenerating into an inefficient flat search that fails to exploit the graph’s rich topology. To overcome these limitations, we introduce LeanRAG, a framework that features a deeply collaborative design combining knowledge aggregation and retrieval strategies. LeanRAG first employs a novel semantic aggregation algorithm that forms entity clusters and constructs new explicit relations among aggregation-level summaries, creating a fully navigable semantic network. Then, a bottom-up, structure-guided retrieval strategy anchors queries to the most relevant fine-grained entities and then systematically traverses the graph’s semantic pathways to gather concise yet contextually comprehensive evidence sets. LeanRAG mitigates the substantial overhead associated with path retrieval on graphs and minimizes redundant information retrieval. Extensive experiments on four challenging QA benchmarks across different domains demonstrate that LeanRAG significantly outperforms existing methods in response quality while reducing retrieval redundancy by 46%. Code is available at: https://github.com/RaZzzyz/LeanRAG
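
A minimal sketch of what bottom-up, structure-guided retrieval could look like, assuming a networkx graph whose entity nodes link to aggregation-level summaries via hypothetical 'summarized_by' edges; the edge label, hop limit, and example nodes are ours, not LeanRAG's actual schema.

```python
import networkx as nx

def bottom_up_retrieve(kg: nx.DiGraph, anchors, max_hops=2):
    """Anchor a query at fine-grained entity nodes, then walk upward along
    summary edges to collect evidence instead of flat-searching the graph."""
    evidence, frontier = set(anchors), set(anchors)
    for _ in range(max_hops):
        nxt = set()
        for node in frontier:
            for _, parent, data in kg.out_edges(node, data=True):
                if data.get("rel") == "summarized_by" and parent not in evidence:
                    evidence.add(parent)
                    nxt.add(parent)
        frontier = nxt
    return evidence

kg = nx.DiGraph()
kg.add_edge("aspirin", "NSAIDs", rel="summarized_by")
kg.add_edge("NSAIDs", "analgesics", rel="summarized_by")
print(bottom_up_retrieve(kg, anchors={"aspirin"}))
```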

[297] HiRef: Leveraging Hierarchical Ontology and Network Refinement for Robust Medication Recommendation

Yan Ting Chok, Soyon Park, Seungheun Baek, Hajung Kim, Junhyun Lee, Jaewoo Kang

Main category: cs.AI

TL;DR: HiRef is a framework combining hierarchical medical ontologies and refined EHR co-occurrence patterns to improve medication recommendation robustness, especially for unseen medical codes.

DetailsMotivation: Address challenges in EHR data like rare medical entities and incomplete records, which hinder generalization of data-driven models.

Method: Uses hyperbolic embeddings for ontology entities and a prior-guided sparse regularization to refine EHR co-occurrence graphs.

Result: Achieves strong performance on MIMIC-III and MIMIC-IV, with high accuracy in unseen-code settings.

Conclusion: HiRef enhances generalizability and robustness in medication recommendation by leveraging ontology semantics and refined co-occurrence patterns.

Abstract: Medication recommendation is a crucial task for assisting physicians in making timely decisions from longitudinal patient medical records. However, real-world EHR data present significant challenges due to the presence of rarely observed medical entities and incomplete records that may not fully capture the clinical ground truth. While data-driven models trained on longitudinal Electronic Health Records often achieve strong empirical performance, they struggle to generalize under missing or novel conditions, largely due to their reliance on observed co-occurrence patterns. To address these issues, we propose Hierarchical Ontology and Network Refinement for Robust Medication Recommendation (HiRef), a unified framework that combines two complementary structures: (i) the hierarchical semantics encoded in curated medical ontologies, and (ii) refined co-occurrence patterns derived from real-world EHRs. We embed ontology entities in hyperbolic space, which naturally captures tree-like relationships and enables knowledge transfer through shared ancestors, thereby improving generalizability to unseen codes. To further improve robustness, we introduce a prior-guided sparse regularization scheme that refines the EHR co-occurrence graph by suppressing spurious edges while preserving clinically meaningful associations. Our model achieves strong performance on EHR benchmarks (MIMIC-III and MIMIC-IV) and maintains high accuracy under simulated unseen-code settings. Extensive experiments with comprehensive ablation studies demonstrate HiRef’s resilience to unseen medical codes, supported by in-depth analyses of the learned sparsified graph structure and medical code embeddings.
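
A minimal sketch of one way a prior-guided sparse regularizer could look, penalizing co-occurrence edges more heavily when a clinical prior deems them implausible; all names, the sigmoid parameterization, and the L1 form are our assumptions about the scheme, not HiRef's actual formulation.

```python
import torch

def prior_guided_sparsity(adj_logits, prior, lam=1e-3):
    """Refine an EHR co-occurrence graph with a prior-weighted L1 penalty:
    edges the prior finds implausible (prior near 0) are pushed toward 0,
    while clinically meaningful associations stay cheap to keep."""
    weights = torch.sigmoid(adj_logits)                  # refined edge weights
    penalty = lam * ((1.0 - prior) * weights).sum()      # implausibility-scaled L1
    return weights, penalty

adj_logits = torch.randn(4, 4, requires_grad=True)       # learnable edge logits
prior = torch.rand(4, 4)                                 # plausibility in [0, 1]
w, pen = prior_guided_sparsity(adj_logits, prior)
print(w.shape, float(pen))
```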

[298] MM-Food-100K: A 100,000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance

Yi Dong, Yusuke Muraoka, Scott Shi, Yi Zhang

Main category: cs.AI

TL;DR: MM-Food-100K is a 100,000-sample multimodal food dataset with traceable provenance, derived from a larger 1.2M corpus. It supports fine-tuning vision-language models for tasks like nutrition prediction, showing consistent improvements over baselines. The dataset is partially open and partially reserved for commercial use.

DetailsMotivation: To address the need for a high-quality, traceable, and multimodal food dataset for AI applications, leveraging community sourcing and AI-assisted quality checks.

Method: The dataset was collected using the Codatta contribution model, combining community sourcing with AI-assisted quality checks. Each submission is traceable via a secure ledger. Fine-tuning of vision-language models (e.g., ChatGPT 5) was performed for validation.

Result: Fine-tuning on MM-Food-100K yielded consistent improvements over baseline models in image-based nutrition prediction tasks.

Conclusion: MM-Food-100K is a valuable resource for AI research, offering traceability and utility, with potential commercial applications and revenue sharing for contributors.

Abstract: We present MM-Food-100K, a public 100,000-sample multimodal food intelligence dataset with verifiable provenance. It is a curated open subset (approximately 10%) of an original 1.2-million-sample, quality-accepted corpus of food images annotated with a wide range of information (such as dish name and region of creation). The corpus was collected over six weeks from over 87,000 contributors using the Codatta contribution model, which combines community sourcing with configurable AI-assisted quality checks; each submission is linked to a wallet address in a secure off-chain ledger for traceability, with a full on-chain protocol on the roadmap. We describe the schema, pipeline, and QA, and validate utility by fine-tuning large vision-language models (ChatGPT 5, ChatGPT OSS, Qwen-Max) on image-based nutrition prediction. Fine-tuning yields consistent gains over out-of-box baselines across standard metrics; we report results primarily on the MM-Food-100K subset. We release MM-Food-100K for free public access and retain approximately 90% for potential commercial access with revenue sharing to contributors.

[299] We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning

Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, Jie Wang, Chong Sun, Chen Li, Honggang Zhang

Main category: cs.AI

TL;DR: We-Math 2.0 enhances MLLMs’ mathematical reasoning via a structured knowledge system, model-centric data modeling, and RL-based training, outperforming benchmarks.

DetailsMotivation: Existing MLLMs struggle with complex math reasoning due to gaps in knowledge-driven design and data space modeling.

Method: Integrates a hierarchical knowledge system, dual-dataset construction (MathBook-Standard & Pro), and a two-stage RL framework (Cold-Start Fine-tuning & Progressive Alignment RL).

Result: Competitive performance on benchmarks and strong results on MathBookEval, indicating robust generalization.

Conclusion: We-Math 2.0 effectively improves MLLMs’ math reasoning through structured knowledge and RL training.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various tasks, but still struggle with complex mathematical reasoning. Existing research primarily focuses on dataset construction and method optimization, often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. In this paper, we introduce We-Math 2.0, a unified system that integrates a structured mathematical knowledge system, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to comprehensively enhance the mathematical reasoning abilities of MLLMs. The key contributions of We-Math 2.0 are fourfold: (1) MathBook Knowledge System: We construct a five-level hierarchical system encompassing 491 knowledge points and 1,819 fundamental principles. (2) MathBook-Standard & Pro: We develop MathBook-Standard, a dataset that ensures broad conceptual coverage and flexibility through dual expansion. Additionally, we define a three-dimensional difficulty space and generate 7 progressive variants per problem to build MathBook-Pro, a challenging dataset for robust training. (3) MathBook-RL: We propose a two-stage RL framework comprising: (i) Cold-Start Fine-tuning, which aligns the model with knowledge-oriented chain-of-thought reasoning; and (ii) Progressive Alignment RL, leveraging average-reward learning and dynamic data scheduling to achieve progressive alignment across difficulty levels. (4) MathBookEval: We introduce a comprehensive benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL performs competitively with existing baselines on four widely-used benchmarks and achieves strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.

[300] FIRESPARQL: A LLM-based Framework for SPARQL Query Generation over Scholarly Knowledge Graphs

Xueli Pan, Victor de Boer, Jacco van Ossenbruggen

Main category: cs.AI

TL;DR: The paper addresses challenges in generating SPARQL queries for scholarly knowledge graphs using LLMs, proposing FIRESPARQL, a modular framework with fine-tuned LLMs, RAG, and query correction. Fine-tuning achieves the best performance.

DetailsMotivation: Question answering over Scholarly Knowledge Graphs (SKGs) is difficult due to complex content and graph structure. LLMs struggle with SPARQL query generation due to limited SKG-specific exposure.

Method: Proposes FIRESPARQL, a framework combining fine-tuned LLMs, retrieval-augmented generation (RAG), and a SPARQL query correction layer. Evaluated on SciQA Benchmark with various configurations.

Result: Fine-tuning achieves the highest performance: 0.90 ROUGE-L for query accuracy and 0.85 RelaxedEM for result accuracy.

Conclusion: FIRESPARQL effectively improves SPARQL query generation for SKGs, with fine-tuning being the most effective approach.

Abstract: Question answering over Scholarly Knowledge Graphs (SKGs) remains a challenging task due to the complexity of scholarly content and the intricate structure of these graphs. Large Language Model (LLM) approaches could be used to translate natural language questions (NLQs) into SPARQL queries; however, these LLM-based approaches struggle with SPARQL query generation due to limited exposure to SKG-specific content and the underlying schema. We identified two main types of errors in the LLM-generated SPARQL queries: (i) structural inconsistencies, such as missing or redundant triples in the queries, and (ii) semantic inaccuracies, where incorrect entities or properties appear in the queries despite a correct query structure. To address these issues, we propose FIRESPARQL, a modular framework that supports fine-tuned LLMs as a core component, with optional context provided via retrieval-augmented generation (RAG) and a SPARQL query correction layer. We evaluate the framework on the SciQA Benchmark using various configurations (zero-shot, zero-shot with RAG, one-shot, fine-tuning, and fine-tuning with RAG) and compare the performance with baseline and state-of-the-art approaches. We measure query accuracy using BLEU and ROUGE metrics, and query result accuracy using relaxed exact match (RelaxedEM), with respect to the gold standards containing the NLQs, SPARQL queries, and the results of the queries. Experimental results demonstrate that fine-tuning achieves the highest overall performance, reaching 0.90 ROUGE-L for query accuracy and 0.85 RelaxedEM for result accuracy on the test set.
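
RelaxedEM compares query results rather than query strings. A hedged sketch of one plausible reading follows, assuming the relaxation ignores row order and variable/column naming; the benchmark's exact definition may differ.

```python
def relaxed_exact_match(predicted_rows, gold_rows):
    """Compare SPARQL result sets as order-insensitive multisets of values
    (one plausible reading of RelaxedEM; returns 1.0 per matching instance)."""
    def normalize(rows):
        return sorted(tuple(sorted(str(v) for v in row)) for row in rows)
    return 1.0 if normalize(predicted_rows) == normalize(gold_rows) else 0.0

# Same bindings, different row and column order -> counted as a match.
print(relaxed_exact_match([("Q41", "0.93"), ("Q7", "0.88")],
                          [("0.88", "Q7"), ("0.93", "Q41")]))  # 1.0
```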

[301] SEQ-GPT: LLM-assisted Spatial Query via Example

Ivan Khai Ze Lim, Ningyi Liao, Yiming Yang, Gerald Wei Yong Yip, Siqiang Luo

Main category: cs.AI

TL;DR: SEQ-GPT enhances spatial searches using LLMs for natural language-based multi-location queries, improving interactivity and adaptability.

DetailsMotivation: Current spatial services lack efficiency in complex tasks like searching multiple locations simultaneously, prompting the need for a more versatile solution.

Method: Introduces SEQ-GPT, leveraging LLMs for natural language processing, interactive clarifications, and dynamic adjustments in spatial queries.

Result: SEQ-GPT demonstrates effective alignment of natural language with structured spatial data, enabling broader and more interactive spatial searches.

Conclusion: SEQ-GPT showcases the potential of LLMs in transforming spatial query systems for more intuitive and versatile user experiences.

Abstract: Contemporary spatial services such as online maps predominantly rely on user queries for location searches. However, the user experience is limited when performing complex tasks, such as searching for a group of locations simultaneously. In this study, we examine the extended scenario known as Spatial Exemplar Query (SEQ), where multiple relevant locations are jointly searched based on user-specified examples. We introduce SEQ-GPT, a spatial query system powered by Large Language Models (LLMs) towards more versatile SEQ search using natural language. The language capabilities of LLMs enable unique interactive operations in the SEQ process, including asking users to clarify query details and dynamically adjusting the search based on user feedback. We also propose a tailored LLM adaptation pipeline that aligns natural language with structured spatial data and queries through dialogue synthesis and multi-model cooperation. SEQ-GPT offers an end-to-end demonstration for broadening spatial search with realistic data and application scenarios.

[302] Reverse Physician-AI Relationship: Full-process Clinical Diagnosis Driven by a Large Language Model

Shicheng Xu, Xin Huang, Zihao Wei, Liang Pang, Huawei Shen, Xueqi Cheng

Main category: cs.AI

TL;DR: The paper proposes DxDirector-7B, an AI-driven LLM that leads full-process clinical diagnosis, reducing physician workload and improving accuracy.

DetailsMotivation: Current AI-assisted diagnosis lacks the ability to drive the entire diagnostic process from ambiguous complaints, limiting efficiency.

Method: Introduces DxDirector-7B, an LLM with deep thinking capabilities, to direct diagnosis with minimal physician input and a clear accountability framework.

Result: DxDirector-7B outperforms state-of-the-art medical and general-purpose LLMs in accuracy and workload reduction.

Conclusion: The study marks a shift from AI as an assistant to a primary director in diagnosis, offering an efficient and accurate solution.

Abstract: Full-process clinical diagnosis in the real world encompasses the entire diagnostic workflow that begins with only an ambiguous chief complaint. While artificial intelligence (AI), particularly large language models (LLMs), is transforming clinical diagnosis, its role remains largely that of an assistant to physicians. Under this AI-assisted working pattern, AI can only answer specific medical questions at certain points within the diagnostic process, and lacks the ability to drive the entire diagnostic process starting from an ambiguous complaint, which still relies heavily on human physicians. This gap limits AI’s ability to fully reduce physicians’ workload and enhance diagnostic efficiency. To address this, we propose a paradigm shift that reverses the relationship between physicians and AI: repositioning AI as the primary director, with physicians serving as its assistants. We present DxDirector-7B, an LLM endowed with advanced deep thinking capabilities, enabling it to drive the full-process diagnosis with minimal physician involvement. Furthermore, DxDirector-7B establishes a robust accountability framework for misdiagnoses, delineating responsibility between AI and human physicians. In evaluations across rare, complex, and real-world cases under the full-process diagnosis setting, DxDirector-7B not only achieves significantly superior diagnostic accuracy but also reduces physician workload substantially more than state-of-the-art medical LLMs as well as general-purpose LLMs. Fine-grained analyses across multiple clinical departments and tasks validate its efficacy, with expert evaluations indicating its potential to serve as a viable substitute for medical specialists. These findings mark a new era in which AI, traditionally a physicians’ assistant, drives the entire diagnostic process to drastically reduce physicians’ workload, indicating an efficient and accurate diagnostic solution.

[303] PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning

Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu

Main category: cs.AI

TL;DR: PASS introduces a multimodal framework for Chest X-Ray reasoning, addressing black-box reasoning, poor multimodal integration, and rigid pipelines with adaptive sampling and interpretable probabilities.

DetailsMotivation: Existing agentic systems lack trust, safety, and efficiency in healthcare tasks, especially multimodal ones like CXR reasoning.

Method: PASS uses probabilistic sampling over a multi-tool graph, task-conditioned distributions, and a three-stage training procedure for efficiency and performance.

Result: PASS outperforms baselines in accuracy, AUC, and efficiency, validated on the CAB-E benchmark.

Conclusion: PASS advances interpretable, adaptive, and multimodal medical agentic systems, balancing performance and cost.

Abstract: Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust of decision-making and pose safety risks, (ii) poor multimodal integration, which is inherently critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges in the context of Chest X-Ray (CXR) reasoning. PASS adaptively samples agentic workflows over a multi-tool graph, yielding decision paths annotated with interpretable probabilities. Given the complex CXR reasoning task with multimodal medical data, PASS leverages its learned task-conditioned distribution over the agentic supernet. Thus, it adaptively selects the most suitable tool at each supernet layer, offering probability-annotated trajectories for post-hoc audits and directly enhancing medical AI safety. PASS also continuously compresses salient findings into an evolving personalized memory, while dynamically deciding whether to deepen its reasoning path or invoke an early exit for efficiency. To optimize a Pareto frontier balancing performance and cost, we design a novel three-stage training procedure, including expert knowledge warm-up, contrastive path-ranking, and cost-aware reinforcement learning. To facilitate rigorous evaluation, we introduce CAB-E, a comprehensive benchmark for multi-step, safety-critical, free-form CXR reasoning. Experiments across various benchmarks validate that PASS significantly outperforms strong baselines in multiple metrics (e.g., accuracy, AUC, LLM-J.) while balancing computational costs, pushing a new paradigm shift towards interpretable, adaptive, and multimodal medical agentic systems.
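
A minimal sketch of sampling one probability-annotated workflow from a layered supernet, which is what makes the decision paths auditable post hoc. The tool names are hypothetical, the 'exit' entry models the early-exit decision, and in the real system the per-layer distributions are learned and task-conditioned.

```python
import math
import random

def sample_workflow(layer_tool_probs, rng=random):
    """Sample one agentic path through a layered supernet (a sketch of PASS).
    Returns the chosen path and its log-probability for post-hoc audit."""
    path, logp = [], 0.0
    for probs in layer_tool_probs:
        tools = list(probs)
        choice = rng.choices(tools, weights=[probs[t] for t in tools])[0]
        path.append(choice)
        logp += math.log(probs[choice])
        if choice == "exit":
            break  # early exit for efficiency
    return path, logp

# Hypothetical two-layer supernet for a CXR query.
layers = [{"lesion_segmenter": 0.7, "report_reader": 0.3},
          {"vqa_tool": 0.5, "exit": 0.5}]
print(sample_workflow(layers))
```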

[304] Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment

Zetian Sun, Dongfang Li, Baotian Hu

Main category: cs.AI

TL;DR: The paper explores the effectiveness of on-policy vs. static data in LM alignment, proposing a two-stage alignment process (preference injection and fine-tuning) and validating it across models and methods.

DetailsMotivation: To address the inconsistent effectiveness of on-policy data in LM alignment and improve alignment methods by understanding the distinct stages of the process.

Method: Proposes the alignment stage assumption, divides alignment into preference injection (diverse data) and fine-tuning (high-quality data), and develops an algorithm to identify stage boundaries.

Result: Shows varying effectiveness of on-policy data (e.g., 3x for Llama-3, 0.4x for Zephyr) and validates the two-stage assumption across 5 models and 2 alignment methods.

Conclusion: The alignment stage assumption generalizes well, providing a framework to optimize LM alignment by adapting data usage to distinct stages.

Abstract: The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences. Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment method that directly optimizes the policy from static preference data, and it was further improved by incorporating on-policy sampling (i.e., preference candidates generated during the training loop) for better LM alignment. However, we show on-policy data is not always optimal, with systematic effectiveness differences emerging between static and on-policy preference candidates. For example, on-policy data can be 3× as effective as static data for Llama-3, but only 0.4× as effective for Zephyr. To explain the phenomenon, we propose the alignment stage assumption, which divides the alignment process into two distinct stages: the preference injection stage, which benefits from diverse data, and the preference fine-tuning stage, which favors high-quality data. Through theoretical and empirical analysis, we characterize these stages and propose an effective algorithm to identify the boundaries between them. We perform experiments on 5 models (Llama, Zephyr, Phi-2, Qwen, Pythia) and 2 alignment methods (DPO, SLiC-HF) to show the generalizability of the alignment stage assumption and boundary measurement.

[305] Improving Value-based Process Verifier via Low-Cost Variance Reduction

Zetian Sun, Dongfang Li, Baotian Hu, Min Zhang

Main category: cs.AI

TL;DR: The paper introduces ComMCS, a method to reduce variance in value-based process verifiers for LLMs, improving reasoning accuracy without extra computational cost.

DetailsMotivation: Addressing the high variance in Monte Carlo estimators used for training value-based process verifiers in LLMs, which hinders reasoning performance.

Method: Proposes ComMCS, a compound Monte Carlo sampling method that combines estimators from current and subsequent steps to reduce variance while remaining unbiased.

Result: ComMCS outperforms baselines by 2.8 and 2.2 points on MATH-500 and GSM8K benchmarks, respectively.

Conclusion: ComMCS effectively reduces variance in estimators, enhancing LLM reasoning without additional inference cost.

Abstract: Large language models (LLMs) have achieved remarkable success in a wide range of tasks. However, their reasoning capabilities, particularly in complex domains like mathematics, remain a significant challenge. Value-based process verifiers, which estimate the probability of a partial reasoning chain leading to a correct solution, are a promising approach for improving reasoning. Nevertheless, their effectiveness is often hindered by estimation error in their training annotations, a consequence of the limited number of Monte Carlo (MC) samples feasible due to the high cost of LLM inference. In this paper, we identify that the estimation error primarily arises from high variance rather than bias, and that the MC estimator is a Minimum Variance Unbiased Estimator (MVUE). To address the problem, we propose the Compound Monte Carlo Sampling (ComMCS) method, which constructs an unbiased estimator by linearly combining the MC estimators from the current and subsequent steps. Theoretically, we show that our method leads to a predictable reduction in variance, while maintaining an unbiased estimation without additional LLM inference cost. We also perform empirical experiments on the MATH-500 and GSM8K benchmarks to demonstrate the effectiveness of our method. Notably, ComMCS outperforms the regression-based optimization method by 2.8 points and the non-variance-reduced baseline by 2.2 points on MATH-500 in the Best-of-32 sampling experiment.
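
A minimal sketch of the compound-estimator idea: linearly mix the raw MC estimate at step t with the estimate from step t+1. Per the abstract, the linear combination stays unbiased while reducing variance; the fixed mixing weight used here is a placeholder for the weights the paper derives.

```python
import numpy as np

def commcs_estimate(mc_values, lam=0.5):
    """Compound Monte Carlo value estimates (a sketch of the ComMCS idea).
    mc_values[t] is the raw MC success-rate estimate for the prefix ending
    at step t; each step is mixed with its successor's estimate."""
    v = np.asarray(mc_values, dtype=float)
    out = v.copy()
    out[:-1] = (1.0 - lam) * v[:-1] + lam * v[1:]  # mix step t with step t+1
    return out

print(commcs_estimate([0.2, 0.4, 0.5, 1.0]))  # [0.3, 0.45, 0.75, 1.0]
```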

[306] MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models

Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu

Main category: cs.AI

TL;DR: MSRS is a novel framework for multi-attribute steering in LLMs, reducing interference via orthogonal subspaces and dynamic token-level control.

DetailsMotivation: Existing methods struggle with multi-attribute steering due to interference and trade-offs.

Method: MSRS allocates orthogonal subspaces for attributes, combines shared and attribute-specific subspaces, and uses dynamic weighting and token-level steering.

Result: MSRS reduces attribute conflicts, outperforms existing methods, and generalizes well to downstream tasks.

Conclusion: MSRS provides effective multi-attribute steering with minimal interference and precise control.

Abstract: Activation steering offers a promising approach to controlling the behavior of Large Language Models by directly manipulating their internal activations. However, most existing methods struggle to jointly steer multiple attributes, often resulting in interference and undesirable trade-offs. To address this challenge, we propose Multi-Subspace Representation Steering (MSRS), a novel framework for effective multi-attribute steering via subspace representation fine-tuning. MSRS reduces inter-attribute interference by allocating orthogonal subspaces to each attribute, isolating their influence within the model’s representation space. MSRS also incorporates a hybrid subspace composition strategy: it combines attribute-specific subspaces for unique steering directions with a shared subspace for common steering directions. A dynamic weighting function learns to efficiently integrate these components for precise control. During inference, MSRS introduces a token-level steering mechanism that dynamically identifies and intervenes on the most semantically relevant tokens, enabling fine-grained behavioral modulation. Experimental results show that MSRS significantly reduces attribute conflicts, surpasses existing methods across a range of attributes, and generalizes effectively to diverse downstream tasks.
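
A minimal sketch of the orthogonal-subspace intuition, using a single direction per attribute; MSRS learns full per-attribute subspaces plus a shared subspace with dynamic, token-level weighting, so this only illustrates why orthogonality prevents interference.

```python
import torch

def orthonormal_attribute_basis(directions):
    """QR-orthonormalize one steering direction per attribute so that
    steering one attribute cannot bleed into another's direction."""
    Q, _ = torch.linalg.qr(torch.stack(directions, dim=1))  # shape (d, k)
    return Q

def steer(hidden, Q, strengths):
    """Add each attribute's component along its orthonormal direction,
    scaled by a per-attribute strength (dynamically weighted in MSRS)."""
    return hidden + (Q * torch.tensor(strengths)).sum(dim=1)

d = 16
Q = orthonormal_attribute_basis([torch.randn(d), torch.randn(d)])
steered = steer(torch.randn(d), Q, strengths=[0.8, -0.3])
print(steered.shape)  # torch.Size([16])
```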

[307] STEP: Stepwise Curriculum Learning for Context-Knowledge Fusion in Conversational Recommendation

Zhenye Yang, Jinpeng Chen, Huan Li, Xiongnan Jin, Xuanyang Li, Junwei Zhang, Hongbo Gao, Kaimin Wei, Senzhang Wang

Main category: cs.AI

TL;DR: STEP is a conversational recommender system using pre-trained models and curriculum-guided fusion to improve recommendation precision and dialogue quality.

DetailsMotivation: Existing CRSs struggle with deep semantics and integrating external knowledge graphs effectively.

Method: STEP uses a three-stage curriculum for context-KG alignment and dual-prompt tuning for dialogue and recommendation tasks.

Result: Outperforms mainstream methods in recommendation precision and dialogue quality on public datasets.

Conclusion: STEP effectively integrates KG information and improves CRS performance through adaptive prompt tuning.

Abstract: Conversational recommender systems (CRSs) aim to proactively capture user preferences through natural language dialogue and recommend high-quality items. To achieve this, a CRS gathers user preferences via a dialog module and builds user profiles through a recommendation module to generate appropriate recommendations. However, existing CRSs face challenges in capturing the deep semantics of user preferences and dialogue context. In particular, the efficient integration of external knowledge graph (KG) information into dialogue generation and recommendation remains a pressing issue. Traditional approaches typically combine KG information directly with dialogue content, which often struggles with complex semantic relationships, resulting in recommendations that may not align with user expectations. To address these challenges, we introduce STEP, a conversational recommender centered on pre-trained language models that combines curriculum-guided context-knowledge fusion with lightweight task-specific prompt tuning. At its heart, an F-Former progressively aligns the dialogue context with knowledge-graph entities through a three-stage curriculum, thus resolving fine-grained semantic mismatches. The fused representation is then injected into the frozen language model via two minimal yet adaptive prefix prompts: a conversation prefix that steers response generation toward user intent and a recommendation prefix that biases item ranking toward knowledge-consistent candidates. This dual-prompt scheme allows the model to share cross-task semantics while respecting the distinct objectives of dialogue and recommendation. Experimental results show that STEP outperforms mainstream methods in recommendation precision and dialogue quality on two public datasets.

[308] GenOM: Ontology Matching with Description Generation and Large Language Model

Yiping Song, Jiaoyan Chen, Renate A. Schmidt

Main category: cs.AI

TL;DR: GenOM is an LLM-based ontology alignment framework that enhances semantic representations, retrieves alignment candidates, and improves precision, outperforming traditional and recent methods in experiments.

DetailsMotivation: Ontology matching is crucial for semantic interoperability, especially in the biomedical domain with complex concepts.

Method: GenOM uses LLMs to generate textual definitions, retrieves candidates with embeddings, and incorporates exact matching tools.

Result: GenOM achieves competitive performance on the OAEI Bio-ML track, surpassing baselines and confirming robustness via ablation studies.

Conclusion: The framework’s semantic enrichment and few-shot prompting enhance its effectiveness and adaptability in ontology alignment.

Abstract: Ontology matching (OM) plays an essential role in enabling semantic interoperability and integration across heterogeneous knowledge sources, particularly in the biomedical domain which contains numerous complex concepts related to diseases and pharmaceuticals. This paper introduces GenOM, a large language model (LLM)-based ontology alignment framework, which enriches the semantic representations of ontology concepts via generating textual definitions, retrieves alignment candidates with an embedding model, and incorporates exact matching-based tools to improve precision. Extensive experiments conducted on the OAEI Bio-ML track demonstrate that GenOM can often achieve competitive performance, surpassing many baselines including traditional OM systems and recent LLM-based methods. Further ablation studies confirm the effectiveness of semantic enrichment and few-shot prompting, highlighting the framework’s robustness and adaptability.
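
A minimal sketch of the candidate-retrieval step as we read it: embed the LLM-generated textual definitions of source and target concepts and take the nearest targets as alignment candidates. The embedding model named here is our placeholder; the paper does not commit to it.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder embedding model; GenOM's choice may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

def alignment_candidates(src_definitions, tgt_definitions, k=5):
    """Retrieve top-k target concepts per source concept by cosine
    similarity between their LLM-generated textual definitions."""
    S = model.encode(src_definitions, normalize_embeddings=True)
    T = model.encode(tgt_definitions, normalize_embeddings=True)
    sims = S @ T.T                      # cosine similarity (vectors normalized)
    return np.argsort(-sims, axis=1)[:, :k]
```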

[309] Scaling Up without Fading Out: Goal-Aware Sparse GNN for RL-based Generalized Planning

Sangwoo Jeon, Juchul Shin, Gyeong-Tae Kim, YeonJe Cho, Seongwoo Kim

Main category: cs.AI

TL;DR: A sparse, goal-aware GNN representation is proposed to address scalability and efficiency issues in generalized planning, improving performance in large grid-based environments.

DetailsMotivation: Existing dense graph representations in GNN-based planning lead to combinatorial explosion and sparsity, hindering scalability and learning feasibility.

Method: A sparse, goal-aware GNN selectively encodes local relationships and integrates spatial goal features, validated in drone mission scenarios.

Result: The method scales to larger grid sizes, improves policy generalization, and increases success rates compared to dense representations.

Conclusion: The approach provides a practical solution for large-scale generalized planning tasks, enhancing feasibility and performance.

Abstract: Generalized planning using deep reinforcement learning (RL) combined with graph neural networks (GNNs) has shown promising results in various symbolic planning domains described by PDDL. However, existing approaches typically represent planning states as fully connected graphs, leading to a combinatorial explosion in edge information and substantial sparsity as problem scales grow, especially evident in large grid-based environments. This dense representation results in diluted node-level information, exponentially increases memory requirements, and ultimately makes learning infeasible for larger-scale problems. To address these challenges, we propose a sparse, goal-aware GNN representation that selectively encodes relevant local relationships and explicitly integrates spatial features related to the goal. We validate our approach by designing novel drone mission scenarios based on PDDL within a grid world, effectively simulating realistic mission execution environments. Our experimental results demonstrate that our method scales effectively to larger grid sizes previously infeasible with dense graph representations and substantially improves policy generalization and success rates. Our findings provide a practical foundation for addressing realistic, large-scale generalized planning tasks.
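
To make the density argument concrete, the sketch below builds the kind of sparse grid encoding the abstract argues for: 4-neighbor edges instead of a fully connected graph, plus goal-relative offsets as spatial node features. The exact feature set is an assumption; the paper's representation is richer.

```python
import numpy as np

def sparse_grid_graph(h: int, w: int, goal: tuple[int, int]):
    """4-neighbor edges plus goal-relative offsets as node features."""
    idx = lambda r, c: r * w + c
    edges = []
    for r in range(h):
        for c in range(w):
            if r + 1 < h:
                edges.append((idx(r, c), idx(r + 1, c)))
            if c + 1 < w:
                edges.append((idx(r, c), idx(r, c + 1)))
    # O(h*w) edges here vs O((h*w)^2) for a fully connected representation
    feats = np.array([[goal[0] - r, goal[1] - c]
                      for r in range(h) for c in range(w)], dtype=float)
    return np.array(edges).T, feats      # (2, E) edge index, (N, 2) features

edge_index, node_feats = sparse_grid_graph(4, 4, goal=(3, 3))
print(edge_index.shape, node_feats.shape)  # (2, 24) (16, 2)
```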

[310] The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference

Maël Jullien, Marco Valentino, André Freitas

Main category: cs.AI

TL;DR: LLMs perform poorly on clinical reasoning tasks despite high accuracy on factual probes, revealing limitations in structured reasoning.

DetailsMotivation: To assess whether scaling data and parameters in LLMs leads to structured, generalizable internal representations, especially in high-stakes domains like clinical trials.

Method: Introduced a Clinical Trial NLI benchmark with four reasoning families and GKMRV probes to dissociate factual access from inference failures. Evaluated six LLMs under direct and chain-of-thought prompting.

Result: High GKMRV accuracy (0.918) but poor reasoning task performance (0.25), with consistent yet incorrect inferences (0.87 consistency), indicating reliance on heuristics.

Conclusion: LLMs lack structured, composable representations for reliable reasoning, despite possessing relevant knowledge. GKMRV probes effectively measure this dissociation.

Abstract: Large language models are often assumed to acquire increasingly structured, generalizable internal representations simply by scaling data and parameters. We interrogate this assumption by introducing a Clinical Trial Natural Language Inference benchmark comprising four reasoning families, Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction. Each item is paired with a targeted Ground Knowledge and Meta-Level Reasoning Verification (GKMRV) probe, allowing us to dissociate failures of factual access from failures of inference. We evaluate six contemporary LLMs under both direct and chain of thought prompting. Models achieve near-ceiling GKMRV accuracy (mean accuracy 0.918) yet perform poorly on the main reasoning tasks (mean accuracy 0.25). Despite low accuracy, output inferences are highly consistent across samples (mean 0.87), indicating a systematic application of underlying heuristics and shortcuts. These results reveal fundamental structural and representational limitations: current LLMs often possess the relevant clinical knowledge but lack the structured, composable internal representations needed to deploy it reliably (e.g., integrating constraints, weighing evidence, or simulating counterfactuals). Decoupling knowledge from reasoning with GKMRV makes this dissociation explicit and measurable, providing an effective framework for probing the reliability of LLMs in high-stakes domains.

[311] Who Benefits from AI Explanations? Towards Accessible and Interpretable Systems

Maria J. P. Peixoto, Akriti Pandey, Ahsan Zaman, Peter R. Lewis

Main category: cs.AI

TL;DR: The paper highlights accessibility gaps in eXplainable AI (XAI) for vision-impaired users, revealing a lack of inclusion in evaluations and proposing a four-part method for inclusive XAI design.

DetailsMotivation: To address the underexplored accessibility of XAI methods for users with vision impairments, ensuring equitable interpretability.

Method: A two-pronged approach: literature review of 79 studies and a four-part methodological proof of concept (categorization, persona definition, prototype design, and expert/user assessment).

Result: Simplified explanations are more comprehensible for non-visual users, and multimodal presentation enhances equitable interpretability.

Conclusion: Inclusive XAI design requires addressing accessibility gaps and adopting multimodal approaches for broader usability.

Abstract: As AI systems are increasingly deployed to support decision-making in critical domains, explainability has become a means to enhance the understandability of these outputs and enable users to make more informed and conscious choices. However, despite growing interest in the usability of eXplainable AI (XAI), the accessibility of these methods, particularly for users with vision impairments, remains underexplored. This paper investigates accessibility gaps in XAI through a two-pronged approach. First, a literature review of 79 studies reveals that evaluations of XAI techniques rarely include disabled users, with most explanations relying on inherently visual formats. Second, we present a four-part methodological proof of concept that operationalizes inclusive XAI design: (1) categorization of AI systems, (2) persona definition and contextualization, (3) prototype design and implementation, and (4) expert and user assessment of XAI techniques for accessibility. Preliminary findings suggest that simplified explanations are more comprehensible for non-visual users than detailed ones, and that multimodal presentation is required for more equitable interpretability.

[312] Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval

Shengjie Ma, Qi Chu, Jiaxin Mao, Xuhui Jiang, Haozhe Duan, Chong Chen

Main category: cs.AI

TL;DR: A novel few-shot approach using LLMs for interpretable and expert-aligned relevance judgments in legal case retrieval, improving accuracy and transparency.

DetailsMotivation: Traditional legal case relevance judgments require expertise and time, lack interpretability, and LLMs' potential in this domain is underexplored.

Method: Decomposes judgment into stages mimicking human annotators, incorporates expert reasoning, and ensures interpretable labeling.

Result: LLMs produce reliable, valid relevance assessments, and transfer expertise to smaller models via knowledge distillation.

Conclusion: The approach enhances legal case retrieval with interpretability, accuracy, and scalability.

Abstract: Determining which legal cases are relevant to a given query involves navigating lengthy texts and applying nuanced legal reasoning. Traditionally, this task has demanded significant time and domain expertise to identify key Legal Facts and reach sound juridical conclusions. In addition, existing data with legal case similarities often lack interpretability, making it difficult to understand the rationale behind relevance judgments. With the growing capabilities of large language models (LLMs), researchers have begun investigating their potential in this domain. Nonetheless, the method of employing a general large language model for reliable relevance judgments in legal case retrieval remains largely unexplored. To address this gap in research, we propose a novel few-shot approach where LLMs assist in generating expert-aligned interpretable relevance judgments. The proposed approach decomposes the judgment process into several stages, mimicking the workflow of human annotators and allowing for the flexible incorporation of expert reasoning to improve the accuracy of relevance judgments. Importantly, it also ensures interpretable data labeling, providing transparency and clarity in the relevance assessment process. Through a comparison of relevance judgments made by LLMs and human experts, we empirically demonstrate that the proposed approach can yield reliable and valid relevance assessments. Furthermore, we demonstrate that with minimal expert supervision, our approach enables a large language model to acquire case analysis expertise and subsequently transfers this ability to a smaller model via annotation-based knowledge distillation.

[313] Federated Cross-Training Learners for Robust Generalization under Data Heterogeneity

Zhuang Qi, Lei Meng, Ruohan Zhang, Yu Wang, Xin Qi, Xiangxu Meng, Han Yu, Qiang Yang

Main category: cs.AI

TL;DR: FedCT introduces a cross-training scheme for federated learning, combining local and global knowledge distillation to align feature spaces and improve generalization. It includes modules for consistency-aware broadcasting, multi-view representation learning, and feature augmentation, outperforming state-of-the-art methods.

DetailsMotivation: Addressing feature space heterogeneity and misaligned optimization goals in federated learning due to differing data distributions.

Method: FedCT uses three modules: consistency-aware knowledge broadcasting, multi-view knowledge-guided representation learning, and mixup-based feature augmentation.

Result: FedCT outperforms state-of-the-art methods, alleviating knowledge forgetting and improving feature alignment.

Conclusion: FedCT effectively balances local and global knowledge, enhancing federated learning performance.

Abstract: Federated learning benefits from cross-training strategies, which enable models to train on data from distinct sources to improve generalization capability. However, due to inherent differences in data distributions, the optimization goals of local models remain misaligned, and this mismatch continues to manifest as feature space heterogeneity even after cross-training. We argue that knowledge distillation from the personalized view preserves client-specific characteristics and expands the local knowledge base, while distillation from the global view provides consistent semantic anchors that facilitate feature alignment across clients. To achieve this goal, this paper presents a cross-training scheme, termed FedCT, which includes three main modules. The consistency-aware knowledge broadcasting module optimizes model assignment strategies, enhancing collaborative advantages between clients and enabling an efficient federated learning process. The multi-view knowledge-guided representation learning module leverages fused prototypical knowledge from both global and local views to enhance the preservation of local knowledge before and after model exchange, as well as to ensure consistency between local and global knowledge. The mixup-based feature augmentation module aggregates rich information to further increase the diversity of feature spaces, which enables the model to better discriminate complex samples. Extensive experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study. The results demonstrate that FedCT alleviates knowledge forgetting from both local and global views, enabling it to outperform state-of-the-art methods.
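
Of the three modules, mixup-based feature augmentation builds on a well-established technique; a generic feature-level version is sketched below. The pairing and lambda-sampling choices are standard mixup defaults, not necessarily FedCT's exact configuration.

```python
import numpy as np

def mixup_features(feats: np.ndarray, labels: np.ndarray, alpha: float = 0.2):
    """Convexly combine random pairs of feature vectors and (one-hot) labels."""
    lam = np.random.beta(alpha, alpha)       # mixing coefficient
    perm = np.random.permutation(len(feats)) # random pairing of samples
    mixed_x = lam * feats + (1 - lam) * feats[perm]
    mixed_y = lam * labels + (1 - lam) * labels[perm]
    return mixed_x, mixed_y
```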

[314] A Random-Key Optimizer for Combinatorial Optimization

Antonio A. Chaves, Mauricio G. C. Resende, Martin J. A. Schuetz, J. Kyle Brubaker, Helmut G. Katzgraber, Edilson F. de Arruda, Ricardo M. A. Silva

Main category: cs.AI

TL;DR: The paper introduces the Random-Key Optimizer (RKO), a stochastic local search method for combinatorial optimization, using random-key encoding and modular metaheuristics.

DetailsMotivation: To address combinatorial optimization problems efficiently by leveraging a flexible and modular framework that integrates various metaheuristics.

Method: RKO encodes solutions as random-key vectors, decoded into feasible solutions, and combines metaheuristics like simulated annealing and iterated local search, facilitated by an elite solution pool.

Result: RKO demonstrates high-quality solutions for NP-hard problems (alpha-neighborhood p-median, tree of hubs location, and node-capacitated graph partitioning).

Conclusion: RKO is a robust and versatile tool for combinatorial optimization, adaptable to diverse problem domains.

Abstract: This paper introduces the Random-Key Optimizer (RKO), a versatile and efficient stochastic local search method tailored for combinatorial optimization problems. Using the random-key concept, RKO encodes solutions as vectors of random keys that are subsequently decoded into feasible solutions via problem-specific decoders. The RKO framework is able to combine a plethora of classic metaheuristics, each capable of operating independently or in parallel, with solution sharing facilitated through an elite solution pool. This modular approach allows for the adaptation of various metaheuristics, including simulated annealing, iterated local search, and greedy randomized adaptive search procedures, among others. The efficacy of the RKO framework, implemented in C++ and publicly available (Github public repository: github.com/RKO-solver), is demonstrated through its application to three NP-hard combinatorial optimization problems: the alpha-neighborhood p-median problem, the tree of hubs location problem, and the node-capacitated graph partitioning problem. The results highlight the framework’s ability to produce high-quality solutions across diverse problem domains, underscoring its potential as a robust tool for combinatorial optimization.
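
The random-key idea itself is compact enough to show directly: a vector of keys in [0, 1) decodes into a permutation by sorting, and problem-specific decoders build feasible solutions on top of that ordering. A minimal sketch of the classic decoder (Bean, 1994):

```python
import numpy as np

def decode_permutation(keys: np.ndarray) -> np.ndarray:
    """Decode a random-key vector into a visiting order by sorting the keys."""
    return np.argsort(keys)

rng = np.random.default_rng(42)
keys = rng.random(6)                 # one key per decision variable
print(keys.round(2), "->", decode_permutation(keys))
```

The metaheuristics in the framework then search in the continuous key space, leaving feasibility entirely to the decoder.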

[315] MedRep: Medical Concept Representation for General Electronic Health Record Foundation Models

Junmo Kim, Namkyeong Lee, Jiwon Kim, Kwangsoo Kim

Main category: cs.AI

TL;DR: Proposes MedRep, a novel medical concept representation for EHR foundation models to address out-of-vocabulary medical codes, enhancing generalizability and performance.

DetailsMotivation: Address the limitation of EHR foundation models in handling unseen medical codes, which restricts their generalizability and interoperability.

Method: Uses OMOP CDM, enriches concept definitions via LLM prompts, and integrates graph ontology for representation learning.

Result: Outperforms vanilla EHR models and prior tokenizers in prediction tasks, validated externally.

Conclusion: MedRep improves EHR model generalizability and performance by addressing vocabulary limitations.

Abstract: Electronic health record (EHR) foundation models have become an area ripe for exploration, given their improved performance in various medical tasks. Despite the rapid advances, there exists a fundamental limitation: they cannot process unseen medical codes that fall outside their vocabulary. This problem limits the generalizability of EHR foundation models and the integration of models trained with different vocabularies. To alleviate this problem, we propose a set of novel medical concept representations (MedRep) for EHR foundation models based on the observational medical outcome partnership (OMOP) common data model (CDM). For concept representation learning, we enrich the information of each concept with a minimal definition through large language model (LLM) prompts and complement the text-based representations through the graph ontology of the OMOP vocabulary. Our approach outperforms the vanilla EHR foundation model and the model with a previously introduced medical code tokenizer in diverse prediction tasks. We also demonstrate the generalizability of MedRep through external validation.

[316] FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory

Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani, Pietro Liò

Main category: cs.AI

TL;DR: FAIRGAME is a framework for detecting biases in AI agent interactions using game theory, enabling reproducible and standardized analysis of strategic outcomes.

DetailsMotivation: The complexity of multi-agent AI interactions necessitates tools for interpretability and bias detection to ensure trustworthy adoption.

Method: FAIRGAME leverages game theory and provides a user-friendly IT framework to simulate games, compare results, and analyze biases based on LLMs, language, and agent traits.

Result: The framework successfully identifies biased outcomes in AI agent interactions, facilitating comparison with game-theoretic predictions.

Conclusion: FAIRGAME empowers systematic bias discovery, behavior anticipation, and advances research in strategic decision-making with LLM agents.

Abstract: Letting AI agents interact in multi-agent applications adds a layer of complexity to the interpretability and prediction of AI outcomes, with profound implications for their trustworthy adoption in research and society. Game theory offers powerful models to capture and interpret strategic interaction among agents, but requires the support of reproducible, standardized and user-friendly IT frameworks to enable comparison and interpretation of results. To this end, we present FAIRGAME, a Framework for AI Agents Bias Recognition using Game Theory. We describe its implementation and usage, and we employ it to uncover biased outcomes in popular games among AI agents, depending on the Large Language Model (LLM) employed and the language used, as well as on the personality traits or strategic knowledge of the agents. Overall, FAIRGAME allows users to reliably and easily simulate their desired games and scenarios and compare the results across simulation campaigns and with game-theoretic predictions, enabling the systematic discovery of biases, the anticipation of emerging behavior out of strategic interplays, and empowering further research into strategic decision-making using LLM agents.

[317] LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang

Main category: cs.AI

TL;DR: LAPO optimizes reasoning length in models, reducing token usage by 40.9% while improving accuracy by 2.3%.

DetailsMotivation: Address excessive token generation in large reasoning models by making length control intrinsic.

Method: Two-stage reinforcement learning: learning natural reasoning patterns and embedding them as meta-cognitive guidance.

Result: 40.9% reduction in token usage and 2.3% accuracy improvement.

Conclusion: LAPO enables efficient, flexible reasoning without quality loss.

Abstract: Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model’s reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.

[318] Beyond Accuracy: How AI Metacognitive Sensitivity improves AI-assisted Decision Making

ZhaoBin Li, Mark Steyvers

Main category: cs.AI

TL;DR: AI’s metacognitive sensitivity (accurate confidence scoring) can improve human decision-making, sometimes outperforming higher-accuracy AI with lower sensitivity.

DetailsMotivation: To understand how AI's confidence estimates (metacognitive sensitivity) impact human decision-making alongside predictive accuracy.

Method: Introduced a theoretical framework and conducted a behavioral experiment to assess AI’s joint impact on decision quality.

Result: AI with lower accuracy but higher metacognitive sensitivity can enhance human decision-making.

Conclusion: AI assistance should be evaluated and optimized for both accuracy and metacognitive sensitivity to improve decision outcomes.

Abstract: In settings where human decision-making relies on AI input, both the predictive accuracy of the AI system and the reliability of its confidence estimates influence decision quality. We highlight the role of AI metacognitive sensitivity – its ability to assign confidence scores that accurately distinguish correct from incorrect predictions – and introduce a theoretical framework for assessing the joint impact of AI’s predictive accuracy and metacognitive sensitivity in hybrid decision-making settings. Our analysis identifies conditions under which an AI with lower predictive accuracy but higher metacognitive sensitivity can enhance the overall accuracy of human decision making. Finally, a behavioral experiment confirms that greater AI metacognitive sensitivity improves human decision performance. Together, these findings underscore the importance of evaluating AI assistance not only by accuracy but also by metacognitive sensitivity, and of optimizing both to achieve superior decision outcomes.
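
One common way to operationalize metacognitive sensitivity is the type-2 AUC: the probability that the AI assigns higher confidence to a correct prediction than to an incorrect one. The paper's exact measure may differ; the sketch below is a generic illustration.

```python
import numpy as np

def type2_auc(confidence: np.ndarray, correct: np.ndarray) -> float:
    """P(confidence on a correct trial > on an incorrect trial), ties = 0.5."""
    pos, neg = confidence[correct == 1], confidence[correct == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

conf = np.array([0.9, 0.8, 0.7, 0.6, 0.4])
corr = np.array([1, 1, 0, 1, 0])
print(round(type2_auc(conf, corr), 3))  # 1.0 would be perfect sensitivity
```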

[319] On the Definition of Intelligence

Kei-Sing Ng

Main category: cs.AI

TL;DR: The paper proposes a species-agnostic definition of intelligence based on ‘entity fidelity’, formalizing it as ε-concept intelligence, and discusses its implications for evaluation, safety, and generalization.

DetailsMotivation: To define intelligence in a generalizable way that encompasses diverse paradigms of intelligent behavior, enabling better evaluation and engineering of AGI.

Method: Introduces the concept of ε-concept intelligence, formalizing intelligence as the ability to generate entities exemplifying a concept with fidelity ε. Outlines empirical protocols for evaluation.

Result: A formal framework for evaluating intelligence based on entity fidelity, with potential applications in AGI evaluation, safety, and generalization.

Conclusion: The proposed ε-concept intelligence provides a measurable and generalizable criterion for intelligence, offering a foundation for AGI engineering and evaluation.

Abstract: To engineer AGI, we should first capture the essence of intelligence in a species-agnostic form that can be evaluated, while being sufficiently general to encompass diverse paradigms of intelligent behavior, including reinforcement learning, generative models, classification, analogical reasoning, and goal-directed decision-making. We propose a general criterion based on ‘entity fidelity’: intelligence is the ability, given entities exemplifying a concept, to generate entities exemplifying the same concept. We formalise this intuition as ε-concept intelligence: a system is ε-intelligent with respect to a concept if no chosen admissible distinguisher can separate its generated entities from the original entities beyond tolerance ε. We present the formal framework, outline empirical protocols, and discuss implications for evaluation, safety, and generalization.
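
Reading the abstract's definition literally, one plausible formalization (the paper's precise quantifiers may differ) is:

```latex
% Hedged reconstruction of the abstract's definition, not the paper's
% verbatim statement.
\[
  G \text{ is } \varepsilon\text{-intelligent w.r.t. a concept } C
  \iff
  \sup_{D \in \mathcal{D}}
  \bigl| \mathbb{E}_{x \sim G}[D(x)] - \mathbb{E}_{x \sim P_C}[D(x)] \bigr|
  \le \varepsilon ,
\]
where $P_C$ is the distribution of original entities exemplifying $C$ and
$\mathcal{D}$ is the chosen class of admissible distinguishers.
```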

[320] TextQuests: How Good are LLMs at Text-Based Video Games?

Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks

Main category: cs.AI

TL;DR: TextQuests is a benchmark for evaluating AI agents’ autonomous problem-solving in exploratory environments using text-based adventures.

DetailsMotivation: Existing benchmarks fail to assess AI agents' long-context reasoning and self-directed problem-solving in exploratory settings.

Method: TextQuests uses Infocom interactive fiction games to create a benchmark that tests intrinsic reasoning without external tools.

Result: The benchmark evaluates agents’ trial-and-error learning and sustained problem-solving in long, interactive sessions.

Conclusion: TextQuests provides a more accurate assessment of AI agents’ capabilities in challenging, exploratory environments.

Abstract: Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent’s ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To enable a more accurate assessment of AI agents in challenging exploratory environments, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent’s capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.

[321] Compass-Thinker-7B Technical Report

Anxiang Zeng, Haibo Zhang, Kaixiang Mo, Long Zhang, Shuman Liu, Yanhui Huang, Yawen Liu, Yuepeng Sheng, Yuwei Huang

Main category: cs.AI

TL;DR: The paper introduces Compass-Thinker-7B, a model exploring Reinforcement Learning (RL) for reasoning in LLMs with reduced computational costs, achieving strong performance on math tasks.

DetailsMotivation: Hyperscale RL experiments are costly and risky; the goal is to explore RL's potential for reasoning in smaller models like Compass-Thinker-7B.

Method: Train Compass-Thinker-7B using a custom RL pipeline on 30k math problems, with staged difficulty distributions to improve efficiency.

Result: Compass-Thinker-7B shows exceptional reasoning, outperforming same-sized RL models, achieving 40% accuracy on AIME2024.

Conclusion: The model demonstrates RL’s viability for reasoning in smaller LLMs, offering insights for scaling to larger models.

Abstract: Recent R1-Zero-like research further demonstrates that extended reasoning has given large language models (LLMs) unprecedented reasoning capabilities, with Reinforcement Learning (RL) as the core technology for eliciting such complex reasoning. However, conducting RL experiments directly on hyperscale models involves high computational costs and resource demands, posing significant risks. We propose the Compass-Thinker-7B model, which aims to explore the potential of Reinforcement Learning with less computational resources and costs, and provides insights for further research into RL recipes for larger models. Compass-Thinker-7B is trained from an open-source model through a specially designed Reinforcement Learning Pipeline. We curate a dataset of 30k verifiable mathematics problems for the pipeline. By configuring data and training settings with different difficulty distributions for different stages, the potential of the model is gradually released and the training efficiency is improved. Extensive evaluations show that Compass-Thinker-7B possesses exceptional reasoning potential, and achieves superior performance on mathematics compared to RL models of the same size. Especially in the challenging AIME2024 evaluation, Compass-Thinker-7B achieves 40% accuracy.

[322] OpenCUA: Open Foundations for Computer-Use Agents

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Haotian Yao, Ziwei Chen, Qizheng Gu, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, Tao Yu

Main category: cs.AI

TL;DR: OpenCUA is an open-source framework for vision-language models (CUAs) to automate computer tasks, offering tools, datasets, and models for research.

DetailsMotivation: The lack of open frameworks for studying CUAs' capabilities and risks motivates the creation of OpenCUA.

Method: OpenCUA includes an annotation infrastructure, AgentNet dataset, and a scalable pipeline for state-action pairs with Chain-of-Thought reasoning.

Result: OpenCUA-32B achieves 34.8% success on OSWorld-Verified, surpassing GPT-4o, and shows strong generalization and scalability.

Conclusion: OpenCUA provides open foundations for CUA research, demonstrating SOTA performance and scalability.

Abstract: Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.

[323] Mathematical Computation and Reasoning Errors by Large Language Models

Liang Zhang, Edith Aurora Graf

Main category: cs.AI

TL;DR: The study evaluates four LLMs (OpenAI GPT-4o, o1, DeepSeek-V3, R1) on math tasks, highlighting OpenAI o1’s superior accuracy and dual-agent improvements.

DetailsMotivation: To assess LLM accuracy in math education and identify step-level errors for reliable feedback.

Method: Built challenging math tasks, analyzed answer accuracy and step errors, tested single/dual-agent setups.

Result: OpenAI o1 excelled; procedural slips were common, dual-agent setups boosted performance.

Conclusion: Findings guide LLM enhancement and effective integration into math education.

Abstract: Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment, particularly within mathematics education. The capability of LLMs to generate accurate answers and detailed solutions for math problem-solving tasks is foundational for ensuring reliable and precise feedback and assessment in math education practices. Our study focuses on evaluating the accuracy of four LLMs (OpenAI GPT-4o and o1, DeepSeek-V3 and DeepSeek-R1) solving three categories of math tasks, including arithmetic, algebra, and number theory, and identifies step-level reasoning errors within their solutions. Instead of relying on standard benchmarks, we intentionally build math tasks (via item models) that are challenging for LLMs and prone to errors. The accuracy of final answers and the presence of errors in individual solution steps were systematically analyzed and coded. Both single-agent and dual-agent configurations were tested. It is observed that the reasoning-enhanced OpenAI o1 model consistently achieved higher or nearly perfect accuracy across all three math task categories. Analysis of errors revealed that procedural slips were the most frequent and significantly impacted overall performance, while conceptual misunderstandings were less frequent. Deploying dual-agent configurations substantially improved overall performance. These findings offer actionable insights into enhancing LLM performance and underscore effective strategies for integrating LLMs into mathematics education, thereby advancing AI-driven instructional practices and assessment precision.

cs.SD

[324] Whisper Smarter, not Harder: Adversarial Attack on Partial Suppression

Zheng Jie Wong, Bingquan Shen

Main category: cs.SD

TL;DR: The paper investigates adversarial attacks on ASR models, focusing on improving their imperceptibility and exploring defenses like low-pass filters.

DetailsMotivation: To understand and mitigate vulnerabilities in ASR models against adversarial attacks.

Method: Relaxing the optimization objective from complete to partial suppression of model output and testing defenses.

Result: Partial suppression increases attack imperceptibility; low-pass filters show promise as a defense.

Conclusion: Adversarial attacks on ASR models can be made less perceptible, and low-pass filters may effectively defend against them.

Abstract: Currently, Automatic Speech Recognition (ASR) models are deployed in an extensive range of applications. However, recent studies have demonstrated the possibility of adversarial attacks on these models which could potentially suppress or disrupt model output. We investigate and verify the robustness of these attacks and explore whether it is possible to increase their imperceptibility. We additionally find that by relaxing the optimisation objective from complete suppression to partial suppression, we can further increase the imperceptibility of the attack. We also explore possible defences against these attacks and show that a low-pass filter could potentially serve as an effective defence.

[325] Dynamic Synchronization and Resonance as a Universal Origin of 1/f Fluctuations – Amplitude Modulation Across Music and Nature

Akika Nakamichi, Izumi Uesaka, Masahiro Morikawa

Main category: cs.SD

TL;DR: The paper proposes a universal mechanism for 1/f fluctuations via amplitude modulation/demodulation, validated in acoustic cases and other domains.

DetailsMotivation: To explain the widespread occurrence of 1/f fluctuations in diverse systems.

Method: Uses amplitude modulation (AM) and demodulation (DM), with two processes: stochastic synchronization (Kuramoto model) and frequency-selective resonance.

Result: Both mechanisms robustly produce 1/f spectra when DM is applied, without requiring the Kuramoto critical point.

Conclusion: Demodulation is a general route to 1/f fluctuations, explaining its ubiquity in natural and engineered systems.

Abstract: We propose a universal physical mechanism for the emergence of 1/f fluctuations, observed across a wide range of systems. In particular, we verify it for acoustic signals. The mechanism is based on amplitude modulation (AM) and demodulation (DM), where the 1/f spectral law arises not in the raw waveform but in its demodulated amplitude envelope. Two distinct yet complementary processes generate the required AM: (i) stochastic synchronization among oscillators, modeled via an extended Kuramoto framework that captures perpetual synchronization-desynchronization cycles, and (ii) frequency-selective resonance, modeled by spectral accumulation of eigenmodes in acoustic or structural environments. Numerical simulations demonstrate that both mechanisms, acting separately or in combination, robustly produce 1/f spectra over several decades when DM is applied, and that the classical Kuramoto critical point is not necessary for their emergence. We demonstrate the cross-domain relevance of this AM/DM framework through analyses of musical performances, seismic records, and astrophysical time series, revealing a common underlying structure. This work establishes demodulation as a general route to 1/f fluctuations, providing a simple and scalable explanation for its ubiquity in both natural and engineered systems. Keywords: 1/f fluctuation, amplitude modulation, synchronization, resonance, Kuramoto model, music, natural noise, demodulation
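
The AM/DM pipeline is easy to reproduce numerically. The toy sketch below modulates a carrier with a slowly varying amplitude, demodulates via the Hilbert envelope, and fits the envelope's spectral slope. The random-walk modulator is a generic stand-in (yielding a slope nearer -2) rather than the paper's Kuramoto or resonance models.

```python
import numpy as np
from scipy.signal import hilbert, welch

fs, n = 1000, 2**16
t = np.arange(n) / fs
rng = np.random.default_rng(1)
walk = np.cumsum(rng.normal(size=n))             # slowly varying amplitude
modulator = 1 + 0.5 * walk / np.abs(walk).max()  # keep amplitude positive
carrier = np.sin(2 * np.pi * 50 * t)             # 50 Hz carrier
signal = modulator * carrier                     # amplitude modulation (AM)

envelope = np.abs(hilbert(signal))               # demodulation (DM)
f, pxx = welch(envelope, fs=fs, nperseg=4096)

mask = (f > 0.2) & (f < 10)                      # low-frequency band
slope = np.polyfit(np.log(f[mask]), np.log(pxx[mask]), 1)[0]
print(f"envelope spectral slope ~ {slope:.2f}")  # ~ -2 for this random walk
```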

[326] No Free Lunch from Audio Pretraining in Bioacoustics: A Benchmark Study of Embeddings

Chenggang Chen, Zhiyu Yang

Main category: cs.SD

TL;DR: Benchmarking 11 DL models for bioacoustic tasks shows fine-tuning is crucial, with ResNet excelling in separating background sounds.

DetailsMotivation: To evaluate the effectiveness of audio-pretrained DL models for bioacoustic tasks, highlighting the need for fine-tuning.

Method: Benchmarked 11 DL models by reducing embedding dimensionality and evaluating via clustering.

Result: Fine-tuned models outperform non-fine-tuned ones; ResNet excels in background separation.

Conclusion: Fine-tuning is essential for audio-pretrained models, and embeddings should be checked post-tuning.

Abstract: Bioacoustics, the study of animal sounds, offers a non-invasive method to monitor ecosystems. Extracting embeddings from audio-pretrained deep learning (DL) models without fine-tuning has become popular for obtaining bioacoustic features for downstream tasks. However, a recent benchmark study reveals that while fine-tuned audio-pretrained VGG and transformer models achieve state-of-the-art performance in some tasks, they fail in others. This study benchmarks 11 DL models on the same tasks by reducing their learned embeddings’ dimensionality and evaluating them through clustering. We found that audio-pretrained DL models 1) underperform even a fine-tuned AlexNet when used without fine-tuning, 2) fail, both with and without fine-tuning, to separate the background from labeled sounds, whereas ResNet succeeds, and 3) outperform other models when fewer background sounds are included during fine-tuning. This study underscores the necessity of fine-tuning audio-pretrained models and of checking the embeddings after fine-tuning. Our code is available at https://github.com/NeuroscienceAI/Audio_Embeddings
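
The evaluation recipe (reduce embedding dimensionality, cluster, score agreement with labels) can be reproduced generically as below. PCA, k-means, and the adjusted Rand index are stand-in choices, not necessarily the study's exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 512))   # e.g., one vector per audio clip
labels = rng.integers(0, 5, size=300)      # ground-truth call types

reduced = PCA(n_components=10).fit_transform(embeddings)
pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)
print("ARI:", adjusted_rand_score(labels, pred))  # ~0 for random embeddings
```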

[327] A dataset and model for recognition of audiologically relevant environments for hearing aids: AHEAD-DS and YAMNet+

Henry Zhong, Jörg M. Buchholz, Julian Maclaren, Simon Carlile, Richard Lyon

Main category: cs.SD

TL;DR: The paper introduces AHEAD-DS, a dataset for audiologically relevant scene recognition, and YAMNet+, a model for edge devices, achieving high accuracy and low latency.

DetailsMotivation: Existing datasets lack accessibility and relevance for hearing aids, and deploying models on edge devices is challenging.

Method: Created AHEAD-DS from open-source datasets and developed YAMNet+, a sound recognition model for edge deployment.

Result: YAMNet+ achieved 0.83 mAP and 0.93 accuracy on AHEAD-DS, with real-time performance on smartphones.

Conclusion: The work provides a standardized dataset and efficient model for hearing aid applications, demonstrating practical edge deployment.

Abstract: Scene recognition of audiologically relevant environments is important for hearing aids; however, it is challenging, in part because of the limitations of existing datasets. Datasets often lack public accessibility, completeness, or audiologically relevant labels, hindering systematic comparison of machine learning models. Deploying these models on resource-constrained edge devices presents another challenge. Our solution is two-fold: we leverage several open source datasets to create AHEAD-DS, a dataset designed for scene recognition of audiologically relevant environments, and introduce YAMNet+, a sound recognition model. AHEAD-DS aims to provide a standardised, publicly available dataset with consistent labels relevant to hearing aids, facilitating model comparison. YAMNet+ is designed for deployment on edge devices like smartphones connected to hearing devices, such as hearing aids and wireless earphones with hearing aid functionality, and serves as a baseline model for sound-based scene recognition. YAMNet+ achieved a mean average precision of 0.83 and an accuracy of 0.93 on the testing set of AHEAD-DS across fourteen categories of audiologically relevant environments. We found that applying transfer learning from the pretrained YAMNet model was essential. We demonstrated real-time sound-based scene recognition capabilities on edge devices by deploying YAMNet+ to an Android smartphone. Even on a Google Pixel 3 (a phone with modest specifications, released in 2018), loading the model takes approximately 50 ms, and processing time grows approximately linearly at about 30 ms per second of audio. Our website and code are available at https://github.com/Australian-Future-Hearing-Initiative

[328] Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning

Yejin Jeon, Solee Im, Youngjae Kim, Gary Geunbae Lee

Main category: cs.SD

TL;DR: A framework for personalized TTS for dysarthric speakers using domain transfer and teacher-student models improves speech synthesis with fewer errors and high fidelity.

DetailsMotivation: Dysarthric speakers struggle with speech intelligibility, making dataset curation and personalized TTS model training difficult due to limited and error-prone audio data.

Method: The approach uses a knowledge anchoring framework with a teacher-student model and curriculum learning via audio augmentation to address domain transfer.

Result: The zero-shot multi-speaker TTS model reduces articulation errors, maintains high speaker fidelity, and preserves prosodic naturalness.

Conclusion: The proposed method effectively addresses challenges in personalized speech synthesis for dysarthric speakers, improving synthetic speech quality.

Abstract: Dysarthric speakers experience substantial communication challenges due to impaired motor control of the speech apparatus, which leads to reduced speech intelligibility. This creates significant obstacles for dataset curation, since recording the long, clearly articulated sentences needed to train personalized TTS models is infeasible. The limited availability of audio data, together with the articulation errors present in it, thus complicates personalized speech synthesis for target dysarthric speaker adaptation. To address this, we frame the issue as a domain transfer task and introduce a knowledge anchoring framework that leverages a teacher-student model, enhanced by curriculum learning through audio augmentation. Experimental results show that the proposed zero-shot multi-speaker TTS model effectively generates synthetic speech with markedly reduced articulation errors and high speaker fidelity, while maintaining prosodic naturalness.

[329] Alternating Approach-Putt Models for Multi-Stage Speech Enhancement

Iksoon Jeong, Kyung-Joong Kim, Kang-Hun Ahn

Main category: cs.SD

TL;DR: A post-processing neural network (PuttNet) is proposed to reduce artifacts from speech enhancement models, improving speech quality metrics like PESQ, STOI, and CBAK.

DetailsMotivation: Speech enhancement networks often introduce distortions (artifacts) that degrade audio quality, necessitating a solution to mitigate these artifacts.

Method: The proposed PuttNet acts as a post-processor, alternating with a speech enhancement model to iteratively improve speech quality.

Result: The alternating approach outperforms repeated use of either model alone, enhancing PESQ, STOI, and CBAK scores.

Conclusion: PuttNet effectively reduces artifacts and improves speech quality when alternated with a speech enhancement model.

Abstract: Speech enhancement using artificial neural networks aims to remove noise from noisy speech signals while preserving the speech content. However, speech enhancement networks often introduce distortions to the speech signal, referred to as artifacts, which can degrade audio quality. In this work, we propose a post-processing neural network designed to mitigate artifacts introduced by speech enhancement models. Inspired by the analogy of making a ‘Putt’ after an ‘Approach’ in golf, we name our model PuttNet. We demonstrate that alternating between a speech enhancement model and the proposed Putt model leads to improved speech quality, as measured by perceptual quality (PESQ), objective intelligibility (STOI), and background noise intrusiveness (CBAK) scores. Furthermore, we illustrate with graphical analysis why this alternating approach outperforms repeated application of either model alone.
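
A minimal skeleton of the alternating inference loop, assuming trained `enhance` and `putt` networks (both names are hypothetical placeholders):

```python
def alternate(noisy, enhance, putt, n_rounds: int = 3):
    """Alternate enhancement and artifact removal for a fixed budget."""
    speech = noisy
    for _ in range(n_rounds):
        speech = enhance(speech)   # 'Approach': suppress noise
        speech = putt(speech)      # 'Putt': clean up introduced artifacts
    return speech
```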

[330] Advances in Speech Separation: Techniques, Challenges, and Future Trends

Kai Li, Guo Chen, Wendi Sang, Yi Luo, Zhuo Chen, Shuai Wang, Shulin He, Zhong-Qiu Wang, Andong Li, Zhiyong Wu, Xiaolin Hu

Main category: cs.SD

TL;DR: This survey systematically reviews DNN-based speech separation, covering learning paradigms, scenarios, frameworks, and architectures, while offering insights and evaluations.

DetailsMotivation: To address the fragmented understanding in speech separation literature by providing a comprehensive and timely review of DNN-based techniques.

Method: Systematic examination of learning paradigms, separation scenarios, frameworks (supervised/self-supervised/unsupervised), and architectural components, with quantitative evaluations on standard datasets.

Result: Reveals capabilities and limitations of methods, identifies emerging patterns, and highlights promising directions like domain-robust frameworks and multimodal integration.

Conclusion: The survey serves as a valuable reference for researchers, offering insights into current innovations and future directions in speech separation.

Abstract: The field of speech separation, addressing the “cocktail party problem”, has seen revolutionary advances with DNNs. Speech separation enhances clarity in complex acoustic environments and serves as crucial pre-processing for speech recognition and speaker recognition. However, current literature focuses narrowly on specific architectures or isolated approaches, creating fragmented understanding. This survey addresses this gap by providing systematic examination of DNN-based speech separation techniques. Our work differentiates itself through: (I) Comprehensive perspective: We systematically investigate learning paradigms, separation scenarios with known/unknown speakers, comparative analysis of supervised/self-supervised/unsupervised frameworks, and architectural components from encoders to estimation strategies. (II) Timeliness: Coverage of cutting-edge developments ensures access to current innovations and benchmarks. (III) Unique insights: Beyond summarization, we evaluate technological trajectories, identify emerging patterns, and highlight promising directions including domain-robust frameworks, efficient architectures, multimodal integration, and novel self-supervised paradigms. (IV) Fair evaluation: We provide quantitative evaluations on standard datasets, revealing true capabilities and limitations of different methods. This comprehensive survey serves as an accessible reference for experienced researchers and newcomers navigating speech separation’s complex landscape.

[331] Motive-level Analysis of Form-functions Association in Korean Folk song

Danbinaerin Han, Dasaem Jeong, Juhan Nam

Main category: cs.SD

TL;DR: A method for automatic motive segmentation in Korean folk songs is proposed, using a fine-tuned speech transcription model. Structural features like motif count and duration entropy were analyzed, showing systematic variation by social function.

DetailsMotivation: Challenges in computational analysis of folk song audio due to structural irregularities and manual annotation needs.

Method: Fine-tuning a speech transcription model on annotated audio lyrics for automatic motive segmentation, applied to 856 songs.

Result: Extracted structural features (motif count, duration entropy) vary by social function (e.g., collective labor vs. entertainment).

Conclusion: Provides a scalable approach for quantitative structural analysis of oral music traditions.

Abstract: Computational analysis of folk song audio is challenging due to structural irregularities and the need for manual annotation. We propose a method for automatic motive segmentation in Korean folk songs by fine-tuning a speech transcription model on audio-lyric data with motif boundary annotations. Applying this to 856 songs, we extracted motif count and duration entropy as structural features. Statistical analysis revealed that these features vary systematically according to the social function of the songs. Songs associated with collective labor, for instance, showed different structural patterns from those for entertainment or personal settings. This work offers a scalable approach for quantitative structural analysis of oral music traditions.

[332] Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform

Yuankun Xie, Ruibo Fu, Xiaopeng Wang, Zhiyong Wang, Ya Li, Zhengqi Wen, Haonnan Cheng, Long Ye

Main category: cs.SD

TL;DR: The paper introduces the Fake Speech Wild (FSW) dataset to improve deepfake speech detection in real-world scenarios, benchmarks current countermeasures, and enhances performance using data augmentation.

DetailsMotivation: To address the performance degradation of deepfake audio countermeasures in cross-domain scenarios and improve real-world detection.

Method: Proposes the FSW dataset, benchmarks existing countermeasures using SSL-based methods, and evaluates data augmentation strategies.

Result: Achieves an average EER of 3.54% across evaluation sets by augmenting public datasets and incorporating FSW.

Conclusion: The FSW dataset and data augmentation significantly enhance real-world deepfake speech detection performance.

Abstract: The rapid advancement of speech generation technology has led to the widespread proliferation of deepfake speech across social media platforms. While deepfake audio countermeasures (CMs) achieve promising results on public datasets, their performance degrades significantly in cross-domain scenarios. To advance CMs for real-world deepfake detection, we first propose the Fake Speech Wild (FSW) dataset, which includes 254 hours of real and deepfake audio from four different media platforms, focusing on social media. We then establish a benchmark using public datasets and advanced self-supervised learning (SSL)-based CMs to evaluate current countermeasures in real-world scenarios. We also assess the effectiveness of data augmentation strategies in enhancing CM robustness for detecting deepfake speech on social media. Finally, by augmenting public datasets and incorporating the FSW training set, we significantly advanced real-world deepfake audio detection performance, achieving an average equal error rate (EER) of 3.54% across all evaluation sets.
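
The reported metric, equal error rate, is the operating point where false acceptance and false rejection rates coincide. A standard computation, assuming higher scores indicate bona fide speech:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER from binary labels (1 = real, 0 = fake) and detection scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))  # point where FPR ≈ FNR
    return float((fpr[idx] + fnr[idx]) / 2)

labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.5, 0.3, 0.1])
print(f"EER = {equal_error_rate(labels, scores):.2%}")
```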

[333] A Training-Free Approach for Music Style Transfer with Latent Diffusion Models

Heehwan Wang, Joonwoo Kwon, Sooyoung Kim, Shinjae Yoo, Yuewei Lin, Jiook Cha

Main category: cs.SD

TL;DR: Stylus is a training-free framework for music style transfer using pre-trained LDMs, enhancing quality and controllability without fine-tuning.

DetailsMotivation: Existing methods for music style transfer require extensive training, paired datasets, or annotations, limiting accessibility and efficiency.

Method: Stylus manipulates self-attention layers of a pre-trained LDM, replacing key and value representations from content audio with style references, and incorporates techniques like query preservation and CFG-inspired guidance.

Result: The method improves perceptual quality and structural preservation compared to prior work, remaining lightweight and easy to deploy.

Conclusion: Stylus demonstrates the potential of diffusion-based attention manipulation for efficient, high-fidelity music generation without training.

Abstract: Music style transfer enables personalized music creation by combining the structure of one piece with the stylistic characteristics of another. While recent approaches have explored text-conditioned generation and diffusion-based synthesis, most require extensive training, paired datasets, or detailed textual annotations. In this work, we introduce Stylus, a novel training-free framework for music style transfer that directly manipulates the self-attention layers of a pre-trained Latent Diffusion Model (LDM). Operating in the mel-spectrogram domain, Stylus transfers musical style by replacing key and value representations from the content audio with those of the style reference, without any fine-tuning. To enhance stylization quality and controllability, we further incorporate query preservation, CFG-inspired guidance scaling, multi-style interpolation, and phase-preserving reconstruction. Our method significantly improves perceptual quality and structural preservation compared to prior work, while remaining lightweight and easy to deploy. This work highlights the potential of diffusion-based attention manipulation for efficient, high-fidelity, and interpretable music generation without training. Code will be released upon acceptance.
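
The core key/value swap can be sketched in a few lines: queries come from the content features so the musical structure is preserved, while keys and values come from the style reference. Real LDM self-attention adds multiple heads and per-layer, per-timestep scheduling, plus the query-preservation and guidance tricks mentioned above; this single-head illustration is only a sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def styled_attention(content_feats, style_feats, wq, wk, wv):
    q = content_feats @ wq                      # queries keep content structure
    k, v = style_feats @ wk, style_feats @ wv   # K/V carry the style reference
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

d = 64
rng = np.random.default_rng(0)
content, style = rng.normal(size=(100, d)), rng.normal(size=(80, d))
wq = rng.normal(size=(d, d)) / np.sqrt(d)
wk = rng.normal(size=(d, d)) / np.sqrt(d)
wv = rng.normal(size=(d, d)) / np.sqrt(d)
print(styled_attention(content, style, wq, wk, wv).shape)  # (100, 64)
```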

cs.LG

[334] OpenFPL: An open-source forecasting method rivaling state-of-the-art Fantasy Premier League services

Daniel Groos

Main category: cs.LG

TL;DR: OpenFPL democratizes accurate Fantasy Premier League forecasts using open-source methods and public data, matching commercial service accuracy.

DetailsMotivation: To provide accessible, high-accuracy player performance forecasts for Fantasy Premier League, currently dominated by proprietary services.

Method: Developed OpenFPL, an open-source forecasting method using position-specific ensemble models and public data from 2020-21 to 2023-24 seasons.

Result: OpenFPL matches commercial service accuracy and surpasses it for high-return players (>2 points), effective across 1-3 gameweek horizons.

Conclusion: OpenFPL offers a viable open-source alternative to commercial forecasts, aiding long-term and final-day decision-making in Fantasy Premier League.

Abstract: Fantasy Premier League engages the football community in selecting the Premier League players who will perform best from gameweek to gameweek. Access to accurate performance forecasts gives participants an edge over competitors by guiding expectations about player outcomes and reducing uncertainty in squad selection. However, high-accuracy forecasts are currently limited to commercial services whose inner workings are undisclosed and that rely on proprietary data. This paper aims to democratize access to highly accurate forecasts of player performance by presenting OpenFPL, an open-source Fantasy Premier League forecasting method developed exclusively from public data. Comprising position-specific ensemble models optimized on Fantasy Premier League and Understat data from four previous seasons (2020-21 to 2023-24), OpenFPL achieves accuracy comparable to a leading commercial service when tested prospectively on data from the 2024-25 season. OpenFPL also surpasses the commercial benchmark for high-return players (> 2 points), which are most influential for rank gains. These findings hold across one-, two-, and three-gameweek forecast horizons, supporting long-term planning of transfers and strategies while also informing final-day decisions.

[335] A Unified Evaluation Framework for Multi-Annotator Tendency Learning

Liyun Zhang, Jingcheng Ke, Shenli Fan, Xuanmeng Sha, Zheng Lian

Main category: cs.LG

TL;DR: The paper introduces a unified evaluation framework for Individual Tendency Learning (ITL) with two new metrics: DIC and BAE, to assess how well ITL methods capture annotator tendencies and provide meaningful explanations.

DetailsMotivation: Existing ITL methods lack a proper evaluation framework to verify if they truly capture annotator-specific tendencies and provide useful behavioral explanations.

Method: Proposes two novel metrics: Difference of Inter-annotator Consistency (DIC) and Behavior Alignment Explainability (BAE), using ground-truth comparison and Multidimensional Scaling (MDS).

Result: Experiments confirm the framework’s effectiveness in evaluating ITL methods.

Conclusion: The proposed framework fills a critical gap in assessing ITL methods, ensuring they accurately model annotator tendencies and provide meaningful explanations.

Abstract: Recent works have emerged in multi-annotator learning that shift focus from Consensus-oriented Learning (CoL), which aggregates multiple annotations into a single ground-truth prediction, to Individual Tendency Learning (ITL), which models annotator-specific labeling behavior patterns (i.e., tendency) to provide explanation analysis for understanding annotator decisions. However, no evaluation framework currently exists to assess whether ITL methods truly capture individual tendencies and provide meaningful behavioral explanations. To address this gap, we propose the first unified evaluation framework with two novel metrics: (1) Difference of Inter-annotator Consistency (DIC) quantifies how well models capture annotator tendencies by comparing predicted inter-annotator similarity structures with ground-truth; (2) Behavior Alignment Explainability (BAE) evaluates how well model explanations reflect annotator behavior and decision relevance by aligning explainability-derived similarity structures with ground-truth labeling similarity structures via Multidimensional Scaling (MDS). Extensive experiments validate the effectiveness of our proposed evaluation framework.

[336] xRFM: Accurate, scalable, and interpretable feature learning models for tabular data

Daniel Beaglehole, David Holzmüller, Adityanarayanan Radhakrishnan, Mikhail Belkin

Main category: cs.LG

TL;DR: xRFM combines feature learning kernel machines with tree structures to outperform 31 methods, including GBDTs and TabPFNv2, in regression and classification tasks.

DetailsMotivation: Despite advancements in AI, tabular data inference still relies heavily on GBDTs. The paper aims to innovate by integrating neural networks and feature learning methods for better performance.

Method: xRFM algorithm merges feature learning kernel machines with a tree structure, adapting to local data structure and scaling to large datasets.

Result: xRFM outperforms 31 methods in regression (100 datasets) and competes well in classification (200 datasets), surpassing GBDTs. It also offers interpretability via the Average Gradient Outer Product.

Conclusion: xRFM presents a scalable, interpretable, and high-performing alternative to traditional GBDTs for tabular data inference.

Abstract: Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, the best practice for these predictive tasks has been relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that compared to $31$ other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves the best performance across $100$ regression datasets and is competitive with the best methods across $200$ classification datasets, outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
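
The Average Gradient Outer Product named above has a simple closed form, $M = \frac{1}{n}\sum_i \nabla f(x_i)\nabla f(x_i)^\top$. Below is a minimal sketch of that quantity for a scalar predictor, using finite differences; the function name and the toy predictor are illustrative, not xRFM's implementation:

```python
import numpy as np

def agop(f, X, eps=1e-5):
    """Estimate the Average Gradient Outer Product (AGOP) of a scalar
    predictor f over samples X via central finite differences."""
    n, d = X.shape
    M = np.zeros((d, d))
    for x in X:
        g = np.zeros(d)
        for j in range(d):
            e = np.zeros(d); e[j] = eps
            g[j] = (f(x + e) - f(x - e)) / (2 * eps)  # df/dx_j
        M += np.outer(g, g)
    return M / n

# Toy usage: a predictor that depends mostly on feature 0.
f = lambda x: np.sin(3 * x[0]) + 0.1 * x[1]
X = np.random.default_rng(0).normal(size=(200, 5))
M = agop(f, X)
print(np.diag(M))  # a large first entry flags feature 0 as influential
```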

[337] A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation

Jiulin Li, Ping Huang, Yexin Li, Shuo Chen, Juewen Hu, Ye Tian

Main category: cs.LG

TL;DR: MAGUS is a modular framework unifying multimodal understanding and generation via decoupled Cognition and Deliberation phases, outperforming state-of-the-art models.

DetailsMotivation: Addressing the challenge of integrating autoregressive LLMs and diffusion models for flexible, scalable any-to-any multimodal capabilities.

Method: Uses a two-phase approach: Cognition (multimodal LLM agents collaborate) and Deliberation (Growth-Aware Search orchestrates reasoning and generation).

Result: Outperforms baselines and GPT-4o on benchmarks like MME, supporting plug-and-play extensibility and semantic alignment.

Conclusion: MAGUS offers a scalable, flexible solution for unified multimodal understanding and generation without joint training.

Abstract: Real-world multimodal applications often require any-to-any capabilities, enabling both understanding and generation across modalities including text, image, audio, and video. However, integrating the strengths of autoregressive language models (LLMs) for reasoning and diffusion models for high-fidelity generation remains challenging. Existing approaches rely on rigid pipelines or tightly coupled architectures, limiting flexibility and scalability. We propose MAGUS (Multi-Agent Guided Unified Multimodal System), a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation. MAGUS enables symbolic multi-agent collaboration within a shared textual workspace. In the Cognition phase, three role-conditioned multimodal LLM agents - Perceiver, Planner, and Reflector - engage in collaborative dialogue to perform structured understanding and planning. The Deliberation phase incorporates a Growth-Aware Search mechanism that orchestrates LLM-based reasoning and diffusion-based generation in a mutually reinforcing manner. MAGUS supports plug-and-play extensibility, scalable any-to-any modality conversion, and semantic alignment - all without the need for joint training. Experiments across multiple benchmarks, including image, video, and audio generation, as well as cross-modal instruction following, demonstrate that MAGUS outperforms strong baselines and state-of-the-art systems. Notably, on the MME benchmark, MAGUS surpasses the powerful closed-source model GPT-4o.

[338] A Personalized Exercise Assistant using Reinforcement Learning (PEARL): Results from a four-arm Randomized-controlled Trial

Amy Armento Lee, Narayan Hegde, Nina Deliu, Emily Rosenzweig, Arun Suggala, Sriram Lakshminarasimhan, Qian He, John Hernandez, Martin Seneviratne, Rahul Singh, Pradnesh Kalkar, Karthikeyan Shanmugam, Aravindan Raghuveer, Abhimanyu Singh, My Nguyen, James Taylor, Jatin Alla, Sofia S. Villar, Hulya Emir-Farinas

Main category: cs.LG

TL;DR: The PEARL study tested a reinforcement learning (RL) algorithm to personalize physical activity (PA) nudges via a Fitbit app, showing significant PA increases in the RL group compared to control, random, and fixed groups.

DetailsMotivation: Addressing global physical inactivity through scalable, personalized mHealth interventions, overcoming methodological challenges in integrating behavioral science.

Method: A large-scale RCT with 13,463 Fitbit users randomized into four arms (control, random, fixed, RL) to test personalized nudges via an RL algorithm.

Result: The RL group showed significant PA increases at 1 and 2 months compared to other groups, with sustained effects.

Conclusion: Behaviorally-informed RL can effectively personalize digital health interventions for scalable PA promotion.

Abstract: Consistent physical inactivity poses a major global health challenge. Mobile health (mHealth) interventions, particularly Just-in-Time Adaptive Interventions (JITAIs), offer a promising avenue for scalable, personalized physical activity (PA) promotion. However, developing and evaluating such interventions at scale, while integrating robust behavioral science, presents methodological hurdles. The PEARL study was the first large-scale, four-arm randomized controlled trial to assess a reinforcement learning (RL) algorithm, informed by health behavior change theory, to personalize the content and timing of PA nudges via a Fitbit app. We enrolled and randomized 13,463 Fitbit users into four study arms: control, random, fixed, and RL. The control arm received no nudges. The other three arms received nudges from a bank of 155 nudges based on behavioral science principles. The random arm received nudges selected at random. The fixed arm received nudges based on a pre-set logic from survey responses about PA barriers. The RL group received nudges selected by an adaptive RL algorithm. We included 7,711 participants in primary analyses (mean age 42.1, 86.3% female, baseline steps 5,618.2). We observed an increase in PA for the RL group compared to all other groups from baseline to 1 and 2 months. The RL group had significantly increased average daily step count at 1 month compared to all other groups: control (+296 steps, p=0.0002), random (+218 steps, p=0.005), and fixed (+238 steps, p=0.002). At 2 months, the RL group sustained a significant increase compared to the control group (+210 steps, p=0.0122). Generalized estimating equation models also revealed a sustained increase in daily steps in the RL group vs. control (+208 steps, p=0.002). These findings demonstrate the potential of a scalable, behaviorally-informed RL approach to personalize digital health interventions for PA.

[339] Measuring Time Series Forecast Stability for Demand Planning

Steven Klee, Yuntian Xia

Main category: cs.LG

TL;DR: The paper highlights the importance of forecast stability alongside accuracy in time series forecasting for supply chains, demonstrating that ensemble models improve stability without compromising accuracy.

DetailsMotivation: Demand planners prioritize consistency and stability over minor accuracy gains, as unstable forecasts require excessive human intervention and erode trust in ML models.

Method: A case study evaluates stability and accuracy of state-of-the-art forecasting models (e.g., Chronos, DeepAR) on M5 and Favorita datasets, focusing on model-induced stochasticity.

Result: Ensemble models enhance stability without significantly reducing (or even improving) forecast accuracy.

Conclusion: The paper advocates for further research on forecast stability for production-deployed models, emphasizing its practical importance.

Abstract: Time series forecasting is a critical first step in generating demand plans for supply chains. Experiments on time series models typically focus on demonstrating improvements in forecast accuracy over existing/baseline solutions, quantified according to some accuracy metric. There is no doubt that forecast accuracy is important; however in production systems, demand planners often value consistency and stability over incremental accuracy improvements. Assuming that the inputs have not changed significantly, forecasts that vary drastically from one planning cycle to the next require high amounts of human intervention, which frustrates demand planners and can even cause them to lose trust in ML forecasting models. We study model-induced stochasticity, which quantifies the variance of a set of forecasts produced by a single model when the set of inputs is fixed. Models with lower variance are more stable. Recently the forecasting community has seen significant advances in forecast accuracy through the development of deep machine learning models for time series forecasting. We perform a case study measuring the stability and accuracy of state-of-the-art forecasting models (Chronos, DeepAR, PatchTST, Temporal Fusion Transformer, TiDE, and the AutoGluon best quality ensemble) on public data sets from the M5 competition and Favorita grocery sales. We show that ensemble models improve stability without significantly deteriorating (or even improving) forecast accuracy. While these results may not be surprising, the main point of this paper is to propose the need for further study of forecast stability for models that are being deployed in production systems.
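
The abstract does not give the paper's exact stability metric, so the sketch below shows one plausible formalization of model-induced stochasticity: the variance of forecasts produced by repeated fits of the same model on identical inputs, varying only the seed. The toy forecaster is a stand-in, not one of the models studied:

```python
import numpy as np

def model_induced_stochasticity(fit_predict, y_train, horizon, n_runs=10):
    """Variance across repeated fits of the same model on identical
    inputs (only the random seed changes), averaged over the horizon."""
    forecasts = np.stack([
        fit_predict(y_train, horizon, seed=s) for s in range(n_runs)
    ])                                   # shape: (n_runs, horizon)
    return forecasts.var(axis=0).mean()  # lower is more stable

# Toy "model": a noisy mean forecaster standing in for a deep forecaster.
def fit_predict(y, h, seed):
    rng = np.random.default_rng(seed)
    return np.full(h, y.mean()) + rng.normal(scale=0.1, size=h)

y = np.random.default_rng(1).normal(loc=5.0, size=100)
print(model_induced_stochasticity(fit_predict, y, horizon=8))
```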

[340] Constrained Decoding of Diffusion LLMs with Context-Free Grammars

Niels Mündler, Jasper Dekoninck, Martin Vechev

Main category: cs.LG

TL;DR: The paper introduces the first constrained decoding method for diffusion LLMs to ensure syntactic correctness in formal languages like C++ or JSON.

DetailsMotivation: Existing constrained decoding methods don't work for diffusion LLMs, which are increasingly used for tasks requiring formal language adherence.

Method: The method reduces constrained decoding to additive infilling, then to checking language intersection emptiness, with an efficient algorithm for context-free languages.

Result: Empirical tests show near-perfect syntactic correctness and preserved/improved functional correctness in tasks like C++ code infilling and JSON extraction.

Conclusion: The proposed method effectively ensures syntactic correctness for diffusion LLMs with practical computational overhead.

Abstract: Large language models (LLMs) have shown promising performance across diverse domains. Many practical applications of LLMs, such as code completion and structured data extraction, require adherence to syntactic constraints specified by a formal language. Yet, due to their probabilistic nature, LLM output is not guaranteed to adhere to such formal languages. Prior work has proposed constrained decoding as a means to restrict LLM generation to particular formal languages. However, existing works are not applicable to the emerging paradigm of diffusion LLMs, when used in practical scenarios such as the generation of formally correct C++ or JSON output. In this paper we address this challenge and present the first constrained decoding method for diffusion models, one that can handle formal languages captured by context-free grammars. We begin by reducing constrained decoding to the more general additive infilling problem, which asks whether a partial output can be completed to a valid word in the target language. This problem also naturally subsumes the previously unaddressed multi-region infilling constrained decoding. We then reduce this problem to the task of deciding whether the intersection of the target language and a regular language is empty and present an efficient algorithm to solve it for context-free languages. Empirical results on various applications, such as C++ code infilling and structured data extraction in JSON, demonstrate that our method achieves near-perfect syntactic correctness while consistently preserving or improving functional correctness. Importantly, our efficiency optimizations ensure that the computational overhead remains practical.
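
A rough sketch of the token-filtering view of constrained decoding described above: at each position, keep only tokens that leave the partial output completable to a word in the target language (the additive-infilling check). The completability oracle here is a brute-force toy for balanced brackets; the paper's actual contribution is deciding emptiness of the CFL-regular intersection efficiently:

```python
import itertools

def constrained_fill(partial, position, candidates, can_complete):
    """Keep only candidate tokens for `position` that leave the partial
    output (a list with None for free positions) completable."""
    allowed = []
    for tok in candidates:
        trial = list(partial)
        trial[position] = tok
        if can_complete(trial):  # stands in for the intersection-emptiness check
            allowed.append(tok)
    return allowed

def toy_can_complete(pattern, alphabet="()"):
    """Brute-force oracle for the balanced-bracket language."""
    holes = [i for i, t in enumerate(pattern) if t is None]
    for fill in itertools.product(alphabet, repeat=len(holes)):
        word = list(pattern)
        for i, t in zip(holes, fill):
            word[i] = t
        depth = 0
        for c in word:
            depth += 1 if c == "(" else -1
            if depth < 0:
                break
        else:
            if depth == 0:
                return True
    return False

# Only "(" keeps [_, _, _, ")"] completable to a balanced word.
print(constrained_fill([None, None, None, ")"], 0, ["(", ")"], toy_can_complete))
```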

[341] Less is More: Learning Graph Tasks with Just LLMs

Sola Shirai, Kavitha Srinivas, Julian Dolby, Michael Katz, Horst Samulowitz, Shirin Sohrabi

Main category: cs.LG

TL;DR: Small LLMs can solve graph tasks using instructive chain-of-thought training, generalizing to new tasks and structures without specialized encoders.

DetailsMotivation: To determine if LLMs can solve fundamental graph tasks without specialized models and generalize to unseen structures or tasks.

Method: Training small LLMs with instructive chain-of-thought solutions for graph tasks.

Result: LLMs can learn and generalize graph solutions without specialized encoders.

Conclusion: Instructive chain-of-thought training enables LLMs to handle graph tasks effectively.

Abstract: For large language models (LLMs), reasoning over graphs could help solve many problems. Prior work has tried to improve LLM graph reasoning by examining how best to serialize graphs as text and by combining GNNs and LLMs. However, the merits of such approaches remain unclear, so we empirically answer the following research questions: (1) Can LLMs learn to solve fundamental graph tasks without specialized graph encoding models?, (2) Can LLMs generalize learned solutions to unseen graph structures or tasks?, and (3) What are the merits of competing approaches to learn graph tasks? We show that even small LLMs can learn to solve graph tasks by training them with instructive chain-of-thought solutions, and this training generalizes, without specialized graph encoders, to new tasks and graph structures.
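
A small illustration of what an instructive chain-of-thought training example for a graph task might look like; the prompt format and the reachability task are assumptions for illustration, not the paper's exact templates:

```python
from collections import deque

def cot_example(edges, source, target):
    """Build one chain-of-thought training example for a reachability
    task: serialized edge list, narrated BFS, then the answer."""
    graph = "Edges: " + ", ".join(f"{u}-{v}" for u, v in edges)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    steps, seen, queue = [], {source}, deque([source])
    while queue:
        node = queue.popleft()
        steps.append(f"Visit {node}; neighbors {adj.get(node, [])}.")
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    answer = "yes" if target in seen else "no"
    return (f"{graph}\nQuestion: is node {target} reachable from node {source}?\n"
            "Reasoning: " + " ".join(steps) + f"\nAnswer: {answer}")

print(cot_example([(0, 1), (1, 2), (3, 4)], source=0, target=2))
```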

[342] From Intent to Execution: Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation

Ke Niu, Haiyang Yu, Zhuofan Chen, Mengyang Zhao, Teng Fu, Bin Li, Xiangyang Xue

Main category: cs.LG

TL;DR: CAD-RL, a multimodal Chain-of-Thought guided reinforcement learning framework, automates CAD code generation with improved reasoning, precision, and executability.

DetailsMotivation: Current CAD workflows are manual and expertise-heavy; LLMs offer potential but face challenges in translating design intent into precise, executable code.

Method: Combines CoT-based Cold Start with reinforcement learning, using three rewards (executability, geometric accuracy, external evaluation) and optimization strategies (Trust Region Stretch, Precision Token Loss, Overlong Filtering).

Result: CAD-RL outperforms existing VLMs in reasoning, precision, and executability, validated on the ExeCAD dataset.

Conclusion: CAD-RL advances automated CAD modeling by addressing key challenges in code generation, supported by a novel dataset and targeted optimizations.

Abstract: Computer-Aided Design (CAD) plays a vital role in engineering and manufacturing, yet current CAD workflows require extensive domain expertise and manual modeling effort. Recent advances in large language models (LLMs) have made it possible to generate code from natural language, opening new opportunities for automating parametric 3D modeling. However, directly translating human design intent into executable CAD code remains highly challenging, due to the need for logical reasoning, syntactic correctness, and numerical precision. In this work, we propose CAD-RL, a multimodal Chain-of-Thought (CoT) guided reinforcement learning post-training framework for CAD modeling code generation. Our method combines CoT-based Cold Start with goal-driven reinforcement learning post-training using three task-specific rewards: executability reward, geometric accuracy reward, and external evaluation reward. To ensure stable policy learning under sparse and high-variance reward conditions, we introduce three targeted optimization strategies: Trust Region Stretch for improved exploration, Precision Token Loss for enhanced dimensional parameter accuracy, and Overlong Filtering to reduce noisy supervision. To support training and benchmarking, we release ExeCAD, a novel dataset comprising 16,540 real-world CAD examples with paired natural language and structured design language descriptions, executable CADQuery scripts, and rendered 3D models. Experiments demonstrate that CAD-RL achieves significant improvements in reasoning quality, output precision, and code executability over existing VLMs.

[343] SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning

Weijian Mai, Jiamin Wu, Yu Zhu, Zhouheng Yao, Dongzhan Zhou, Andrew F. Luo, Qihao Zheng, Wanli Ouyang, Chunfeng Song

Main category: cs.LG

TL;DR: SynBrain is a generative framework for probabilistic modeling of visual-to-neural mapping, outperforming existing methods in fMRI synthesis and adaptability.

DetailsMotivation: Existing deterministic methods fail to model biological variability and functional consistency in visual-to-neural mapping.

Method: SynBrain uses BrainVAE for probabilistic neural representations and a Semantic-to-Neural Mapper for high-fidelity fMRI synthesis.

Result: SynBrain excels in subject-specific encoding, few-shot adaptation, and improves fMRI-to-image decoding.

Conclusion: SynBrain captures interpretable neural variability and functional consistency, with code to be released.

Abstract: Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. The code will be made publicly available.

[344] Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts

Maxime Heuillet, Yufei Cui, Boxing Chen, Audrey Durand, Prasanna Parthasarathi

Main category: cs.LG

TL;DR: Nested-ReFT introduces a computationally efficient ReFT framework by using a subset of model layers as a behavior model, reducing inference costs while maintaining performance.

DetailsMotivation: Standard ReFT frameworks are computationally expensive due to multiple inference steps during training.

Method: Nested-ReFT leverages off-policy RL and speculative decoding, using dynamic layer skipping to reduce inference costs.

Result: Theoretical analysis confirms unbiased gradient estimates; empirical results show improved efficiency (tokens/sec) on math reasoning benchmarks.

Conclusion: Nested-ReFT balances computational efficiency and performance, with bias mitigation variants matching baseline ReFT results.

Abstract: Advanced reasoning in LLMs on challenging domains like mathematical reasoning can be tackled using verifiable rewards based reinforced fine-tuning (ReFT). In standard ReFT frameworks, a behavior model generates multiple completions with answers per problem, for the answer to be then scored by a reward function. While such RL post-training methods demonstrate significant performance improvements across challenging reasoning domains, the computational cost of generating completions during training with multiple inference steps makes the training cost non-trivial. To address this, we draw inspiration from off-policy RL, and speculative decoding to introduce a novel ReFT framework, dubbed Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. The behavior model configured with dynamic layer skipping per batch during training decreases the inference cost compared to the standard ReFT frameworks. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency measured as tokens/sec across multiple math reasoning benchmarks and model sizes. Additionally, we explore three variants of bias mitigation to minimize the off-policyness in the gradient updates that allows for maintaining performance that matches the baseline ReFT performance.
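
A toy sketch of the dynamic layer skipping idea: per batch, a random subset of the target model's layers stands in as the cheaper off-policy behavior model. Real transformer layers are replaced here by residual updates on a scalar, and the skipping policy is an assumption, not the paper's exact schedule:

```python
import random

def run_layers(x, layers, keep_mask):
    """Forward pass that applies only the retained layers."""
    for layer, keep in zip(layers, keep_mask):
        if keep:
            x = layer(x)
    return x

def behavior_forward(x, layers, skip_rate=0.5, seed=None):
    """Per batch, sample a layer subset to act as the behavior model."""
    rng = random.Random(seed)
    keep_mask = [rng.random() > skip_rate for _ in layers]
    return run_layers(x, layers, keep_mask)

# Toy "transformer": each layer is a residual update on a float.
layers = [lambda x, i=i: x + 0.1 * (i + 1) for i in range(8)]
full = run_layers(1.0, layers, [True] * 8)     # target-model output
cheap = behavior_forward(1.0, layers, seed=0)  # cheaper behavior-model output
print(full, cheap)
```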

[345] Competitive Algorithms for Multi-Agent Ski-Rental Problems

Xuchuang Wang, Bo Sun, Hedyeh Beyhaghi, John C. S. Lui, Mohammad Hajiesmaili, Adam Wierman

Main category: cs.LG

TL;DR: The paper generalizes the ski-rental problem to a multi-agent setting with individual and shared costs, introducing dynamic states and three competitive ratios. It designs optimal deterministic and randomized policies, showing symmetric policies outperform asymmetric ones.

DetailsMotivation: To extend the classical ski-rental dilemma to a group context, addressing individual and shared costs in dynamic settings where agents' active days vary.

Method: Defines three competitive ratios (overall, state-dependent, individual rational) and designs deterministic (state-aware threshold functions) and randomized (sampling from tailored distributions) policies.

Result: Symmetric policies outperform asymmetric ones, with competitive ratio bounds provided, extending classical insights to multi-agent scenarios.

Conclusion: The study advances group decision-making under uncertainty, offering theoretical and practical insights for multi-agent ski-rental problems.

Abstract: This paper introduces a novel multi-agent ski-rental problem that generalizes the classical ski-rental dilemma to a group setting where agents incur individual and shared costs. In our model, each agent can either rent at a fixed daily cost, or purchase a pass at an individual cost, with an additional third option of a discounted group pass available to all. We consider scenarios in which agents’ active days differ, leading to dynamic states as agents drop out of the decision process. To address this problem from different perspectives, we define three distinct competitive ratios: overall, state-dependent, and individual rational. For each objective, we design and analyze optimal deterministic and randomized policies. Our deterministic policies employ state-aware threshold functions that adapt to the dynamic states, while our randomized policies sample and resample thresholds from tailored state-aware distributions. The analysis reveals that symmetric policies, in which all agents use the same threshold, outperform asymmetric ones. Our results provide competitive ratio upper and lower bounds and extend classical ski-rental insights to multi-agent settings, highlighting both theoretical and practical implications for group decision-making under uncertainty.
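
For context, the classical single-agent problem that the paper generalizes admits a simple deterministic threshold policy: rent until cumulative rent would reach the purchase price, then buy, which is 2-competitive. A minimal sketch follows; the paper's multi-agent, state-aware thresholds build on this idea:

```python
def ski_rental(active_days, rent=1, buy=10):
    """Break-even policy: rent while cumulative rent stays below the
    purchase price, buy on the break-even day. 2-competitive."""
    cost = 0
    for day in range(1, active_days + 1):
        if rent * day >= buy:   # threshold reached: buy instead of renting
            return cost + buy
        cost += rent
    return cost

opt = lambda days, rent=1, buy=10: min(rent * days, buy)  # offline optimum
for days in (3, 10, 25):
    alg = ski_rental(days)
    print(days, alg, alg / opt(days))  # ratio never exceeds 2 - rent/buy
```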

[346] rETF-semiSL: Semi-Supervised Learning for Neural Collapse in Temporal Data

Yuhan Xie, William Cappelletti, Mahsa Shoaran, Pascal Frossard

Main category: cs.LG

TL;DR: A novel semi-supervised pre-training strategy for time series models aligns with Neural Collapse theory, outperforming traditional methods on classification tasks.

DetailsMotivation: Current pretext tasks for pre-training lack theoretical grounding and transferability to downstream classification, prompting a need for a more effective approach.

Method: Proposes a semi-supervised strategy using rotational equiangular tight frame-classifiers, pseudo-labeling, and generative pretext tasks with sequential augmentation.

Result: Outperforms previous methods on LSTMs, transformers, and state-space models across three datasets, demonstrating improved embedding separability.

Conclusion: Aligning pre-training objectives with embedding geometry theory enhances performance in time series classification.

Abstract: Deep neural networks for time series must capture complex temporal patterns, to effectively represent dynamic data. Self- and semi-supervised learning methods show promising results in pre-training large models, which – when finetuned for classification – often outperform their counterparts trained from scratch. Still, the choice of pretext training tasks is often heuristic and their transferability to downstream classification is not guaranteed; thus, we propose a novel semi-supervised pre-training strategy to enforce latent representations that satisfy the Neural Collapse phenomenon observed in optimally trained neural classifiers. We use a rotational equiangular tight frame-classifier and pseudo-labeling to pre-train deep encoders with few labeled samples. Furthermore, to effectively capture temporal dynamics while enforcing embedding separability, we integrate generative pretext tasks with our method, and we define a novel sequential augmentation strategy. We show that our method significantly outperforms previous pretext tasks when applied to LSTMs, transformers, and state-space models on three multivariate time series classification datasets. These results highlight the benefit of aligning pre-training objectives with theoretically grounded embedding geometry.
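
The equiangular tight frame geometry referenced above can be constructed in closed form: $C$ unit-norm class prototypes with pairwise cosine $-1/(C-1)$, the configuration neural collapse converges to. A minimal sketch of a fixed simplex-ETF classifier; the paper's rotational variant adds a rotation, which is not shown here:

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Fixed simplex-ETF classifier weights: unit-norm class vectors
    with pairwise inner product -1/(C-1)."""
    C = num_classes
    assert dim >= C, "need dim >= num_classes"
    rng = np.random.default_rng(seed)
    # Random partial orthogonal map U (dim x C), with U.T @ U = I_C.
    U, _ = np.linalg.qr(rng.normal(size=(dim, C)))
    M = np.sqrt(C / (C - 1)) * U @ (np.eye(C) - np.ones((C, C)) / C)
    return M  # columns are the frozen class prototypes

W = simplex_etf(num_classes=4, dim=16)
G = W.T @ W
print(np.round(G, 3))  # 1 on the diagonal, -1/3 off-diagonal for C=4
```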

[347] Out-of-Distribution Detection using Counterfactual Distance

Maria Stoica, Francesco Leofante, Alessio Lomuscio

Main category: cs.LG

TL;DR: A post-hoc OOD detection method using counterfactual explanations to calculate decision boundary distances, improving scalability and interpretability.

DetailsMotivation: Accurate and explainable OOD detection is needed for safe machine learning systems.

Method: Proposes a post-hoc OOD detection method leveraging counterfactual explanations for decision boundary distance calculation, with strategies to enhance scalability by computing counterfactuals in embedding space.

Result: Achieves state-of-the-art performance: 93.50% AUROC and 25.80% FPR95 on CIFAR-10, 97.05% AUROC and 13.79% FPR95 on CIFAR-100, and 92.55% AUROC and 33.55% FPR95 on ImageNet-200.

Conclusion: The method effectively detects OOD data while providing interpretable results through counterfactual explanations.

Abstract: Accurate and explainable out-of-distribution (OOD) detection is required to use machine learning systems safely. Previous work has shown that feature distance to decision boundaries can be used to identify OOD data effectively. In this paper, we build on this intuition and propose a post-hoc OOD detection method that, given an input, calculates the distance to decision boundaries by leveraging counterfactual explanations. Since computing explanations can be expensive for large architectures, we also propose strategies to improve scalability by computing counterfactuals directly in embedding space. Crucially, as the method employs counterfactual explanations, we can seamlessly use them to help interpret the results of our detector. We show that our method is in line with the state of the art on CIFAR-10, achieving 93.50% AUROC and 25.80% FPR95. It outperforms state-of-the-art methods on CIFAR-100, with 97.05% AUROC and 13.79% FPR95, and on ImageNet-200, with 92.55% AUROC and 33.55% FPR95, across four OOD datasets.
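
To make the distance-to-boundary intuition concrete, here is a sketch for a linear scorer, where the nearest counterfactual (the smallest perturbation that flips the decision) lies on the decision hyperplane and its distance has a closed form. For deep models the paper instead searches for counterfactual explanations, optionally in embedding space; the thresholding shown is an assumption:

```python
import numpy as np

def counterfactual_distance(x, w, b):
    """For a linear scorer f(x) = w.x + b, the closest counterfactual
    sits on the boundary, so the score is the point-to-hyperplane
    distance |w.x + b| / ||w||."""
    return abs(w @ x + b) / np.linalg.norm(w)

def ood_flag(x, w, b, threshold):
    # Inputs unusually close to the boundary get flagged; the threshold
    # would be calibrated on in-distribution data.
    return counterfactual_distance(x, w, b) < threshold

rng = np.random.default_rng(0)
w, b = rng.normal(size=5), 0.2
x_far, x_near = np.ones(5), np.full(5, 0.01)
print(counterfactual_distance(x_far, w, b), counterfactual_distance(x_near, w, b))
```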

[348] Characterizing Evolution in Expectation-Maximization Estimates for Overspecified Mixed Linear Regression

Zhankun Luo, Abolfazl Hashemi

Main category: cs.LG

TL;DR: The paper analyzes the EM algorithm’s behavior in overspecified two-component Mixed Linear Regression (2MLR), showing convergence rates vary with initial mixing weight guesses and linking population-level and finite-sample results.

DetailsMotivation: To understand how the EM algorithm performs under model misspecification, particularly in overspecified 2MLR, and to bridge theoretical insights between population and finite-sample levels.

Method: Theoretical analysis of the EM algorithm for 2MLR, focusing on convergence rates for regression parameters and mixing weights under different initial conditions.

Result: Linear convergence for unbalanced initial weights, sublinear for balanced ones; finite-sample accuracy varies with mixing weight balance.

Conclusion: The study connects population and finite-sample results, providing insights into EM behavior in overspecified models and low SNR regimes.

Abstract: Mixture models have attracted significant attention due to practical effectiveness and comprehensive theoretical foundations. A persisting challenge is model misspecification, which occurs when the model to be fitted has more mixture components than those in the data distribution. In this paper, we develop a theoretical understanding of the Expectation-Maximization (EM) algorithm’s behavior in the context of targeted model misspecification for overspecified two-component Mixed Linear Regression (2MLR) with unknown $d$-dimensional regression parameters and mixing weights. In Theorem 5.1 at the population level, with an unbalanced initial guess for mixing weights, we establish linear convergence of regression parameters in $O(\log(1/\epsilon))$ steps. Conversely, with a balanced initial guess for mixing weights, we observe sublinear convergence in $O(\epsilon^{-2})$ steps to achieve the $\epsilon$-accuracy at Euclidean distance. In Theorem 6.1 at the finite-sample level, for mixtures with sufficiently unbalanced fixed mixing weights, we demonstrate a statistical accuracy of $O((d/n)^{1/2})$, whereas for those with sufficiently balanced fixed mixing weights, the accuracy is $O((d/n)^{1/4})$ given $n$ data samples. Furthermore, we underscore the connection between our population level and finite-sample level results: by setting the desired final accuracy $\epsilon$ in Theorem 5.1 to match that in Theorem 6.1 at the finite-sample level, namely letting $\epsilon = O((d/n)^{1/2})$ for sufficiently unbalanced fixed mixing weights and $\epsilon = O((d/n)^{1/4})$ for sufficiently balanced fixed mixing weights, we intuitively derive iteration complexity bounds $O(\log (1/\epsilon))=O(\log (n/d))$ and $O(\epsilon^{-2})=O((n/d)^{1/2})$ at the finite-sample level for sufficiently unbalanced and balanced initial mixing weights. We further extend our analysis in overspecified setting to low SNR regime.
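
For readers who want the algorithm being analyzed, a compact sketch of EM for two-component mixed linear regression follows: soft assignments in the E-step, weighted least squares and mixing-weight updates in the M-step. The noise scale, initialization, and synthetic data are illustrative choices, not the paper's exact setup:

```python
import numpy as np

def em_2mlr(X, y, iters=50, sigma=0.5, pi=0.3, seed=0):
    """EM for 2-component mixed linear regression. `pi` sets the
    (possibly unbalanced) initial mixing weight the paper analyzes."""
    rng = np.random.default_rng(seed)
    betas = rng.normal(size=(2, X.shape[1]))
    w = np.array([pi, 1 - pi])
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resid = y[:, None] - X @ betas.T               # (n, 2)
        log_p = -0.5 * (resid / sigma) ** 2 + np.log(w)
        log_p -= log_p.max(axis=1, keepdims=True)
        R = np.exp(log_p)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per component, update weights.
        for k in range(2):
            Wk = R[:, k]
            betas[k] = np.linalg.solve((X.T * Wk) @ X, X.T @ (Wk * y))
        w = R.mean(axis=0)
    return betas, w

# Synthetic 2MLR data with true components +beta and -beta.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
beta = np.array([1.0, -2.0, 0.5])
z = rng.random(2000) < 0.5
y = np.where(z, X @ beta, -(X @ beta)) + 0.5 * rng.normal(size=2000)
print(em_2mlr(X, y)[0])  # recovered regression parameters, up to sign/order
```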

[349] Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1

Petr Spelda, Vit Stritecky

Main category: cs.LG

TL;DR: The paper discusses how reasoning language models improve performance through benchmark-driven learning rather than just model size or algorithmic enhancements, using DeepSeek-R1 and a decision-making problem as examples.

DetailsMotivation: To understand how reasoning models generalize better through intermediate reasoning steps and the role of benchmarks in their learning process.

Method: Analyzes the impact of benchmarks as curricula for learning, demonstrated via DeepSeek-R1 and a sequential decision-making task.

Result: Shows that benchmark-driven learning enhances model performance and generalization, making test task novelty crucial.

Conclusion: Benchmarks can serve as training curricula, not just evaluation tools, highlighting the importance of novel tasks for measuring reasoning model capabilities.

Abstract: Evaluation of reasoning language models gained importance after it was observed that they can combine their existing capabilities into novel traces of intermediate steps before task completion and that the traces can sometimes help them to generalize better than past models. As reasoning becomes the next scaling dimension of large language models, careful study of their capabilities in critical tasks is needed. We show that better performance is not always caused by test-time algorithmic improvements or model size; it can also result from using impactful benchmarks as curricula for learning. We call this benchmark-driven selection of AI and show its effects on DeepSeek-R1 using our sequential decision-making problem from Humanity’s Last Exam. Steering development of AI by impactful benchmarks trades evaluation for learning and makes novelty of test tasks key for measuring generalization capabilities of reasoning models. Consequently, some benchmarks could be seen as curricula for training rather than unseen test sets.

[350] An Explainable AI based approach for Monitoring Animal Health

Rahul Janaa, Shubham Dixit, Mrityunjay Sharma, Ritesh Kumar

Main category: cs.LG

TL;DR: The paper presents a data-driven approach using explainable ML to monitor dairy cattle health and behavior, leveraging IoT sensors and robust ML methods for actionable insights.

DetailsMotivation: Addressing the challenge of tracking cattle health and optimizing yield for dairy farmers by providing transparent, data-driven solutions.

Method: Utilizes Bluetooth-based IoT devices and 4G networks for data collection, pre-processes accelerometer data, and employs hyperparameter-optimized ML models (e.g., k-nearest neighbor) for activity classification.

Result: The k-nearest neighbor classifier achieved high performance (AUC ~0.99) and explainability frameworks like SHAP provided feature importance insights.

Conclusion: The study demonstrates practical, explainable ML models for sustainable livestock management, aiding farmers in decision-making.

Abstract: Monitoring cattle health and optimizing yield are key challenges faced by dairy farmers due to difficulties in tracking all animals on the farm. This work aims to showcase modern data-driven farming practices based on explainable machine learning (ML) methods that explain the activity and behaviour of dairy cattle (cows). Continuous data collection from 3-axis accelerometer sensors, combined with robust ML methodologies and algorithms, provides farmers and researchers with actionable information on cattle activity, allowing farmers to make informed decisions and incorporate sustainable practices. This study utilizes Bluetooth-based Internet of Things (IoT) devices and 4G networks for seamless data transmission, immediate analysis, and inference generation, and explains the models' performance with explainability frameworks. Special emphasis is put on the pre-processing of the accelerometer time-series data, including the extraction of statistical characteristics, signal-processing techniques, and lag-based features using the sliding-window technique. Various hyperparameter-optimized ML models are evaluated across varying window lengths for activity classification. The k-nearest neighbour classifier achieved the best performance, with a mean AUC of 0.98 (standard deviation 0.0026) on the training set and 0.99 on the test set. To ensure transparency, explainable-AI frameworks such as SHAP are used to interpret feature importance in a form that practitioners can understand and use. A detailed comparison of the important features, along with a stability analysis of the selected features, supports the development of explainable and practical ML models for sustainable livestock management.
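
A small sketch of the sliding-window feature extraction stage described above, computing per-window statistical features from 3-axis accelerometer data; the specific features, window size, and step are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def window_features(acc, window=50, step=25):
    """Statistical features over sliding windows of 3-axis accelerometer
    data of shape (n_samples, 3)."""
    feats = []
    for start in range(0, len(acc) - window + 1, step):
        w = acc[start:start + window]
        feats.append(np.concatenate([
            w.mean(axis=0), w.std(axis=0),            # per-axis mean / spread
            np.abs(np.diff(w, axis=0)).mean(axis=0),  # jerkiness proxy
            [np.linalg.norm(w, axis=1).mean()],       # overall magnitude
        ]))
    return np.array(feats)

acc = np.random.default_rng(0).normal(size=(500, 3))  # stand-in signal
print(window_features(acc).shape)  # (n_windows, 10), ready for a classifier
```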

[351] AI-Driven Detection and Analysis of Handwriting on Seized Ivory: A Tool to Uncover Criminal Networks in the Illicit Wildlife Trade

Will Fein, Ryan J. Horwitz, John E. Brown III, Amit Misra, Felipe Oviedo, Kevin White, Juan M. Lavista Ferres, Samuel K. Wasser

Main category: cs.LG

TL;DR: An AI-driven pipeline analyzes handwritten markings on seized elephant tusks to provide low-cost forensic evidence, linking ivory shipments and traffickers.

DetailsMotivation: The transnational ivory trade drives elephant decline, but current forensic methods (like DNA) are costly or unavailable. Handwritten markings on tusks are underutilized.

Method: AI tools extract and analyze 17,000+ markings from 6,085 photos of seized tusks, identifying 184 recurring ‘signature markings’ to link shipments.

Result: 20 signature markings connected multiple seizures, revealing traffickers’ involvement in multiple shipments.

Conclusion: AI-based handwriting analysis offers scalable, low-cost forensic evidence, enhancing efforts to disrupt wildlife crime.

Abstract: The transnational ivory trade continues to drive the decline of elephant populations across Africa, and trafficking networks remain difficult to disrupt. Tusks seized by law enforcement officials carry forensic information on the traffickers responsible for their export, including DNA evidence and handwritten markings made by traffickers. For 20 years, analyses of tusk DNA have identified where elephants were poached and established connections among shipments of ivory. While the links established using genetic evidence are extremely conclusive, genetic data is expensive and sometimes impossible to obtain. But though handwritten markings are easy to photograph, they are rarely documented or analyzed. Here, we present an AI-driven pipeline for extracting and analyzing handwritten markings on seized elephant tusks, offering a novel, scalable, and low-cost source of forensic evidence. Having collected 6,085 photographs from eight large seizures of ivory over a 6-year period (2014-2019), we used an object detection model to extract over 17,000 individual markings, which were then labeled and described using state-of-the-art AI tools. We identified 184 recurring “signature markings” that connect the tusks on which they appear. 20 signature markings were observed in multiple seizures, establishing forensic links between these seizures through traffickers involved in both shipments. This work complements other investigative techniques by filling in gaps where other data sources are unavailable. The study demonstrates the transformative potential of AI in wildlife forensics and highlights practical steps for integrating handwriting analysis into efforts to disrupt organized wildlife crime.

[352] Comparison of D-Wave Quantum Annealing and Markov Chain Monte Carlo for Sampling from a Probability Distribution of a Restricted Boltzmann Machine

Abdelmoula El Yazizi, Samee U. Khan, Yaroslav Koshka

Main category: cs.LG

TL;DR: The study compares D-Wave quantum annealer and Gibbs sampling for RBM training, finding limited overlap and potential for combined approaches.

DetailsMotivation: To assess the quality of sampling from RBMs using a local-valley approach and compare D-Wave quantum annealing with classical Gibbs sampling.

Method: Applied a local-valley-centered approach to compare D-Wave and Gibbs samples from a classically trained RBM under contrastive-divergence-based learning conditions.

Result: D-Wave samples covered more local valleys but overlapped less with Gibbs samples, especially for high-probability states. Some important minima were unique to each method.

Conclusion: The results explain prior failures of D-Wave-based sampling but suggest potential for improvement via hybrid classical-quantum approaches.

Abstract: A local-valley (LV) centered approach to assessing the quality of sampling from Restricted Boltzmann Machines (RBMs) was applied to the latest generation of the D-Wave quantum annealer. D-Wave and Gibbs samples from a classically trained RBM were obtained at conditions relevant to the contrastive-divergence-based RBM learning. The samples were compared for the number of the LVs to which they belonged and the energy of the corresponding local minima. No significant (desirable) increase in the number of the LVs has been achieved by decreasing the D-Wave annealing time. At any training epoch, the states sampled by the D-Wave belonged to a somewhat higher number of LVs than in the Gibbs sampling. However, many of those LVs found by the two techniques differed. For high-probability sampled states, the two techniques were (unfavorably) less complementary and more overlapping. Nevertheless, many potentially “important” local minima, i.e., those having intermediate, even if not high, probability values, were found by only one of the two sampling techniques while missed by the other. The two techniques overlapped less at later than earlier training epochs, which is precisely the stage of the training when modest improvements to the sampling quality could make meaningful differences for the RBM trainability. The results of this work may explain the failure of previous investigations to achieve substantial (or any) improvement when using D-Wave-based sampling. However, the results reveal some potential for improvement, e.g., using a combined classical-quantum approach.
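
For reference, the classical Gibbs-sampling baseline that the D-Wave samples are compared against alternates block updates of hidden and visible units. A minimal binary-RBM sketch; the weights and layer sizes are toy values:

```python
import numpy as np

def gibbs_sample(v, W, b_v, b_h, steps=1, rng=None):
    """Block Gibbs sampling for a binary RBM: alternate h | v and v | h."""
    rng = rng or np.random.default_rng()
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(steps):
        p_h = sigmoid(v @ W + b_h)            # P(h_j = 1 | v)
        h = (rng.random(p_h.shape) < p_h) * 1
        p_v = sigmoid(h @ W.T + b_v)          # P(v_i = 1 | h)
        v = (rng.random(p_v.shape) < p_v) * 1
    return v, h

rng = np.random.default_rng(0)
n_v, n_h = 16, 8
W = 0.1 * rng.normal(size=(n_v, n_h))
v0 = rng.integers(0, 2, size=n_v)
v, h = gibbs_sample(v0, W, np.zeros(n_v), np.zeros(n_h), steps=100, rng=rng)
print(v, h)
```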

[353] Interpretable Machine Learning Model for Early Prediction of Acute Kidney Injury in Critically Ill Patients with Cirrhosis: A Retrospective Study

Li Sun, Shuheng Chen, Junyi Fan, Yong Si, Minoo Ahmadi, Elham Pishgar, Kamiar Alaei, Maryam Pishgar

Main category: cs.LG

TL;DR: An interpretable machine learning model (LightGBM) was developed to predict early acute kidney injury (AKI) in ICU patients with cirrhosis, achieving high accuracy and actionable insights.

DetailsMotivation: AKI in cirrhosis patients is common and worsens outcomes, but existing predictive tools lack accuracy and interpretability. This study aimed to create a better model for early detection.

Method: Retrospective analysis of MIMIC-IV data, preprocessing, feature selection, and evaluation of six machine learning algorithms using clinical variables from the first 48 ICU hours.

Result: LightGBM performed best (AUROC 0.808, accuracy 0.704), with key predictors like prolonged partial thromboplastin time and low pH. High negative predictive value (0.911) supports clinical utility.

Conclusion: The model accurately predicts AKI risk, aids in clinical decision-making, and is interpretable for clinician trust. External validation and EHR integration are recommended.

Abstract: Background: Cirrhosis is a progressive liver disease with high mortality and frequent complications, notably acute kidney injury (AKI), which occurs in up to 50% of hospitalized patients and worsens outcomes. AKI stems from complex hemodynamic, inflammatory, and metabolic changes, making early detection essential. Many predictive tools lack accuracy, interpretability, and alignment with intensive care unit (ICU) workflows. This study developed an interpretable machine learning model for early AKI prediction in critically ill patients with cirrhosis. Methods: We conducted a retrospective analysis of the MIMIC-IV v2.2 database, identifying 1240 adult ICU patients with cirrhosis and excluding those with ICU stays under 48 hours or missing key data. Laboratory and physiological variables from the first 48 hours were extracted. The pipeline included preprocessing, missingness filtering, LASSO feature selection, and SMOTE class balancing. Six algorithms-LightGBM, CatBoost, XGBoost, logistic regression, naive Bayes, and neural networks-were trained and evaluated using AUROC, accuracy, F1-score, sensitivity, specificity, and predictive values. Results: LightGBM achieved the best performance (AUROC 0.808, 95% CI 0.741-0.856; accuracy 0.704; NPV 0.911). Key predictors included prolonged partial thromboplastin time, absence of outside-facility 20G placement, low pH, and altered pO2, consistent with known cirrhosis-AKI mechanisms and suggesting actionable targets. Conclusion: The LightGBM-based model enables accurate early AKI risk stratification in ICU patients with cirrhosis using routine clinical variables. Its high negative predictive value supports safe de-escalation for low-risk patients, and interpretability fosters clinician trust and targeted prevention. External validation and integration into electronic health record systems are warranted.
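
A hedged sketch of the pipeline the Methods describe (LASSO feature selection, SMOTE class balancing, a gradient-boosted classifier, AUROC evaluation), run on synthetic data. It assumes the lightgbm and imbalanced-learn packages and is not the authors' code or hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier

# Synthetic stand-in for the first-48-hour ICU feature table.
X, y = make_classification(n_samples=1200, n_features=40, weights=[0.8],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# LASSO-based feature selection, as in the described pipeline.
selector = SelectFromModel(Lasso(alpha=0.01)).fit(X_tr, y_tr)
X_tr_s, X_te_s = selector.transform(X_tr), selector.transform(X_te)

# SMOTE balances the minority (AKI) class in the training split only.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr_s, y_tr)

model = LGBMClassifier(n_estimators=300, learning_rate=0.05, random_state=0)
model.fit(X_bal, y_bal)
print("AUROC:", roc_auc_score(y_te, model.predict_proba(X_te_s)[:, 1]))
```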

[354] Can Transformers Break Encryption Schemes via In-Context Learning?

Jathin Korrapati, Patrick Mendoza, Aditya Tomar, Abein Abraham

Main category: cs.LG

TL;DR: The paper explores in-context learning (ICL) for cryptographic function learning, specifically mono-alphabetic substitution and Vigenère ciphers, to evaluate transformer models’ inductive biases and generalization.

DetailsMotivation: To extend ICL beyond simple function classes to structured cryptographic tasks, assessing transformers' ability to infer hidden mappings from minimal examples.

Method: Uses ICL to train transformer models on cipher tasks, given small (cipher text, plain text) pairs, to infer and decode hidden substitution rules.

Result: Demonstrates transformers’ capability to generalize and infer cryptographic functions purely from context, without parameter updates.

Conclusion: ICL is effective for structured cryptographic tasks, highlighting transformers’ potential in learning complex hidden mappings.

Abstract: In-context learning (ICL) has emerged as a powerful capability of transformer-based language models, enabling them to perform tasks by conditioning on a small number of examples presented at inference time, without any parameter updates. Prior work has shown that transformers can generalize over simple function classes like linear functions, decision trees, even neural networks, purely from context, focusing on numerical or symbolic reasoning over underlying well-structured functions. Instead, we propose a novel application of ICL into the domain of cryptographic function learning, specifically focusing on ciphers such as mono-alphabetic substitution and Vigenère ciphers, two classes of private-key encryption schemes. These ciphers involve a fixed but hidden bijective mapping between plain text and cipher text characters. Given a small set of (cipher text, plain text) pairs, the goal is for the model to infer the underlying substitution and decode a new cipher text word. This setting poses a structured inference challenge, which is well-suited for evaluating the inductive biases and generalization capabilities of transformers under the ICL paradigm. Code is available at https://github.com/adistomar/CS182-project.
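
A small illustration of the task construction: generate a hidden substitution key, present a few (ciphertext, plaintext) pairs in context, then query a held-out word. The prompt format and word list are assumptions for illustration, not the paper's exact setup:

```python
import random
import string

def make_icl_prompt(n_examples=5, seed=0):
    """Build a mono-alphabetic substitution task in ICL format."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    key = dict(zip(letters, shuffled))               # hidden bijection
    enc = lambda w: "".join(key[c] for c in w)
    words = ["apple", "stream", "cipher", "model", "token", "query"]
    pairs = [(enc(w), w) for w in words[:n_examples]]
    query = words[n_examples]
    prompt = "\n".join(f"{c} -> {p}" for c, p in pairs)
    return prompt + f"\n{enc(query)} -> ?", query    # model must infer the key

prompt, answer = make_icl_prompt()
print(prompt)
print("expected:", answer)
```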

[355] Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models

Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou

Main category: cs.LG

TL;DR: HPMI is a retraining-free backdoor attack on transformers, using head-wise pruning and malicious injection, achieving high attack success and bypassing defenses without altering model architecture.

DetailsMotivation: Transformers are vulnerable to backdoor attacks, but existing methods are resource-intensive or intrusive. HPMI aims to provide a lightweight, non-intrusive solution.

Method: Prunes the least important head and injects a pre-trained malicious head, requiring minimal data and no retraining.

Result: Achieves 99.55% attack success, negligible clean accuracy loss, and bypasses four advanced defenses.

Conclusion: HPMI is effective, concealable, and robust against defenses, offering a practical alternative to retraining-dependent attacks.

Abstract: Transformer models have demonstrated exceptional performance and have become indispensable in computer vision (CV) and natural language processing (NLP) tasks. However, recent studies reveal that transformers are susceptible to backdoor attacks. Prior backdoor attack methods typically rely on retraining with clean data or altering the model architecture, both of which can be resource-intensive and intrusive. In this paper, we propose Head-wise Pruning and Malicious Injection (HPMI), a novel retraining-free backdoor attack on transformers that does not alter the model’s architecture. Our approach requires only a small subset of the original data and basic knowledge of the model architecture, eliminating the need for retraining the target transformer. Technically, HPMI works by pruning the least important head and injecting a pre-trained malicious head to establish the backdoor. We provide a rigorous theoretical justification demonstrating that the implanted backdoor resists detection and removal by state-of-the-art defense techniques, under reasonable assumptions. Experimental evaluations across multiple datasets further validate the effectiveness of HPMI, showing that it 1) incurs negligible clean accuracy loss, 2) achieves at least 99.55% attack success rate, and 3) bypasses four advanced defense mechanisms. Additionally, relative to state-of-the-art retraining-dependent attacks, HPMI achieves greater concealment and robustness against diverse defense strategies, while maintaining minimal impact on clean accuracy.

[356] Convergence Analysis of Max-Min Exponential Neural Network Operators in Orlicz Space

Satyaranjan Pradhan, Madan Mohan Soren

Main category: cs.LG

TL;DR: The paper introduces Max Min Kantorovich-type exponential neural network operators for function approximation, analyzing their convergence properties in pointwise, uniform, and Orlicz space settings.

DetailsMotivation: To extend the Max Min framework for approximating functions using exponential neural network operators and investigate their convergence properties.

Method: Develops Max Min Kantorovich-type operators, studies pointwise and uniform convergence, uses logarithmic modulus of continuity for convergence rate, and examines behavior in Orlicz space.

Result: The operators demonstrate convergence, with graphical representations illustrating approximation error using kernel and sigmoidal activation functions.

Conclusion: The proposed operators effectively approximate functions, with analyzed convergence rates and practical validation through graphical examples.

Abstract: In this work, we propose a Max Min approach for approximating functions using exponential neural network operators. We extend this framework to develop the Max Min Kantorovich-type exponential neural network operators and investigate their approximation properties. We study both pointwise and uniform convergence for univariate functions. To analyze the order of convergence, we use the logarithmic modulus of continuity and estimate the corresponding rate of convergence. Furthermore, we examine the convergence behavior of the Max Min Kantorovich-type exponential neural network operators within the Orlicz space setting. We provide some graphical representations to illustrate the approximation error of the function through suitable kernel and sigmoidal activation functions.

[357] Multi-Agent Reinforcement Learning for Adaptive Resource Orchestration in Cloud-Native Clusters

Guanzi Yao, Heyao Liu, Linyan Dai

Main category: cs.LG

TL;DR: An adaptive resource orchestration method using multi-agent reinforcement learning improves cloud-native database scheduling by addressing high resource dynamism and complexity.

DetailsMotivation: To tackle challenges like high resource dynamism and scheduling complexity in cloud-native databases.

Method: Proposes a heterogeneous role-based agent modeling mechanism with a reward-shaping mechanism for better coordination and policy convergence.

Result: Outperforms traditional methods in resource utilization, scheduling latency, convergence speed, stability, and fairness.

Conclusion: The method is effective for large-scale, high-concurrency scheduling tasks, demonstrating strong generalization and practical utility.

Abstract: This paper addresses the challenges of high resource dynamism and scheduling complexity in cloud-native database systems. It proposes an adaptive resource orchestration method based on multi-agent reinforcement learning. The method introduces a heterogeneous role-based agent modeling mechanism. This allows different resource entities, such as compute nodes, storage nodes, and schedulers, to adopt distinct policy representations. These agents are better able to reflect diverse functional responsibilities and local environmental characteristics within the system. A reward-shaping mechanism is designed to integrate local observations with global feedback. This helps mitigate policy learning bias caused by incomplete state observations. By combining real-time local performance signals with global system value estimation, the mechanism improves coordination among agents and enhances policy convergence stability. A unified multi-agent training framework is developed and evaluated on a representative production scheduling dataset. Experimental results show that the proposed method outperforms traditional approaches across multiple key metrics. These include resource utilization, scheduling latency, policy convergence speed, system stability, and fairness. The results demonstrate strong generalization and practical utility. Across various experimental scenarios, the method proves effective in handling orchestration tasks with high concurrency, high-dimensional state spaces, and complex dependency relationships. This confirms its advantages in real-world, large-scale scheduling environments.

[358] Federated Anomaly Detection for Multi-Tenant Cloud Platforms with Personalized Modeling

Yuxi Wang, Heyao Liu, Nyutian Long, Guanzi Yao

Main category: cs.LG

TL;DR: A federated learning-based anomaly detection method for multi-tenant cloud environments addresses privacy, heterogeneity, and centralized modeling limitations by training locally and aggregating parameters globally, with personalized adjustments and Mahalanobis distance for scoring.

DetailsMotivation: To tackle data privacy leakage, heterogeneous resource behavior, and centralized modeling constraints in multi-tenant cloud environments.

Method: Federated training framework with local model training, parameter aggregation, personalized adjustments, and Mahalanobis distance-based anomaly scoring.

Result: Outperforms existing models in Precision, Recall, and F1-Score, demonstrating robustness in varied scenarios.

Conclusion: The method shows practical potential for intelligent resource monitoring and anomaly diagnosis in cloud computing.

Abstract: This paper proposes an anomaly detection method based on federated learning to address key challenges in multi-tenant cloud environments, including data privacy leakage, heterogeneous resource behavior, and the limitations of centralized modeling. The method establishes a federated training framework involving multiple tenants. Each tenant trains the model locally using private resource usage data. Through parameter aggregation, a global model is optimized, enabling cross-tenant collaborative anomaly detection while preserving data privacy. To improve adaptability to diverse resource usage patterns, a personalized parameter adjustment mechanism is introduced. This allows the model to retain tenant-specific feature representations while sharing global knowledge. In the model output stage, the Mahalanobis distance is used to compute anomaly scores. This enhances both the accuracy and stability of anomaly detection. The experiments use real telemetry data from a cloud platform to construct a simulated multi-tenant environment. The study evaluates the model’s performance under varying participation rates and noise injection levels. These comparisons demonstrate the proposed method’s robustness and detection accuracy. Experimental results show that the proposed method outperforms existing mainstream models across key metrics such as Precision, Recall, and F1-Score. It also maintains stable performance in various complex scenarios. These findings highlight the method’s practical potential for intelligent resource monitoring and anomaly diagnosis in cloud computing environments.
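
The Mahalanobis scoring stage can be sketched generically (illustrative names; the federated training and personalization layers are omitted):

```python
import numpy as np

def mahalanobis_scores(X, mean, cov):
    """Anomaly score for each row of X as its Mahalanobis distance from
    the statistics of normal resource behavior."""
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))  # regularize
    diff = X - mean
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

# Toy usage: fit "normal" statistics, then score new telemetry
rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 4))                 # normal resource usage
mean, cov = normal.mean(axis=0), np.cov(normal.T)
new = np.vstack([rng.normal(size=(3, 4)), [[6.0, 6.0, 6.0, 6.0]]])
print(mahalanobis_scores(new, mean, cov))          # last row scores highest
```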

[359] Source Component Shift Adaptation via Offline Decomposition and Online Mixing Approach

Ryuta Matsuno

Main category: cs.LG

TL;DR: A method for adapting to source component shifts in data streams via offline decomposition and online mixing, outperforming baselines by reducing cumulative test loss by up to 67.4%.

DetailsMotivation: Existing methods fail to utilize recurring shifts or capture individual source components effectively, leading to poor adaptation.

Method: Offline decomposition of source components via the EM algorithm, followed by online mixing weight adaptation using convex optimization.

Result: Superior adaptation performance, reducing cumulative test loss by up to 67.4% in experiments on real-world regression datasets.

Conclusion: The proposed method effectively leverages shift characteristics, offering significant improvements over existing approaches.

Abstract: This paper addresses source component shift adaptation, aiming to update predictions adapting to source component shifts for incoming data streams based on past training data. Existing online learning methods often fail to utilize recurring shifts effectively, while model-pool-based methods struggle to capture individual source components, leading to poor adaptation. In this paper, we propose a source component shift adaptation method via an offline decomposition and online mixing approach. We theoretically identify that the problem can be divided into two subproblems: offline source component decomposition and online mixing weight adaptation. Based on this, our method first determines prediction models, each of which learns a source component solely based on past training data offline through the EM algorithm. Then, it updates the mixing weight of the prediction models for precise prediction through online convex optimization. Thanks to our theoretical derivation, our method fully leverages the characteristics of the shifts, achieving superior adaptation performance over existing methods. Experiments conducted on various real-world regression datasets demonstrate that our method outperforms baselines, reducing the cumulative test loss by up to 67.4%.
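
For the online half, one standard online-convex-optimization instantiation is an exponentiated-gradient update of the mixing weights under squared loss; the paper's exact update rule may differ:

```python
import numpy as np

def update_mixing_weights(w, preds, y, lr=0.1):
    """One exponentiated-gradient step for mixing K offline-learned
    component predictors; a standard online-convex-optimization update
    shown as one plausible instantiation."""
    grad = 2.0 * (w @ preds - y) * preds   # gradient of squared loss w.r.t. w
    w = w * np.exp(-lr * grad)
    return w / w.sum()                     # renormalize to stay on the simplex

# Toy stream: two source components, the second currently active
w = np.array([0.5, 0.5])
for y in [1.9, 2.1, 2.0]:                  # incoming observations
    preds = np.array([0.0, 2.0])           # the two components' predictions
    w = update_mixing_weights(w, preds, y)
print(w)                                   # weight shifts toward component 2
```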

[360] Uncertainty-Aware Prediction of Parkinson’s Disease Medication Needs: A Two-Stage Conformal Prediction Approach

Ricardo Diaz-Rincon, Muxuan Liang, Adolfo Ramirez-Zamora, Benjamin Shickel

Main category: cs.LG

TL;DR: A conformal prediction framework for Parkinson’s Disease medication management provides reliable, uncertainty-quantified forecasts of medication needs up to two years ahead, improving clinical decision-making.

DetailsMotivation: Current PD medication management lacks systematic predictive methods, relying on trial-and-error. Machine learning predictions without uncertainty measures undermine clinical trust and utility.

Method: A two-stage conformal prediction framework using EHR data from 631 inpatient admissions: identifies patients needing medication changes, then predicts levodopa equivalent daily dose adjustments with reliable prediction intervals.

Result: Achieved marginal coverage with reduced prediction interval lengths, offering precise short-term and wider long-term forecasts.

Conclusion: Quantifying uncertainty enables evidence-based levodopa dosing, optimizing symptom control and minimizing side effects, thereby improving quality of life.

Abstract: Parkinson’s Disease (PD) medication management presents unique challenges due to heterogeneous disease progression and treatment response. Neurologists must balance symptom control with optimal dopaminergic dosing based on functional disability while minimizing side effects. This balance is crucial as inadequate or abrupt changes can cause levodopa-induced dyskinesia, wearing off, and neuropsychiatric effects, significantly reducing quality of life. Current approaches rely on trial-and-error decisions without systematic predictive methods. Despite machine learning advances, clinical adoption remains limited due to reliance on point predictions that do not account for prediction uncertainty, undermining clinical trust and utility. Clinicians require not only predictions of future medication needs but also reliable confidence measures. Without quantified uncertainty, adjustments risk premature escalation to maximum doses or prolonged inadequate symptom control. We developed a conformal prediction framework anticipating medication needs up to two years in advance with reliable prediction intervals and statistical guarantees. Our approach addresses zero-inflation in PD inpatient data, where patients maintain stable medication regimens between visits. Using electronic health records from 631 inpatient admissions at University of Florida Health (2011-2021), our two-stage approach identifies patients likely to need medication changes, then predicts required levodopa equivalent daily dose adjustments. Our framework achieved marginal coverage while reducing prediction interval lengths compared to traditional approaches, providing precise predictions for short-term planning and wider ranges for long-term forecasting. By quantifying uncertainty, our approach enables evidence-based decisions about levodopa dosing, optimizing symptom control while minimizing side effects and improving quality of life.
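
Underneath the two-stage design sits standard split conformal prediction. A minimal single-stage sketch (the paper adds a change/no-change classifier and zero-inflation handling on top):

```python
import numpy as np

def conformal_interval(residuals_cal, y_hat, alpha=0.1):
    """Split conformal interval: y_hat +/- the (1 - alpha) empirical
    quantile of calibration residuals, with the usual finite-sample
    correction. A generic sketch of the conformal machinery only."""
    n = len(residuals_cal)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(residuals_cal, min(q_level, 1.0))
    return y_hat - q, y_hat + q

# Toy usage: absolute residuals from a held-out calibration set
rng = np.random.default_rng(1)
cal_res = np.abs(rng.normal(scale=50.0, size=200))   # |y - y_hat|, e.g. in LEDD units
print(conformal_interval(cal_res, y_hat=300.0))      # ~90% coverage interval
```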

[361] Improving Learning of New Diseases through Knowledge-Enhanced Initialization for Federated Adapter Tuning

Danni Peng, Yuan Wang, Kangning Cai, Peiyan Ning, Jiming Xu, Yong Liu, Rick Siow Mong Goh, Qingsong Wei, Huazhu Fu

Main category: cs.LG

TL;DR: FedKEI is a federated learning framework that enhances adapter tuning for new tasks in healthcare by leveraging past knowledge through clustering and bi-level optimization.

DetailsMotivation: To enable quick adaptation to new healthcare tasks while preserving privacy and leveraging past experiences.

Method: Uses global clustering and bi-level optimization to personalize knowledge transfer via inter- and intra-cluster weights.

Result: Outperforms state-of-the-art methods in adapting to new diseases across three benchmark datasets.

Conclusion: FedKEI effectively improves task adaptation in federated healthcare settings by combining knowledge transfer and personalization.

Abstract: In healthcare, federated learning (FL) is a widely adopted framework that enables privacy-preserving collaboration among medical institutions. With large foundation models (FMs) demonstrating impressive capabilities, using FMs in FL through cost-efficient adapter tuning has become a popular approach. Given the rapidly evolving healthcare environment, it is crucial for individual clients to quickly adapt to new tasks or diseases by tuning adapters while drawing upon past experiences. In this work, we introduce Federated Knowledge-Enhanced Initialization (FedKEI), a novel framework that leverages cross-client and cross-task transfer from past knowledge to generate informed initializations for learning new tasks with adapters. FedKEI begins with a global clustering process at the server to generalize knowledge across tasks, followed by the optimization of aggregation weights across clusters (inter-cluster weights) and within each cluster (intra-cluster weights) to personalize knowledge transfer for each new task. To facilitate more effective learning of the inter- and intra-cluster weights, we adopt a bi-level optimization scheme that collaboratively learns the global intra-cluster weights across clients and optimizes the local inter-cluster weights toward each client’s task objective. Extensive experiments on three benchmark datasets of different modalities, including dermatology, chest X-rays, and retinal OCT, demonstrate FedKEI’s advantage in adapting to new diseases compared to state-of-the-art methods.
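
Structurally, the knowledge-enhanced initialization is a doubly weighted mix of past-task adapters. A sketch under that assumption (FedKEI learns both weight sets via bi-level optimization rather than fixing them as here):

```python
import numpy as np

def knowledge_enhanced_init(cluster_adapters, inter_w, intra_w):
    """Build an informed adapter initialization: intra-cluster weights mix
    the adapters within each cluster into a cluster summary, and
    inter-cluster weights mix the summaries. A structural sketch only."""
    cluster_protos = [
        sum(a * w for a, w in zip(adapters, w_in))
        for adapters, w_in in zip(cluster_adapters, intra_w)
    ]
    return sum(p * w for p, w in zip(cluster_protos, inter_w))

# Toy usage: two clusters of flattened past-task adapter weights
c1 = [np.ones(4), 2 * np.ones(4)]
c2 = [4 * np.ones(4)]
init = knowledge_enhanced_init([c1, c2], inter_w=[0.75, 0.25],
                               intra_w=[[0.5, 0.5], [1.0]])
print(init)  # -> [2.125 2.125 2.125 2.125]
```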

[362] A Vision-Language Pre-training Model-Guided Approach for Mitigating Backdoor Attacks in Federated Learning

Keke Gai, Dongjue Wang, Jing Yu, Liehuang Zhu, Qi Wu

Main category: cs.LG

TL;DR: CLIP-Fed is a federated learning backdoor defense framework using vision-language models to address heterogeneous data and privacy concerns, improving defense effectiveness and model accuracy.

DetailsMotivation: Existing FL backdoor defenses assume homogeneous data or require clean server datasets, limiting practicality. CLIP-Fed aims to defend against attacks under heterogeneous data while preserving performance.

Method: CLIP-Fed integrates pre- and post-aggregation defenses, uses multimodal models for dataset augmentation, and aligns global model knowledge with CLIP via prototype contrastive loss and KL divergence.

Result: CLIP-Fed reduces attack success rates (ASR) by 2.03% on CIFAR-10 and 1.35% on CIFAR-10-LT, while improving model accuracy (MA) by 7.92% and 0.48%, respectively.

Conclusion: CLIP-Fed effectively defends against backdoor attacks in FL under heterogeneous data, outperforming state-of-the-art methods in both security and performance.

Abstract: Existing backdoor defense methods in Federated Learning (FL) rely on the assumption of homogeneous client data distributions or the availability of a clean server dataset, which limits their practicality and effectiveness. Defending against backdoor attacks under heterogeneous client data distributions while preserving model performance remains a significant challenge. In this paper, we propose a FL backdoor defense framework named CLIP-Fed, which leverages the zero-shot learning capabilities of vision-language pre-training models. By integrating both pre-aggregation and post-aggregation defense strategies, CLIP-Fed overcomes the limitations that non-IID data imposes on defense effectiveness. To address privacy concerns and enhance the coverage of the dataset against diverse triggers, we construct and augment the server dataset using a multimodal large language model and frequency analysis without any client samples. To address class prototype deviations caused by backdoor samples and eliminate the correlation between trigger patterns and target labels, CLIP-Fed aligns the knowledge of the global model and CLIP on the augmented dataset using prototype contrastive loss and Kullback-Leibler divergence. Extensive experiments on representative datasets validate the effectiveness of CLIP-Fed. Compared to state-of-the-art methods, CLIP-Fed achieves an average reduction in ASR, i.e., 2.03% on CIFAR-10 and 1.35% on CIFAR-10-LT, while improving average MA by 7.92% and 0.48%, respectively.
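
The KL alignment term can be sketched as standard distillation against CLIP's zero-shot distribution; the temperature `tau` is a hypothetical choice, and the prototype contrastive loss is omitted:

```python
import torch
import torch.nn.functional as F

def kl_alignment_loss(student_logits, clip_logits, tau=2.0):
    """Align the global model's predictive distribution with CLIP's
    zero-shot distribution on the augmented server dataset via KL
    divergence. A minimal sketch of the distillation term only."""
    p_teacher = F.softmax(clip_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2

# Toy usage
s = torch.randn(8, 10)   # global model logits on augmented samples
t = torch.randn(8, 10)   # CLIP zero-shot logits for the same samples
print(kl_alignment_loss(s, t).item())
```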

[363] Welfare-Centric Clustering

Claire Jie Zhang, Seyed A. Esmaeili, Jamie Morgenstern

Main category: cs.LG

TL;DR: The paper critiques traditional fair clustering methods, proposing a welfare-centric approach with group utility modeling. It introduces Rawlsian and Utilitarian objectives, novel algorithms, and shows superior empirical performance.

DetailsMotivation: Traditional fair clustering methods may produce undesirable outcomes. The authors advocate for a welfare-centric approach to better model group utilities.

Method: Model group utilities using distances and proportional representation. Introduce algorithms for Rawlsian and Utilitarian objectives with theoretical guarantees.

Result: Empirical evaluations show the methods outperform existing fair clustering baselines.

Conclusion: The welfare-centric approach improves fairness in clustering by focusing on group utilities, validated by theory and experiments.

Abstract: Fair clustering has traditionally focused on ensuring equitable group representation or equalizing group-specific clustering costs. However, Dickerson et al. (2025) recently showed that these fairness notions may yield undesirable or unintuitive clustering outcomes and advocated for a welfare-centric clustering approach that models the utilities of the groups. In this work, we model group utilities based on both distances and proportional representation and formalize two optimization objectives based on welfare-centric clustering: the Rawlsian (Egalitarian) objective and the Utilitarian objective. We introduce novel algorithms for both objectives and prove theoretical guarantees for them. Empirical evaluations on multiple real-world datasets demonstrate that our methods significantly outperform existing fair clustering baselines.
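
A sketch of the two objectives, assuming a purely distance-based group utility (the paper's utilities also incorporate proportional representation):

```python
import numpy as np

def group_utilities(X, centers, groups):
    """Hypothetical distance-based utility: negative mean distance of a
    group's points to their nearest center (higher is better)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1).min(axis=1)
    return np.array([-d[groups == g].mean() for g in np.unique(groups)])

def rawlsian(utils):     # egalitarian: welfare of the worst-off group
    return utils.min()

def utilitarian(utils):  # aggregate (here, mean) welfare across groups
    return utils.mean()

# Toy usage: score one candidate set of centers under both objectives
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
groups = rng.integers(0, 2, size=100)
u = group_utilities(X, centers=np.array([[0.0, 0.0]]), groups=groups)
print(rawlsian(u), utilitarian(u))
```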

[364] A Hierarchical IDS for Zero-Day Attack Detection in Internet of Medical Things Networks

Md Ashraf Uddin, Nam H. Chu, Reza Rafeh

Main category: cs.LG

TL;DR: A multi-level IoMT IDS framework is proposed to detect zero-day attacks and distinguish known/unknown threats, achieving high accuracy and F1-score.

DetailsMotivation: IoMT networks are vulnerable to cyberattacks, and traditional centralized IDSs are unsuitable due to delays, privacy risks, and device constraints.

Method: A multi-level framework with edge and cloud layers uses meta-learning and OCC for coarse and fine-grained attack detection.

Result: Achieves 99.77% accuracy and 97.8% F1-score on the CICIoMT2024 dataset, with high zero-day attack detection.

Conclusion: The framework is effective for IoMT security, addressing limitations of traditional IDSs and ensuring applicability in resource-constrained environments.

Abstract: The Internet of Medical Things (IoMT) is driving a healthcare revolution but remains vulnerable to cyberattacks such as denial of service, ransomware, data hijacking, and spoofing. These networks comprise resource-constrained, heterogeneous devices (e.g., wearable sensors, smart pills, implantables), making traditional centralized Intrusion Detection Systems (IDSs) unsuitable due to response delays, privacy risks, and added vulnerabilities. Centralized IDSs require all sensors to transmit data to a central server, causing delays or network disruptions in dense environments. Running IDSs locally on IoMT devices is often infeasible due to limited computation, and even lightweight IDS components remain at risk if updated models are delayed, leaving them exposed to zero-day attacks that threaten patient health and data security. We propose a multi-level IoMT IDS framework capable of detecting zero-day attacks and distinguishing between known and unknown threats. The first layer (near edge) filters traffic at a coarse level (attack or not) using meta-learning or One-Class Classification (OCC) with the usfAD algorithm. Subsequent layers (far edge, cloud) identify attack type and novelty. Experiments on the CICIoMT2024 dataset show 99.77% accuracy and 97.8% F1-score. The first layer detects zero-day attacks with high accuracy without needing new datasets, ensuring strong applicability in IoMT environments.
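
A minimal sketch of the first-layer coarse filter, with scikit-learn's IsolationForest standing in for the usfAD one-class classifier used in the paper:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# First-layer sketch: a one-class model trained only on benign traffic
# flags anything unfamiliar (including zero-day attacks) as "attack".
rng = np.random.default_rng(3)
benign_train = rng.normal(0, 1, size=(1000, 8))       # benign IoMT flow features
occ = IsolationForest(random_state=0).fit(benign_train)

mixed_traffic = np.vstack([rng.normal(0, 1, size=(5, 8)),     # benign
                           rng.normal(5, 1, size=(5, 8))])    # novel attack
is_attack = occ.predict(mixed_traffic) == -1          # -1 = outlier
print(is_attack)  # flagged flows go to far-edge/cloud layers for attack typing
```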

[365] Semantic Communication with Distribution Learning through Sequential Observations

Samer Lahoud, Kinda Khawam

Main category: cs.LG

TL;DR: The paper explores distribution learning in semantic communication, proving learnability conditions, convergence rates, and trade-offs between immediate performance and long-term learnability.

DetailsMotivation: Traditional semantic communication focuses on individual meaning transmission, but this paper addresses the challenge of learning source statistics when priors are unknown.

Method: The study establishes fundamental conditions for learnability, analyzes convergence rates of distribution estimation, and quantifies semantic distortion from estimation errors. Experiments on CIFAR-10 validate the theoretical framework.

Result: Learnability requires a full-rank effective transmission matrix. Encoding schemes optimized for immediate performance often hinder long-term learnability. System conditioning critically impacts learning rate and performance.

Conclusion: The paper provides the first rigorous characterization of statistical learning in semantic communication, offering design principles to balance immediate performance with adaptation capability.

Abstract: Semantic communication aims to convey meaning rather than bit-perfect reproduction, representing a paradigm shift from traditional communication. This paper investigates distribution learning in semantic communication where receivers must infer the underlying meaning distribution through sequential observations. While semantic communication traditionally optimizes individual meaning transmission, we establish fundamental conditions for learning source statistics when priors are unknown. We prove that learnability requires full rank of the effective transmission matrix, characterize the convergence rate of distribution estimation, and quantify how estimation errors translate to semantic distortion. Our analysis reveals a fundamental trade-off: encoding schemes optimized for immediate semantic performance often sacrifice long-term learnability. Experiments on CIFAR-10 validate our theoretical framework, demonstrating that system conditioning critically impacts both learning rate and achievable performance. These results provide the first rigorous characterization of statistical learning in semantic communication and offer design principles for systems that balance immediate performance with adaptation capability.
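
The learnability condition is easy to illustrate: with a full-rank effective transmission matrix, the meaning distribution is identifiable from observation frequencies. A toy sketch with a hypothetical column-stochastic matrix:

```python
import numpy as np

# Observed marginal is q = M @ p, where column i of M is the observation
# distribution for meaning i. If M has full column rank, p is identifiable.
M = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])                    # hypothetical encoder/channel
print(np.linalg.matrix_rank(M) == M.shape[1])      # True -> p is learnable

p_true = np.array([0.5, 0.3, 0.2])                 # unknown meaning prior
q = M @ p_true                                     # long-run observation frequencies
print(np.linalg.solve(M, q))                       # recovers p_true in the limit
```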

[366] eMamba: Efficient Acceleration Framework for Mamba Models in Edge Computing

Jiyong Kim, Jaeho Lee, Jiahao Lin, Alish Kanani, Miao Sun, Umit Y. Ogras, Jaehyun Park

Main category: cs.LG

TL;DR: eMamba is a hardware acceleration framework for Mamba SSM models on edge devices, offering efficiency and accuracy.

DetailsMotivation: To address the lack of optimized hardware acceleration for deploying Mamba models on resource-constrained edge devices.

Method: eMamba replaces complex layers with lightweight alternatives, approximates expensive operations, and uses approximation-aware NAS.

Result: Achieves comparable accuracy with fewer parameters, lower latency, higher throughput, and reduced power/energy consumption.

Conclusion: eMamba is efficient and effective for deploying Mamba models on edge platforms, outperforming baselines.

Abstract: State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data. Mamba, a recent sequence-to-sequence SSM, offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration frameworks are currently optimized for deploying it in such environments. This paper presents eMamba, a comprehensive end-to-end hardware acceleration framework explicitly designed for deploying Mamba models on edge platforms. eMamba maximizes computational efficiency by replacing complex normalization layers with lightweight hardware-aware alternatives and approximating expensive operations, such as SiLU activation and exponentiation, considering the target applications. Then, it performs an approximation-aware neural architecture search (NAS) to tune the learnable parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10, and MARS, an open-source human pose estimation dataset, show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63-19.9$\times$ fewer parameters. In addition, it generalizes well to large-scale natural language tasks, demonstrating stable perplexity across varying sequence lengths on the WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on an AMD ZCU102 FPGA and ASIC using GlobalFoundries (GF) 22 nm technology. Experimental results show 4.95-5.62$\times$ lower latency and 2.22-9.95$\times$ higher throughput, with 4.77$\times$ smaller area, 9.84$\times$ lower power, and 48.6$\times$ lower energy consumption than baseline solutions while maintaining competitive accuracy.
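
As an example of the kind of operation approximation involved, a piecewise-linear SiLU stand-in (the knot placement here is a hypothetical choice; eMamba tunes such parameters with approximation-aware NAS):

```python
import numpy as np

def silu_pwl(x, knots=np.linspace(-6, 6, 25)):
    """Hardware-friendly stand-in for SiLU: piecewise-linear interpolation
    between precomputed table values, the style of approximation eMamba
    applies to expensive operations such as SiLU and exponentiation."""
    table = knots / (1.0 + np.exp(-knots))          # exact SiLU at the knots
    return np.interp(x, knots, table)

# Worst-case error over the knot range stays small
x = np.linspace(-6, 6, 5000)
exact = x / (1.0 + np.exp(-x))
print(np.max(np.abs(silu_pwl(x) - exact)))
```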

[367] XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Main category: cs.LG

TL;DR: XQuant reduces memory consumption for LLM inference by quantizing and caching layer input activations, achieving significant memory savings with minimal accuracy loss.

DetailsMotivation: Efficient LLM inference is challenging due to high memory and bandwidth demands, exacerbated by the mismatch between compute capabilities and memory resources.

Method: XQuant quantizes and caches layer input activations (X) instead of KV caching, rematerializing Keys and Values on-the-fly. XQuant-CL further exploits cross-layer similarity for extreme compression.

Result: XQuant achieves up to 7.7× memory savings with <0.1 perplexity degradation. XQuant-CL attains up to 10× savings with 0.01 degradation and 12.5× with 0.1 degradation.

Conclusion: XQuant leverages compute capabilities to overcome memory bottlenecks, outperforming KV cache quantization methods while maintaining near-FP16 accuracy.

Abstract: Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and then rematerializing the Keys and Values on-the-fly during inference. This results in an immediate 2$\times$ memory savings compared to KV caching. By applying XQuant, we achieve up to $\sim 7.7\times$ memory savings with $<0.1$ perplexity degradation compared to the FP16 baseline. Furthermore, our approach leverages the fact that X values are similar across layers. Building on this observation, we introduce XQuant-CL, which exploits the cross-layer similarity in the X embeddings for extreme compression. Across different models, XQuant-CL attains up to 10$\times$ memory savings relative to the FP16 baseline with only 0.01 perplexity degradation, and 12.5$\times$ memory savings with only $0.1$ perplexity degradation. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck, while surpassing state-of-the-art KV cache quantization methods and achieving near-FP16 accuracy across a wide range of models.
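
The core idea rematerializes K and V from cached, quantized activations. A simplified sketch with per-tensor int8 quantization standing in for the paper's low-bit scheme:

```python
import torch

def rematerialize_kv(x_q, scale, w_k, w_v):
    """XQuant-style inference step (simplified): dequantize the cached
    layer inputs X, then recompute K and V on the fly instead of
    reading them from a KV cache."""
    x = x_q.float() * scale          # dequantize cached activations
    return x @ w_k, x @ w_v          # rematerialized keys and values

# Toy usage: cache X in int8 (one tensor instead of K and V -> ~2x saving)
x = torch.randn(16, 64)                       # layer input activations
scale = x.abs().max() / 127.0
x_q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
w_k, w_v = torch.randn(64, 64), torch.randn(64, 64)
k, v = rematerialize_kv(x_q, scale, w_k, w_v)
print(k.shape, v.shape)
```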

[368] SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks

Pengbo Shen, Yaqing Wang, Ni Mu, Yao Luan, Runpeng Xie, Senhao Yang, Lexiang Wang, Hao Hu, Shuang Xu, Yiqin Yang, Bo Xu

Main category: cs.LG

TL;DR: SC2Arena and StarEvolve address limitations in evaluating LLMs for complex decision-making in StarCraft II by providing a comprehensive benchmark and a hierarchical framework for strategic planning and tactical execution.

DetailsMotivation: Existing benchmarks for StarCraft II lack full complexity, such as complete game context and diverse action spaces, limiting AI's strategic planning and real-time adaptation capabilities.

Method: SC2Arena supports all playable races and low-level actions, while StarEvolve integrates planning and execution with self-correction and fine-tuning.

Result: StarEvolve outperforms in strategic planning, and SC2Arena enables insights for developing generalist agents.

Conclusion: The proposed tools advance AI’s decision-making in complex environments, with publicly available resources for further research.

Abstract: Evaluating large language models (LLMs) in complex decision-making is essential for advancing AI’s ability for strategic planning and real-time adaptation. However, existing benchmarks for tasks like StarCraft II fail to capture the game’s full complexity, such as its complete game context, diverse action spaces, and all playable races. To address this gap, we present SC2Arena, a benchmark that fully supports all playable races, low-level action spaces, and optimizes text-based observations to tackle spatial reasoning challenges. Complementing this, we introduce StarEvolve, a hierarchical framework that integrates strategic planning with tactical execution, featuring iterative self-correction and continuous improvement via fine-tuning on high-quality gameplay data. Its key components include a Planner-Executor-Verifier structure to break down gameplay, and a scoring system for selecting high-quality training samples. Comprehensive analysis using SC2Arena provides valuable insights into developing generalist agents, insights that previous benchmarks could not offer. Experimental results also demonstrate that our proposed StarEvolve achieves superior performance in strategic planning. Our code, environment, and algorithms are publicly available.

[369] Unpacking the Implicit Norm Dynamics of Sharpness-Aware Minimization in Tensorized Models

Tianxiao Cao, Kyohei Atarashi, Hisashi Kashima

Main category: cs.LG

TL;DR: The paper explores SAM’s behavior in tensorized models, introduces Norm Deviation, and proposes DAS for improved performance with less computation.

DetailsMotivation: To understand SAM's implicit regularization in general tensorized models and improve its efficiency.

Method: Analyze SAM’s norm dynamics, introduce Norm Deviation, and propose Deviation-Aware Scaling (DAS).

Result: DAS matches or outperforms SAM in tasks like tensor completion and model compression, with lower computational cost.

Conclusion: DAS effectively mimics SAM’s regularization, offering a simpler and more efficient alternative.

Abstract: Sharpness-Aware Minimization (SAM) has been proven to be an effective optimization technique for improving generalization in overparameterized models. While prior works have explored the implicit regularization of SAM in simple two-core scale-invariant settings, its behavior in more general tensorized or scale-invariant models remains underexplored. In this work, we leverage scale-invariance to analyze the norm dynamics of SAM in general tensorized models. We introduce the notion of Norm Deviation as a global measure of core norm imbalance, and derive its evolution under SAM using gradient flow analysis. We show that SAM’s implicit control of Norm Deviation is governed by the covariance between core norms and their gradient magnitudes. Motivated by these findings, we propose a simple yet effective method, Deviation-Aware Scaling (DAS), which explicitly mimics this regularization behavior by scaling core norms in a data-adaptive manner. Our experiments across tensor completion, noisy training, model compression, and parameter-efficient fine-tuning confirm that DAS achieves competitive or improved performance over SAM, while offering reduced computational overhead.
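
A heavily hedged sketch of the two ingredients: Norm Deviation is rendered here as the variance of core norms, and the DAS rescaling is driven by how core norms and gradient magnitudes jointly deviate from their means; the paper's precise definitions may differ:

```python
import torch

def norm_deviation(cores):
    """Norm Deviation sketched as the variance of the cores' Frobenius
    norms; a stand-in for the paper's global imbalance measure."""
    norms = torch.stack([c.norm() for c in cores])
    return norms.var(unbiased=False)

def das_step(cores, grads, eta=0.01):
    """Deviation-Aware Scaling sketch: rescale each core according to how
    its norm and gradient magnitude jointly deviate from the mean, mimicking
    the covariance-driven norm control SAM exerts implicitly, without SAM's
    extra forward/backward pass."""
    norms = torch.stack([c.norm() for c in cores])
    gmags = torch.stack([g.norm() for g in grads])
    dev = (norms - norms.mean()) * (gmags - gmags.mean())
    for c, s in zip(cores, dev):
        c.mul_(1.0 - eta * torch.tanh(s))   # damp jointly dominant cores

# Toy usage with two imbalanced tensor cores
cores = [3 * torch.randn(8, 8), torch.randn(8, 8)]
grads = [torch.randn(8, 8), torch.randn(8, 8)]
before = norm_deviation(cores).item()
das_step(cores, grads)
print(before, norm_deviation(cores).item())
```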

[370] RealAC: A Domain-Agnostic Framework for Realistic and Actionable Counterfactual Explanations

Asiful Arefeen, Shovito Barua Soumma, Hassan Ghasemzadeh

Main category: cs.LG

TL;DR: RealAC is a framework for generating realistic and actionable counterfactual explanations by preserving inter-feature dependencies and accommodating user constraints, outperforming existing methods.

DetailsMotivation: Existing counterfactual explanation methods lack realism, generalizability, and user-centric flexibility, often ignoring complex data dependencies and user preferences.

Method: RealAC aligns joint feature distributions between factual and counterfactual instances and allows users to freeze certain attributes, ensuring feasibility.

Result: RealAC outperforms state-of-the-art baselines and LLM-based techniques in realism, dependency preservation, and causal plausibility metrics.

Conclusion: RealAC provides a causality-aware, user-centric solution for generating actionable and realistic counterfactual explanations.

Abstract: Counterfactual explanations provide human-understandable reasoning for AI-made decisions by describing minimal changes to input features that would alter a model’s prediction. To be truly useful in practice, such explanations must be realistic and feasible – they should respect both the underlying data distribution and user-defined feasibility constraints. Existing approaches often enforce inter-feature dependencies through rigid, hand-crafted constraints or domain-specific knowledge, which limits their generalizability and ability to capture complex, nonlinear relations inherent in data. Moreover, they rarely accommodate user-specified preferences and suggest explanations that are causally implausible or infeasible to act upon. We introduce RealAC, a domain-agnostic framework for generating realistic and actionable counterfactuals. RealAC automatically preserves complex inter-feature dependencies without relying on explicit domain knowledge – by aligning the joint distributions of feature pairs between factual and counterfactual instances. The framework also allows end-users to “freeze” attributes they cannot or do not wish to change by suppressing change in frozen features during optimization. Evaluations on three synthetic and two real datasets demonstrate that RealAC balances realism with actionability. Our method outperforms state-of-the-art baselines and Large Language Model-based counterfactual generation techniques in causal edge score, dependency preservation score, and IM1 realism metric and offers a solution for causality-aware and user-centric counterfactual generation.
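
The freezing mechanism reduces to masking gradients of user-frozen features during counterfactual optimization. A minimal sketch (the distribution-alignment term for realism is omitted):

```python
import torch

def counterfactual_step(x_cf, grad, frozen_mask, lr=0.05):
    """One optimization step toward a counterfactual that respects user
    constraints: updates to features the user has frozen are suppressed
    by masking their gradients. A sketch of the freezing mechanism only."""
    return x_cf - lr * grad * (~frozen_mask)

# Toy usage: feature 0 ("age") is frozen, feature 1 is free to change
x_cf = torch.tensor([45.0, 2.0])
grad = torch.tensor([1.0, -3.0])
frozen = torch.tensor([True, False])
print(counterfactual_step(x_cf, grad, frozen))   # age stays at 45.0
```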

[371] X-Node: Self-Explanation is All We Need

Prajit Sengupta, Islem Rekik

Main category: cs.LG

TL;DR: X-Node is a self-explaining GNN framework that generates per-node explanations during prediction, maintaining accuracy while improving interpretability.

DetailsMotivation: GNNs lack transparency in decision-making, especially in high-stakes clinical applications, where interpretability is crucial. Existing explainability methods are post-hoc and global, failing to provide local insights.

Method: X-Node constructs interpretable context vectors for each node, uses a Reasoner module to map these into explanations, and integrates them back into the GNN via text-injection.

Result: X-Node achieves competitive classification accuracy on MedMNIST and MorphoMNIST datasets while providing faithful per-node explanations.

Conclusion: X-Node enhances GNN interpretability without sacrificing performance, making it suitable for clinical applications.

Abstract: Graph neural networks (GNNs) have achieved state-of-the-art results in computer vision and medical image classification tasks by capturing structural dependencies across data instances. However, their decision-making remains largely opaque, limiting their trustworthiness in high-stakes clinical applications where interpretability is essential. Existing explainability techniques for GNNs are typically post-hoc and global, offering limited insight into individual node decisions or local reasoning. We introduce X-Node, a self-explaining GNN framework in which each node generates its own explanation as part of the prediction process. For every node, we construct a structured context vector encoding interpretable cues such as degree, centrality, clustering, feature saliency, and label agreement within its local topology. A lightweight Reasoner module maps this context into a compact explanation vector, which serves three purposes: (1) reconstructing the node’s latent embedding via a decoder to enforce faithfulness, (2) generating a natural language explanation using a pre-trained LLM (e.g., Grok or Gemini), and (3) guiding the GNN itself via a “text-injection” mechanism that feeds explanations back into the message-passing pipeline. We evaluate X-Node on two graph datasets derived from MedMNIST and MorphoMNIST, integrating it with GCN, GAT, and GIN backbones. Our results show that X-Node maintains competitive classification accuracy while producing faithful, per-node explanations. Repository: https://github.com/basiralab/X-Node.
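
A sketch of the interpretable context cues X-Node computes per node, using networkx; feature saliency is omitted and names are illustrative:

```python
import networkx as nx
import numpy as np

def node_context_vector(G, node, label_of):
    """Interpretable per-node cues of the kind X-Node encodes before its
    Reasoner maps them to an explanation: degree, centrality, clustering,
    and local label agreement."""
    neigh = list(G.neighbors(node))
    agree = np.mean([label_of[n] == label_of[node] for n in neigh]) if neigh else 0.0
    return np.array([
        G.degree(node),                   # degree
        nx.degree_centrality(G)[node],    # centrality
        nx.clustering(G, node),           # local clustering coefficient
        agree,                            # label agreement in the neighborhood
    ])

# Toy usage on a standard graph with node attributes as stand-in labels
G = nx.karate_club_graph()
labels = {n: G.nodes[n]["club"] for n in G}
print(node_context_vector(G, 0, labels))
```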

[372] GraphFedMIG: Tackling Class Imbalance in Federated Graph Learning via Mutual Information-Guided Generation

Xinrui Li, Qilin Fan, Tianfu Wang, Kaiwen Wei, Ke Yu, Xu Zhang

Main category: cs.LG

TL;DR: GraphFedMIG addresses class imbalance in federated graph learning by using a hierarchical GAN for data augmentation and mutual information-guided training.

DetailsMotivation: Class imbalance in federated graph learning biases models toward majority classes, hindering performance on rare events.

Method: Proposes GraphFedMIG, a framework with local generators and clustered discriminators, guided by mutual information to focus on minority-class features.

Result: Outperforms baselines on four real-world datasets.

Conclusion: GraphFedMIG effectively mitigates class imbalance in federated graph learning.

Abstract: Federated graph learning (FGL) enables multiple clients to collaboratively train powerful graph neural networks without sharing their private, decentralized graph data. Inherited from generic federated learning, FGL is critically challenged by statistical heterogeneity, where non-IID data distributions across clients can severely impair model performance. A particularly destructive form of this is class imbalance, which causes the global model to become biased towards majority classes and fail at identifying rare but critical events. This issue is exacerbated in FGL, as nodes from a minority class are often surrounded by biased neighborhood information, hindering the learning of expressive embeddings. To grapple with this challenge, we propose GraphFedMIG, a novel FGL framework that reframes the problem as a federated generative data augmentation task. GraphFedMIG employs a hierarchical generative adversarial network where each client trains a local generator to synthesize high-fidelity feature representations. To provide tailored supervision, clients are grouped into clusters, each sharing a dedicated discriminator. Crucially, the framework designs a mutual information-guided mechanism to steer the evolution of these client generators. By calculating each client’s unique informational value, this mechanism corrects the local generator parameters, ensuring that subsequent rounds of mutual information-guided generation are focused on producing high-value, minority-class features. We conduct extensive experiments on four real-world datasets, and the results demonstrate the superiority of the proposed GraphFedMIG compared with other baselines.

[373] EDAPT: Towards Calibration-Free BCIs with Continual Online Adaptation

Lisa Haxel, Jaivardhan Kapoor, Ulf Ziemann, Jakob H. Macke

Main category: cs.LG

TL;DR: EDAPT is a framework for brain-computer interfaces (BCIs) that eliminates the need for frequent recalibration by combining population-level pretraining and continual online finetuning, improving accuracy and practicality.

DetailsMotivation: BCIs suffer from accuracy degradation due to neural signal drift and user variability, requiring frequent recalibration, which limits practical deployment.

Method: EDAPT trains a baseline decoder using multi-user data, then continually personalizes it via supervised finetuning as neural patterns evolve. It also uses unsupervised domain adaptation for further gains.

Result: EDAPT consistently outperformed static methods across nine datasets, improving accuracy efficiently (updates in 200ms) and scaling with total data budget.

Conclusion: EDAPT offers a practical solution for calibration-free BCIs, reducing deployment barriers by combining pretraining and continual adaptation.

Abstract: Brain-computer interfaces (BCIs) suffer from accuracy degradation as neural signals drift over time and vary across users, requiring frequent recalibration that limits practical deployment. We introduce EDAPT, a task- and model-agnostic framework that eliminates calibration through continual model adaptation. EDAPT first trains a baseline decoder using data from multiple users, then continually personalizes this model via supervised finetuning as the neural patterns evolve during use. We tested EDAPT across nine datasets covering three BCI tasks, and found that it consistently improved accuracy over conventional, static methods. These improvements primarily stem from combining population-level pretraining and online continual finetuning, with unsupervised domain adaptation providing further gains on some datasets. EDAPT runs efficiently, updating models within 200 milliseconds on consumer-grade hardware. Finally, decoding accuracy scales with total data budget rather than its allocation between subjects and trials. EDAPT provides a practical pathway toward calibration-free BCIs, reducing a major barrier to BCI deployment.
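
The continual adaptation loop itself is plain supervised finetuning on the incoming stream. A minimal sketch with a linear decoder standing in for the pretrained model:

```python
import torch

def edapt_session(model, optimizer, stream, loss_fn):
    """Continual online finetuning sketch: start from the population-
    pretrained decoder and take one supervised gradient step per labeled
    trial as neural data arrives. Names are illustrative; EDAPT is task-
    and model-agnostic."""
    model.train()
    for x, y in stream:                 # sequential (trial, label) pairs
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                # cheap enough for ~200 ms updates

# Toy usage with a linear decoder over 32-channel features
model = torch.nn.Linear(32, 4)          # pretrained weights would be loaded here
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
stream = [(torch.randn(1, 32), torch.tensor([1])) for _ in range(10)]
edapt_session(model, opt, stream, torch.nn.CrossEntropyLoss())
```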

[374] Confounding is a Pervasive Problem in Real World Recommender Systems

Alexander Merkov, David Rohde, Alexandre Gilotte, Benjamin Heymann

Main category: cs.LG

TL;DR: The paper highlights how unobserved confounding, typically a problem in observational studies, can also affect recommender systems due to practices like feature engineering and A/B testing, and offers solutions.

DetailsMotivation: To address the overlooked issue of confounding in recommender systems, which can bias results despite using fully observed data.

Method: The paper uses simulation studies and illustrations to demonstrate how common practices introduce confounding.

Result: Confounding in recommender systems is shown to degrade performance, with practical solutions provided.

Conclusion: Practitioners should be aware of and mitigate confounding in recommender systems to improve accuracy and reliability.

Abstract: Unobserved confounding arises when an unmeasured feature influences both the treatment and the outcome, leading to biased causal effect estimates. This issue undermines observational studies in fields like economics, medicine, ecology or epidemiology. Recommender systems leveraging fully observed data seem not to be vulnerable to this problem. However, many standard practices in recommender systems result in observed features being ignored, resulting in effectively the same problem. This paper shows that numerous common practices, such as feature engineering, A/B testing, and modularization, can in fact introduce confounding into recommendation systems and hamper their performance. Several illustrations of the phenomena are provided, supported by simulation studies, with practical suggestions about how practitioners may reduce or avoid the effects of confounding in real systems.
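
A toy simulation in the spirit of the paper's studies: dropping a feature that drives both the recommendation and the click makes the naive uplift estimate badly biased:

```python
import numpy as np

# An observed feature (user intent) drives both exposure and outcome.
# If feature engineering drops it, the naive contrast is confounded.
rng = np.random.default_rng(4)
n = 100_000
intent = rng.normal(size=n)                                 # later "dropped"
shown = rng.random(n) < 1 / (1 + np.exp(-2 * intent))       # policy targets intent
click = rng.random(n) < 1 / (1 + np.exp(-(2 * intent + 0.2 * shown)))

naive = click[shown].mean() - click[~shown].mean()          # confounded contrast
print(f"naive uplift: {naive:.3f}")  # far above the small true effect (+0.2 on the logit scale)
```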

[375] Pinet: Optimizing hard-constrained neural networks with orthogonal projection layers

Panagiotis D. Grontas, Antonio Terpin, Efe C. Balta, Raffaello D’Andrea, John Lygeros

Main category: cs.LG

TL;DR: $\Pi$net is an output layer that guarantees neural network outputs satisfy convex constraints, serving as a fast feasible-by-design proxy for parametric constrained optimization.

DetailsMotivation: Learned proxies for parametric constrained optimization are attractive for speed, but standard networks cannot guarantee that their outputs satisfy hard constraints.

Method: An output layer that performs rapid, reliable projections via operator splitting in the forward pass and backpropagates through them using the implicit function theorem.

Result: Obtains modest-accuracy solutions faster than traditional solvers on single problems and significantly faster on batches, surpassing state-of-the-art learning approaches in training time, solution quality, and robustness to hyperparameter tuning at similar inference times.

Conclusion: $\Pi$net is an effective feasible-by-design optimization proxy, demonstrated on multi-vehicle motion planning and released as a GPU-ready JAX package with tuning heuristics.

Abstract: We introduce an output layer for neural networks that ensures satisfaction of convex constraints. Our approach, $\Pi$net, leverages operator splitting for rapid and reliable projections in the forward pass, and the implicit function theorem for backpropagation. We deploy $\Pi$net as a feasible-by-design optimization proxy for parametric constrained optimization problems and obtain modest-accuracy solutions faster than traditional solvers when solving a single problem, and significantly faster for a batch of problems. We surpass state-of-the-art learning approaches in terms of training time, solution quality, and robustness to hyperparameter tuning, while maintaining similar inference times. Finally, we tackle multi-vehicle motion planning with non-convex trajectory preferences and provide $\Pi$net as a GPU-ready package implemented in JAX with effective tuning heuristics.
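
The projection idea can be sketched with simple alternating projections onto the constraint sets; this converges to a feasible point (Dykstra's variant or the paper's operator-splitting scheme recovers the exact projection, and $\Pi$net backpropagates through it via the implicit function theorem):

```python
import numpy as np

def project_halfspace(y, a, b):
    """Euclidean projection onto {x : a @ x <= b}."""
    viol = a @ y - b
    return y if viol <= 0 else y - viol * a / (a @ a)

def project_box(y, lo, hi):
    return np.clip(y, lo, hi)

def project_intersection(y, a, b, lo, hi, iters=100):
    """Alternating projections onto a box intersected with a halfspace,
    a simple stand-in for the projection layer's forward pass."""
    x = np.asarray(y, dtype=float).copy()
    for _ in range(iters):
        x = project_box(project_halfspace(x, a, b), lo, hi)
    return x

# Toy usage: map a raw network output into the feasible set
y = np.array([2.0, 2.0])
x = project_intersection(y, a=np.array([1.0, 1.0]), b=1.0, lo=-1.0, hi=1.0)
print(x, x.sum() <= 1.0 + 1e-8)
```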

[376] Learning State-Space Models of Dynamic Systems from Arbitrary Data using Joint Embedding Predictive Architectures

Jonas Ulmen, Ganesh Sundaram, Daniel Görges

Main category: cs.LG

TL;DR: The paper introduces a novel method using Joint Embedding Predictive Architectures (JEPAs) and neural ODEs to create world models from observation data, demonstrated on a pendulum system.

DetailsMotivation: To improve upon reconstruction-based methods by leveraging JEPAs and continuous-time dynamics for more capable world modeling.

Method: Integrates sequence embeddings with neural ODEs, enforcing contractive embeddings and Lipschitz constants for a structured latent state space.

Result: Successfully generates structured latent state-space models for a pendulum system using image data.

Conclusion: The technique offers a promising approach for general control algorithms and robotics applications.

Abstract: With the advent of Joint Embedding Predictive Architectures (JEPAs), which appear to be more capable than reconstruction-based methods, this paper introduces a novel technique for creating world models using continuous-time dynamic systems from arbitrary observation data. The proposed method integrates sequence embeddings with neural ordinary differential equations (neural ODEs). It employs loss functions that enforce contractive embeddings and Lipschitz constants in state transitions to construct a well-organized latent state space. The approach’s effectiveness is demonstrated through the generation of structured latent state-space models for a simple pendulum system using only image data. This opens up a new technique for developing more general control algorithms and estimation techniques with broad applications in robotics.

[377] On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations

Amir Mehrpanah, Matteo Gamba, Kevin Smith, Hossein Azizpour

Main category: cs.LG

TL;DR: A spectral framework is introduced to analyze and quantify the trade-off between smoothness and faithfulness in explanations of ReLU networks, addressing noisy gradient-based interpretations.

DetailsMotivation: ReLU networks' sharp transitions and reliance on individual pixels make gradient-based explanations noisy and hard to interpret, while existing methods like GradCAM sacrifice faithfulness for smoothness.

Method: A spectral framework is developed to analyze smoothness and faithfulness in explanations, regularizing high-frequency contributions of ReLU networks to balance these factors.

Result: The framework identifies an “explanation gap” caused by surrogate-based smoothing, quantifying distortions in explanations. Theoretical findings are validated across datasets and design choices.

Conclusion: The proposed framework provides a principled approach to understanding and optimizing the trade-off between smoothness and faithfulness in explanations for ReLU networks.

Abstract: ReLU networks, while prevalent for visual data, have sharp transitions, sometimes relying on individual pixels for predictions, making vanilla gradient-based explanations noisy and difficult to interpret. Existing methods, such as GradCAM, smooth these explanations by producing surrogate models at the cost of faithfulness. We introduce a unifying spectral framework to systematically analyze and quantify smoothness, faithfulness, and their trade-off in explanations. Using this framework, we quantify and regularize the contribution of ReLU networks to high-frequency information, providing a principled approach to identifying this trade-off. Our analysis characterizes how surrogate-based smoothing distorts explanations, leading to an “explanation gap” that we formally define and measure for different post-hoc methods. Finally, we validate our theoretical findings across different design choices, datasets, and ablations.

[378] Contrastive ECOC: Learning Output Codes for Adversarial Defense

Che-Yu Chou, Hung-Hsuan Chen

Main category: cs.LG

TL;DR: The paper introduces automated codebook learning models for multiclass classification using ECOC, outperforming traditional methods in robustness.

DetailsMotivation: Traditional ECOC methods rely on manual or random codebooks, which are inefficient and may not adapt well to datasets.

Method: Three models based on contrastive learning are proposed to learn codebooks adaptively from data.

Result: The models show superior robustness to adversarial attacks on four datasets compared to baselines.

Conclusion: Automated codebook learning via contrastive learning improves ECOC effectiveness for multiclass classification.

Abstract: Although one-hot encoding is commonly used for multiclass classification, it is not always the most effective encoding mechanism. Error Correcting Output Codes (ECOC) address multiclass classification by mapping each class to a unique codeword used as a label. Traditional ECOC methods rely on manually designed or randomly generated codebooks, which are labor-intensive and may yield suboptimal, dataset-agnostic results. This paper introduces three models for automated codebook learning based on contrastive learning, allowing codebooks to be learned directly and adaptively from data. Across four datasets, our proposed models demonstrate superior robustness to adversarial attacks compared to two baselines. The source is available at https://github.com/YuChou20/Automated-Codebook-Learning-with-Error-Correcting-Output-Code-Technique.
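
The ECOC decoding step, for reference: classify by the nearest codeword in Hamming distance. The paper's contribution is learning the codebook contrastively rather than fixing it, as done here:

```python
import numpy as np

def ecoc_decode(outputs, codebook):
    """Classify by nearest codeword: each class owns a row of `codebook`,
    and the binary outputs of the base learners are matched to the closest
    row in Hamming distance, so a few flipped bits can be corrected."""
    dists = (outputs[:, None, :] != codebook[None, :, :]).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy usage: 4 classes encoded with hand-written 6-bit codewords
codebook = np.array([[0, 0, 0, 1, 1, 1],
                     [0, 1, 1, 0, 0, 1],
                     [1, 0, 1, 0, 1, 0],
                     [1, 1, 0, 1, 0, 0]])
outputs = np.array([[0, 1, 1, 0, 1, 1]])   # one bit flipped from class 1's codeword
print(ecoc_decode(outputs, codebook))      # -> [1]: the error is corrected
```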

[379] Nonlocal Monte Carlo via Reinforcement Learning

Dmitrii Dobrynin, Masoud Mohseni, John Paul Strachan

Main category: cs.LG

TL;DR: The paper proposes using deep reinforcement learning to train nonlocal transition policies for Nonequilibrium Nonlocal Monte Carlo (NMC) algorithms, improving performance on hard combinatorial optimization problems.

DetailsMotivation: Conventional MCMC methods struggle with hard benchmarks due to homogeneous temperature profiles, leading to inefficiency in exploring solution spaces.

Method: Deep reinforcement learning is used to train NMC’s nonlocal transition policies, leveraging energy changes and landscape geometry as rewards and states.

Result: The trained policies outperform standard MCMC and nonlocal simulated annealing on hard 4-SAT benchmarks in energy, speed, and solution diversity.

Conclusion: Deep RL-enhanced NMC offers a promising approach to tackle challenging combinatorial optimization problems more effectively.

Abstract: Optimizing or sampling complex cost functions of combinatorial optimization problems is a longstanding challenge across disciplines and applications. When employing the family of conventional algorithms based on Markov Chain Monte Carlo (MCMC), such as simulated annealing or parallel tempering, one assumes homogeneous (equilibrium) temperature profiles across the input. This instance-independent approach was shown to be ineffective for the hardest benchmarks near a computational phase transition when the so-called overlap-gap-property holds. In these regimes, conventional MCMC struggles to unfreeze rigid variables, escape suboptimal basins of attraction, and sample high-quality and diverse solutions. To mitigate these challenges, Nonequilibrium Nonlocal Monte Carlo (NMC) algorithms were proposed that leverage inhomogeneous temperature profiles, thereby accelerating exploration of the configuration space without compromising its exploitation. Here, we employ deep reinforcement learning (RL) to train the nonlocal transition policies of NMC, which were previously designed phenomenologically. We demonstrate that the resulting solver can be trained solely by observing energy changes of the configuration space exploration as RL rewards and the local minimum energy landscape geometry as RL states. We further show that the trained policies improve upon the standard MCMC-based and nonlocal simulated annealing on hard uniform random and scale-free random 4-SAT benchmarks in terms of residual energy, time-to-solution, and solution-diversity metrics.

[380] Projected Coupled Diffusion for Test-Time Constrained Joint Generation

Hao Luan, Yi Xian Goh, See-Kiong Ng, Chun Kai Ling

Main category: cs.LG

TL;DR: PCD is a test-time framework for joint generation using multiple diffusion models, enforcing constraints without retraining.

DetailsMotivation: Existing methods struggle with joint correlated sampling and enforcing constraints without costly retraining.

Method: PCD introduces coupled guidance and projection steps to coordinate models and enforce constraints.

Result: PCD improves coupling and guarantees constraint satisfaction efficiently in tasks like image-pair generation and motion planning.

Conclusion: PCD effectively addresses joint generation challenges without computational overhead.

Abstract: Modifications to test-time sampling have emerged as an important extension to diffusion algorithms, with the goal of biasing the generative process to achieve a given objective without having to retrain the entire diffusion model. However, generating jointly correlated samples from multiple pre-trained diffusion models while simultaneously enforcing task-specific constraints without costly retraining has remained challenging. To this end, we propose Projected Coupled Diffusion (PCD), a novel test-time framework for constrained joint generation. PCD introduces a coupled guidance term into the generative dynamics to encourage coordination between diffusion models and incorporates a projection step at each diffusion step to enforce hard constraints. Empirically, we demonstrate the effectiveness of PCD in application scenarios of image-pair generation, object manipulation, and multi-robot motion planning. Our results show improved coupling effects and guaranteed constraint satisfaction without incurring excessive computational costs.
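
A heavily simplified single-step sketch of the coupled-guidance-plus-projection idea; the coupling form and the box constraint are illustrative stand-ins:

```python
import numpy as np

def pcd_step(x1, x2, score1, score2, lam=0.5, step=0.1):
    """One simplified PCD-style update: each sample follows its own
    model's score, a coupling term pulls the pair together, and a
    projection enforces a hard constraint at every diffusion step."""
    g = lam * (x2 - x1)                       # coupled guidance term
    x1 = x1 + step * (score1 + g)
    x2 = x2 + step * (score2 - g)
    # projection step: enforce a shared box constraint as a toy example
    return np.clip(x1, -1, 1), np.clip(x2, -1, 1)

# Toy usage with two coupled samples and their model scores
x1, x2 = np.array([0.9, -0.2]), np.array([-0.5, 0.4])
s1, s2 = np.array([0.3, 0.1]), np.array([0.2, -0.3])
print(pcd_step(x1, x2, s1, s2))
```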

[381] Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation

Brian Shing-Hei Wong, Joshua Mincheol Kim, Sin-Hang Fung, Qing Xiong, Kelvin Fu-Kiu Ao, Junkang Wei, Ran Wang, Dan Michelle Wang, Jingying Zhou, Bo Feng, Alfred Sze-Lok Cheng, Kevin Y. Yip, Stephen Kwok-Wing Tsui, Qin Cao

Main category: cs.LG

TL;DR: Applm, a framework using xTrimoPGLM protein language model, outperforms existing methods in allergen prediction tasks, including novel allergen identification and mutation impact assessment.

DetailsMotivation: Allergens pose a public health challenge, and accurate identification is needed to address it.

Method: Applm leverages the xTrimoPGLM protein language model for allergen prediction, tested in diverse real-world scenarios.

Result: Applm consistently outperforms seven state-of-the-art methods in tasks like identifying novel allergens and assessing mutations.

Conclusion: xTrimoPGLM’s general protein sequence understanding is key to Applm’s success, and the framework is open-source for future research.

Abstract: Allergens, typically proteins capable of triggering adverse immune responses, represent a significant public health challenge. To accurately identify allergen proteins, we introduce Applm (Allergen Prediction with Protein Language Models), a computational framework that leverages the 100-billion parameter xTrimoPGLM protein language model. We show that Applm consistently outperforms seven state-of-the-art methods in a diverse set of tasks that closely resemble difficult real-world scenarios. These include identifying novel allergens that lack similar examples in the training set, differentiating between allergens and non-allergens among homologs with high sequence similarity, and assessing functional consequences of mutations that create few changes to the protein sequences. Our analysis confirms that xTrimoPGLM, originally trained on one trillion tokens to capture general protein sequence characteristics, is crucial for Applm’s performance by detecting important differences among protein sequences. In addition to providing Applm as open-source software, we also provide our carefully curated benchmark datasets to facilitate future research.

[382] Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards

Zetian Sun, Dongfang Li, Zhuoen Chen, Yuhuai Qin, Baotian Hu

Main category: cs.LG

TL;DR: The paper introduces Gated Reward Accumulation (G-RA) to address reward sparsity and misalignment in long-horizon RL tasks, specifically for software engineering, improving completion and modification rates.

DetailsMotivation: Reward sparsity and misalignment in long-horizon RL tasks hinder performance, especially in software engineering where multi-turn reasoning and verification are crucial.

Method: Proposes SWE-oriented RL Framework with multi-turn interaction and G-RA, which accumulates rewards only when long-term rewards meet a threshold.

Result: G-RA significantly improves completion rates (47.6% to 93.8%) and modification rates (19.6% to 23.8%) in experiments.

Conclusion: Balanced reward accumulation via G-RA is effective for long-horizon RL, offering a practical solution for software engineering tasks.

Abstract: Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwise critics, but misalignment between immediate rewards and long-term objectives can lead to reward hacking and suboptimal policies. In this work, we address this problem in the context of software engineering (SWE) tasks, where multi-turn reasoning and rule-based verification are critical. We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, docker-based execution, and customizable reward functions. Additionally, we propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level (long-term) rewards meet a predefined threshold, ensuring stable RL optimization. Experiments on SWE-bench Verified and kBench demonstrate that G-RA leads to an increase in completion rates (47.6% $\rightarrow$ 93.8% and 22.0% $\rightarrow$ 86.0%) and modification rates (19.6% $\rightarrow$ 23.8% and 12.0% $\rightarrow$ 42.0%), while avoiding policy degradation caused by reward misalignment. Our findings highlight the importance of balanced reward accumulation in long-horizon RL and provide a practical solution.
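
The gating rule itself is simple: stepwise rewards count only if the episode's high-level reward clears the threshold. A minimal sketch:

```python
def gated_accumulated_return(immediate, long_term, threshold):
    """Gated Reward Accumulation sketch: immediate (stepwise) rewards are
    credited only when the high-level, long-term reward meets a predefined
    threshold, so stepwise critics cannot be gamed against the long-term
    objective."""
    gate = 1.0 if long_term >= threshold else 0.0
    return long_term + gate * sum(immediate)

# Toy usage: identical stepwise rewards, opposite gating outcomes
steps = [0.2, 0.5, 0.3]
print(gated_accumulated_return(steps, long_term=1.0, threshold=1.0))  # 2.0
print(gated_accumulated_return(steps, long_term=0.0, threshold=1.0))  # 0.0
```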

[383] Technical Report: Facilitating the Adoption of Causal Inference Methods Through LLM-Empowered Co-Pilot

Jeroen Berrevoets, Julianna Piskorz, Robert Davis, Harry Amad, Jim Weatherall, Mihaela van der Schaar

Main category: cs.LG

TL;DR: CATE-B is an open-source system using LLMs to simplify treatment effect estimation by guiding users through causal modeling, adjustment set identification, and method selection.

DetailsMotivation: The complexity of treatment effect estimation limits adoption despite advances in machine learning and causal inference. CATE-B aims to make the process more accessible.

Method: CATE-B employs LLMs for causal discovery, edge orientation, and adjustment set identification, along with tailored regression method selection.

Result: The system provides a user-friendly framework for causal analysis and includes benchmark tasks for reproducibility.

Conclusion: CATE-B reduces barriers to rigorous causal analysis and sets a foundation for automated treatment effect estimation benchmarks.

Abstract: Estimating treatment effects (TE) from observational data is a critical yet complex task in many fields, from healthcare and economics to public policy. While recent advances in machine learning and causal inference have produced powerful estimation techniques, their adoption remains limited due to the need for deep expertise in causal assumptions, adjustment strategies, and model selection. In this paper, we introduce CATE-B, an open-source co-pilot system that uses large language models (LLMs) within an agentic framework to guide users through the end-to-end process of treatment effect estimation. CATE-B assists in (i) constructing a structural causal model via causal discovery and LLM-based edge orientation, (ii) identifying robust adjustment sets through a novel Minimal Uncertainty Adjustment Set criterion, and (iii) selecting appropriate regression methods tailored to the causal structure and dataset characteristics. To encourage reproducibility and evaluation, we release a suite of benchmark tasks spanning diverse domains and causal complexities. By combining causal inference with intelligent, interactive assistance, CATE-B lowers the barrier to rigorous causal analysis and lays the foundation for a new class of benchmarks in automated treatment effect estimation.
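
As an illustration of the final stage of such a pipeline, the sketch below shows plain regression adjustment: once an adjustment set has been identified, an ordinary least-squares fit of the outcome on treatment and adjustment covariates recovers the treatment effect. This is a generic textbook estimator on synthetic data, not CATE-B's method-selection logic.

```python
# Minimal regression-adjustment sketch given an adjustment set {z}.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                           # confounder in adjustment set
t = (z + rng.normal(size=n) > 0).astype(float)   # treatment depends on z
y = 2.0 * t + 1.5 * z + rng.normal(size=n)       # true effect of t is 2.0

# OLS of y on [1, t, z]; the coefficient on t is the adjusted effect estimate.
X = np.column_stack([np.ones(n), t, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"estimated effect ~ {beta[1]:.2f}")       # close to 2.0 after adjustment
```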

[384] GNN-based Unified Deep Learning

Furkan Pala, Islem Rekik

Main category: cs.LG

TL;DR: Unified learning improves generalizability of diverse deep learning models in medical imaging by unifying them in a shared graph learning space guided by a GNN.

DetailsMotivation: Addressing the challenge of maintaining model generalizability under domain shifts in medical imaging, where hospitals may use different architectures (MLPs, CNNs, GNNs) for varied data types.

Method: Proposes unified learning, encoding models into a graph representation and optimizing them via a unified GNN (uGNN), enabling parameter sharing and knowledge transfer across architectures.

Result: Unified learning enhances performance on mixed-distribution test sets, showing robustness to large distribution shifts in MorphoMNIST and MedMNIST benchmarks.

Conclusion: The unified learning paradigm effectively improves generalizability and robustness of heterogeneous models in medical imaging, supporting diverse architectures and data distributions.

Abstract: Deep learning models often struggle to maintain generalizability in medical imaging, particularly under domain-fracture scenarios where distribution shifts arise from varying imaging techniques, acquisition protocols, patient populations, demographics, and equipment. In practice, each hospital may need to train distinct models - differing in learning task, width, and depth - to match local data. For example, one hospital may use Euclidean architectures such as MLPs and CNNs for tabular or grid-like image data, while another may require non-Euclidean architectures such as graph neural networks (GNNs) for irregular data like brain connectomes. How to train such heterogeneous models coherently across datasets, while enhancing each model’s generalizability, remains an open problem. We propose unified learning, a new paradigm that encodes each model into a graph representation, enabling unification in a shared graph learning space. A GNN then guides optimization of these unified models. By decoupling parameters of individual models and controlling them through a unified GNN (uGNN), our method supports parameter sharing and knowledge transfer across varying architectures (MLPs, CNNs, GNNs) and distributions, improving generalizability. Evaluations on MorphoMNIST and two MedMNIST benchmarks - PneumoniaMNIST and BreastMNIST - show that unified learning boosts performance when models are trained on unique distributions and tested on mixed ones, demonstrating strong robustness to unseen data with large distribution shifts. Code and benchmarks: https://github.com/basiralab/uGNN
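
One way to picture the unification step is that each network is serialized into a graph whose nodes are layers. The toy encoding below (layer type and width as node features, forward-pass edges) is an assumption for illustration, not the paper's exact representation; a GNN operating on such graphs could then emit per-layer parameter updates.

```python
# Assumed toy encoding of heterogeneous models as graphs.
def model_to_graph(layers):
    """layers: list of (layer_type, width) tuples in forward order."""
    type_ids = {"linear": 0, "conv": 1, "gnn": 2}
    nodes = [[type_ids[t], w] for t, w in layers]       # node feature vectors
    edges = [(i, i + 1) for i in range(len(layers) - 1)]  # forward-pass edges
    return nodes, edges

mlp = model_to_graph([("linear", 128), ("linear", 64), ("linear", 10)])
cnn = model_to_graph([("conv", 32), ("conv", 64), ("linear", 10)])
print(mlp)  # ([[0, 128], [0, 64], [0, 10]], [(0, 1), (1, 2)])
```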

[385] Self-Supervised Temporal Super-Resolution of Energy Data using Generative Adversarial Transformer

Xuanhao Mu, Gökhan Demirel, Yuzhe Zhang, Jianlei Liu, Thorsten Schlachter, Veit Hagenmeyer

Main category: cs.LG

TL;DR: The paper introduces Generative Adversarial Transformers (GATs) to improve upsampling of energy network time series, reducing RMSE by 9% and improving MPC accuracy by 13%.

DetailsMotivation: To address the limitations of conventional upsampling methods and advanced models (like generative or Super-Resolution models) that suffer from information loss, noise, or reliance on unavailable high-resolution data.

Method: Proposes a new method using Generative Adversarial Transformers (GATs), which can be trained without ground-truth high-resolution data.

Result: The method reduces RMSE by 9% and improves MPC accuracy by 13% compared to conventional interpolation.

Conclusion: GATs offer a promising solution for upsampling in energy network applications, overcoming key challenges of existing methods.

Abstract: To bridge the temporal granularity gap in energy network design and operation based on Energy System Models, resampling of time series is required. While conventional upsampling methods are computationally efficient, they often result in significant information loss or increased noise. Advanced models such as time series generation models, Super-Resolution models and imputation models show potential, but also face fundamental challenges. The goal of time series generative models is to learn the distribution of the original data to generate high-resolution series with similar statistical characteristics. This is not entirely consistent with the definition of upsampling. Time series Super-Resolution models or imputation models can degrade the accuracy of upsampling because the input low-resolution time series are sparse and may have insufficient context. Moreover, such models usually rely on supervised learning paradigms. This presents a fundamental application paradox: their training requires the high-resolution time series that is intrinsically absent in upsampling application scenarios. To address the mentioned upsampling issue, this paper introduces a new method utilizing Generative Adversarial Transformers (GATs), which can be trained without access to any ground-truth high-resolution data. Compared with conventional interpolation methods, the introduced method can reduce the root mean square error (RMSE) of upsampling tasks by 9%, and the accuracy of a model predictive control (MPC) application scenario is improved by 13%.
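
A plausible reading of how training can proceed without ground-truth high-resolution data is a downsampling-consistency constraint: whatever the generator produces must average back to the observed low-resolution series. The sketch below shows only that assumed self-supervision signal; the adversarial transformer itself is omitted.

```python
# Downsampling-consistency check (assumed training signal, not the
# authors' full GAT objective): upsampled output must reproduce the input
# when averaged back to the original resolution.
import numpy as np

def downsample(hr, factor):
    return hr.reshape(-1, factor).mean(axis=1)

def consistency_loss(lr, hr, factor):
    return float(np.mean((downsample(hr, factor) - lr) ** 2))

lr = np.array([1.0, 3.0])                 # 2 coarse steps
hr_good = np.array([0.5, 1.5, 2.5, 3.5])  # averages back to [1, 3]
hr_bad = np.array([1.0, 1.0, 1.0, 1.0])
print(consistency_loss(lr, hr_good, 2))   # 0.0
print(consistency_loss(lr, hr_bad, 2))    # > 0
```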

[386] FreeGAD: A Training-Free yet Effective Approach for Graph Anomaly Detection

Yunfeng Zhao, Yixin Liu, Shiyuan Li, Qingfeng Chen, Yu Zheng, Shirui Pan

Main category: cs.LG

TL;DR: FreeGAD is a training-free graph anomaly detection method that outperforms existing deep learning-based approaches in performance, efficiency, and scalability.

DetailsMotivation: Existing deep learning-based GAD methods are costly and poorly scalable due to complex training. Empirical findings suggest training contributes less to performance than assumed.

Method: FreeGAD uses an affinity-gated residual encoder for anomaly-aware representations and anchor nodes to guide anomaly scoring via statistical deviations.

Result: FreeGAD achieves superior performance, efficiency, and scalability on multiple benchmark datasets without training.

Conclusion: FreeGAD is an effective, scalable, and training-free alternative to traditional GAD methods.

Abstract: Graph Anomaly Detection (GAD) aims to identify nodes that deviate from the majority within a graph, playing a crucial role in applications such as social networks and e-commerce. Despite the current advancements in deep learning-based GAD, existing approaches often suffer from high deployment costs and poor scalability due to their complex and resource-intensive training processes. Surprisingly, our empirical findings suggest that the training phase of deep GAD methods, commonly perceived as crucial, may actually contribute less to anomaly detection performance than expected. Inspired by this, we propose FreeGAD, a novel training-free yet effective GAD method. Specifically, it leverages an affinity-gated residual encoder to generate anomaly-aware representations. Meanwhile, FreeGAD identifies anchor nodes as pseudo-normal and anomalous guides, followed by calculating anomaly scores through anchor-guided statistical deviations. Extensive experiments demonstrate that FreeGAD achieves superior anomaly detection performance, efficiency, and scalability on multiple benchmark datasets from diverse domains, without any training or iterative optimization.
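
A rough numpy sketch of anchor-guided scoring: pick pseudo-normal and pseudo-anomalous anchors from precomputed node representations and score every node by its relative distance to the two anchor sets. The anchor-selection rule and all names here are assumptions; FreeGAD's affinity-gated residual encoder is not reproduced.

```python
# Assumed anchor-guided scoring sketch: higher score = more anomalous.
import numpy as np

def anomaly_scores(reps, k_anchors=5):
    center = reps.mean(axis=0)
    dist_to_center = np.linalg.norm(reps - center, axis=1)
    normal_idx = np.argsort(dist_to_center)[:k_anchors]      # pseudo-normal
    anomalous_idx = np.argsort(dist_to_center)[-k_anchors:]  # pseudo-anomalous
    d_norm = np.linalg.norm(reps[:, None] - reps[normal_idx], axis=2).mean(1)
    d_anom = np.linalg.norm(reps[:, None] - reps[anomalous_idx], axis=2).mean(1)
    return d_norm - d_anom

reps = np.random.default_rng(0).normal(size=(100, 16))
reps[:3] += 5.0              # plant three outliers
print(np.argsort(anomaly_scores(reps))[-3:])  # planted outliers should rank highest
```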

[387] On Spectral Properties of Gradient-based Explanation Methods

Amir Mehrpanah, Erik Englesson, Hossein Azizpour

Main category: cs.LG

TL;DR: The paper analyzes deep network explanation methods, identifying spectral bias and proposing formal solutions like SpectralLens for reliable explanations.

DetailsMotivation: To address reliability issues in explaining deep network predictions by introducing formal probabilistic and spectral perspectives.

Method: Adopts probabilistic and spectral analysis to study explanation methods, identifies spectral bias, and proposes remedies like SpectralLens.

Result: Reveals spectral bias in gradient-based methods, introduces standardized perturbation scale and SpectralLens for consistent explanations.

Conclusion: The proposed formalism and SpectralLens improve explanation reliability, supported by theoretical and quantitative validation.

Abstract: Understanding the behavior of deep networks is crucial to increase our confidence in their results. Despite an extensive body of work for explaining their predictions, researchers have faced reliability issues, which can be attributed to insufficient formalism. In our research, we adopt novel probabilistic and spectral perspectives to formally analyze explanation methods. Our study reveals a pervasive spectral bias stemming from the use of gradient, and sheds light on some common design choices that have been discovered experimentally, in particular, the use of squared gradient and input perturbation. We further characterize how the choice of perturbation hyperparameters in explanation methods, such as SmoothGrad, can lead to inconsistent explanations and introduce two remedies based on our proposed formalism: (i) a mechanism to determine a standard perturbation scale, and (ii) an aggregation method which we call SpectralLens. Finally, we substantiate our theoretical results through quantitative evaluations.
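
SmoothGrad, the method whose perturbation scale the paper formalizes, averages gradients over Gaussian-perturbed inputs. A minimal sketch on a toy function with an analytic gradient (the toy function and sample count are illustrative):

```python
# SmoothGrad (Smilkov et al., 2017): average gradients over noisy inputs.
import numpy as np

def f_grad(x):                       # gradient of f(x) = sum(sin(x))
    return np.cos(x)

def smoothgrad(x, sigma=0.5, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=sigma, size=(n_samples, *x.shape))
    return f_grad(x + noise).mean(axis=0)

x = np.array([0.0, 1.0, 2.0])
print(f_grad(x))      # raw gradient
print(smoothgrad(x))  # smoothed: high-frequency components are attenuated
```

The choice of sigma controls how strongly high frequencies are damped, which is exactly the hyperparameter the paper's standardized perturbation scale addresses.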

[388] Oops!… They Stole it Again: Attacks on Split Learning

Tanveer Khan, Antonis Michalas

Main category: cs.LG

TL;DR: The paper reviews security challenges in Split Learning (SL), classifying attacks and analyzing defenses to improve privacy.

DetailsMotivation: To address the security risks introduced by the distributed nature of SL and explore potential attacks and defenses.

Method: Systematic review of attacks on SL, classified by attacker role, privacy risks, timing, and vulnerabilities, along with analysis of defense methods.

Result: Identifies security gaps, evaluates defense effectiveness, and highlights limitations.

Conclusion: Provides insights to enhance SL privacy and guides future research directions.

Abstract: Split Learning (SL) is a collaborative learning approach that improves privacy by keeping data on the client-side while sharing only the intermediate output with a server. However, the distributed nature of SL introduces new security challenges, necessitating a comprehensive exploration of potential attacks. This paper systematically reviews various attacks on SL, classifying them based on factors such as the attacker’s role, the type of privacy risks, when data leaks occur, and where vulnerabilities exist. We also analyze existing defense methods, including cryptographic methods, data modification approaches, distributed techniques, and hybrid solutions. Our findings reveal security gaps, highlighting the effectiveness and limitations of existing defenses. By identifying open challenges and future directions, this work provides valuable information to improve SL privacy issues and guide further research.

[389] Variance Reduced Policy Gradient Method for Multi-Objective Reinforcement Learning

Davide Guidobene, Lorenzo Benedetti, Diego Arapovic

Main category: cs.LG

TL;DR: The paper proposes variance-reduction techniques to improve sample efficiency in Multi-Objective Reinforcement Learning (MORL) while maintaining scalability.

DetailsMotivation: MORL addresses complex decision-making with conflicting objectives, but existing policy gradient methods (PGMs) suffer from high sample inefficiency and overly strict assumptions.

Method: The work implements variance-reduction techniques to reduce sample complexity in PGMs for MORL.

Result: The approach aims to enhance sample efficiency without sacrificing scalability to large state-action spaces.

Conclusion: The proposed method offers a practical solution to improve MORL’s effectiveness by balancing sample efficiency and scalability.

Abstract: Multi-Objective Reinforcement Learning (MORL) is a generalization of traditional Reinforcement Learning (RL) that aims to optimize multiple, often conflicting objectives simultaneously rather than focusing on a single reward. This approach is crucial in complex decision-making scenarios where agents must balance trade-offs between various goals, such as maximizing performance while minimizing costs. We consider the problem of MORL where the objectives are combined using a non-linear scalarization function. Just like in standard RL, policy gradient methods (PGMs) are amongst the most effective for handling large and continuous state-action spaces in MORL. However, existing PGMs for MORL suffer from high sample inefficiency, requiring large amounts of data to be effective. Previous attempts to solve this problem rely on overly strict assumptions, losing PGMs’ benefits in scalability to large state-action spaces. In this work, we address the issue of sample efficiency by implementing variance-reduction techniques to reduce the sample complexity of policy gradients while maintaining general assumptions.
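
For intuition, the classic baseline trick illustrates how variance reduction works in score-function gradients; the paper's estimator for non-linearly scalarized objectives is more involved. In this toy sketch the scalarization g and the stand-in score term are assumptions:

```python
# Baseline subtraction reduces the variance of score-function gradient
# estimates without changing their expectation (toy illustration only).
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(loc=[1.0, 2.0], scale=1.0, size=(1000, 2))  # 2 objectives
g = np.minimum(returns[:, 0], returns[:, 1])  # non-linear scalarization
score = rng.normal(size=1000)                 # stand-in for grad log pi terms

naive = g * score
baselined = (g - g.mean()) * score            # same mean, lower variance
print(naive.var(), baselined.var())
```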

[390] Beyond Random Sampling: Instance Quality-Based Data Partitioning via Item Response Theory

Lucas Cardoso, Vitor Santos, José Ribeiro Filho, Ricardo Prudêncio, Regiane Kawasaki, Ronnie Alves

Main category: cs.LG

TL;DR: The paper proposes using Item Response Theory (IRT) to improve dataset partitioning for ML model validation, revealing instance heterogeneity and improving model performance understanding.

DetailsMotivation: Traditional data partitioning ignores instance quality, so the study aims to leverage IRT for better validation.

Method: IRT parameters guide dataset partitioning, evaluated on four tabular datasets with various ML models.

Result: IRT identifies instance heterogeneity and subgroups, improving bias-variance tradeoff understanding. High-guessing instances harm performance.

Conclusion: IRT-informed partitioning enhances ML validation by revealing dataset structure and optimizing model performance.

Abstract: Robust validation of Machine Learning (ML) models is essential, but traditional data partitioning approaches often ignore the intrinsic quality of each instance. This study proposes the use of Item Response Theory (IRT) parameters to characterize and guide the partitioning of datasets in the model validation stage. The impact of IRT-informed partitioning strategies on the performance of several ML models in four tabular datasets was evaluated. The results obtained demonstrate that IRT reveals an inherent heterogeneity of the instances and highlights the existence of informative subgroups of instances within the same dataset. Based on IRT, balanced partitions were created that consistently help to better understand the tradeoff between bias and variance of the models. In addition, the guessing parameter proved to be a determining factor: training with high-guessing instances can significantly impair model performance and resulted in cases with accuracy below 50%, while other partitions reached more than 70% in the same dataset.
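
The guessing parameter discussed above comes from the three-parameter logistic (3PL) IRT model. A quick sketch of the item characteristic curve, with an illustrative threshold for flagging high-guessing instances:

```python
# 3PL item characteristic curve: probability that a respondent of ability
# theta answers instance i correctly, given discrimination a, difficulty b,
# and guessing c.
import numpy as np

def p_correct_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = 0.0                                        # average-ability model
print(p_correct_3pl(theta, a=1.5, b=0.0, c=0.0))   # 0.5: informative item
print(p_correct_3pl(theta, a=1.5, b=0.0, c=0.45))  # ~0.72: high-guessing item

# Partitioning idea: flag high-guessing instances (e.g., c > 0.3) so they
# can be isolated from training splits, echoing the paper's finding that
# they can degrade performance. The threshold is illustrative.
```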

[391] Energy-Based Models for Predicting Mutational Effects on Proteins

Patrick Soga, Zhenyu Lei, Yinhan He, Camille Bilodeau, Jundong Li

Main category: cs.LG

TL;DR: A new method for predicting binding free energy changes ($\Delta\Delta G$) in protein engineering by decomposing it into sequence- and structure-based components, outperforming existing deep learning methods.

DetailsMotivation: Predicting $\Delta\Delta G$ is crucial for drug discovery, but existing methods struggle with estimating full conformational distributions. This work aims to simplify the process using energy-based models.

Method: Decomposes $\Delta\Delta G$ into sequence-based (inverse folding model) and structure-based (energy model) components, assuming equilibrium between bound and unbound states.

Result: Outperforms state-of-the-art deep learning methods in $\Delta\Delta G$ prediction and SARS-CoV-2 antibody optimization.

Conclusion: The proposed approach offers a tractable and accurate method for $\Delta\Delta G$ prediction, combining physical inductive bias with deep learning.

Abstract: Predicting changes in binding free energy ($\Delta\Delta G$) is a vital task in protein engineering and protein-protein interaction (PPI) engineering for drug discovery. Previous works have observed a high correlation between $\Delta\Delta G$ and entropy, using probabilities of biologically important objects such as side chain angles and residue identities to estimate $\Delta\Delta G$. However, estimating the full conformational distribution of a protein complex is generally considered intractable. In this work, we propose a new approach to $\Delta\Delta G$ prediction that avoids this issue by instead leveraging energy-based models for estimating the probability of a complex’s conformation. Specifically, we introduce a novel decomposition of $\Delta\Delta G$ into a sequence-based component estimated by an inverse folding model and a structure-based component estimated by an energy model. This decomposition is made tractable by assuming equilibrium between the bound and unbound states, allowing us to simplify the estimation of degeneracies associated with each state. Unlike previous deep learning-based methods, our method incorporates an energy-based physical inductive bias by connecting the often-used sequence log-odds ratio-based approach to $\Delta\Delta G$ prediction with a new $\Delta\Delta E$ term grounded in statistical mechanics. We demonstrate superiority over existing state-of-the-art structure- and sequence-based deep learning methods in $\Delta\Delta G$ prediction and antibody optimization against SARS-CoV-2.
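
In schematic form, the decomposition described in the abstract can be written as below; the notation is assumed for illustration, and the exact derivation is in the paper.

```latex
% Sequence term: inverse-folding log-odds; structure term: energy change.
\Delta\Delta G \;\approx\;
\underbrace{-k_B T \log
  \frac{p_{\mathrm{IF}}(s_{\mathrm{mut}} \mid x)}
       {p_{\mathrm{IF}}(s_{\mathrm{wt}} \mid x)}}_{\text{sequence-based}}
\;+\;
\underbrace{\Delta\Delta E(x_{\mathrm{mut}}, x_{\mathrm{wt}})}_{\text{structure-based}}
```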

[392] Conditional Information Bottleneck for Multimodal Fusion: Overcoming Shortcut Learning in Sarcasm Detection

Yihua Wang, Qi Jia, Cong Xu, Feiyu Chen, Yuhan Liu, Haotian Zhang, Liang Jin, Lu Liu, Zhichun Wang

Main category: cs.LG

TL;DR: The paper addresses shortcut learning in multimodal sarcasm detection, introduces MUStARD++$^{R}$ to remove shortcuts, and proposes MCIB for effective fusion.

DetailsMotivation: Shortcut learning in datasets impairs generalization, and current fusion strategies are ineffective for sarcasm detection.

Method: Constructs MUStARD++$^{R}$ by removing shortcuts and introduces MCIB for efficient multimodal fusion.

Result: MCIB achieves top performance without relying on shortcuts.

Conclusion: Effective modality fusion is crucial for sarcasm detection, and MCIB addresses current limitations.

Abstract: Multimodal sarcasm detection is a complex task that requires distinguishing subtle complementary signals across modalities while filtering out irrelevant information. Many advanced methods rely on learning shortcuts from datasets rather than extracting intended sarcasm-related features. However, our experiments show that shortcut learning impairs the model’s generalization in real-world scenarios. Furthermore, we reveal the weaknesses of current modality fusion strategies for multimodal sarcasm detection through systematic experiments, highlighting the necessity of focusing on effective modality fusion for complex emotion recognition. To address these challenges, we construct MUStARD++$^{R}$ by removing shortcut signals from MUStARD++. Then, a Multimodal Conditional Information Bottleneck (MCIB) model is introduced to enable efficient multimodal fusion for sarcasm detection. Experimental results show that the MCIB achieves the best performance without relying on shortcut learning.
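
For reference, the generic information-bottleneck objective that MCIB builds on is shown below; the paper's multimodal conditional variant extends it, and this is not the authors' exact loss.

```latex
% Generic IB: compress input X into code Z while keeping Z predictive of Y.
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta \, I(Z;Y)
```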

[393] SPHENIC: Topology-Informed Multi-View Clustering for Spatial Transcriptomics

Chenkai Guo, Yikai Zhu, Jing Yangum, Renxiang Guan, Por Lip Yee, Guangdun Peng, Dayu Hu

Main category: cs.LG

TL;DR: SPHENIC improves spatial transcriptomics clustering by integrating topological features and optimizing spatial embeddings, outperforming existing methods.

DetailsMotivation: Existing methods for spatial-transcriptomics clustering are limited by noisy data and poor spatial neighborhood modeling.

Method: SPHENIC uses persistent homology for stable representation learning and a Spatial Constraint and Distribution Optimization Module (SCDOM) for high-quality spatial embeddings.

Result: SPHENIC outperforms state-of-the-art methods by 3.31%-6.54% on 14 benchmark datasets.

Conclusion: SPHENIC effectively addresses limitations in spatial clustering, offering superior performance.

Abstract: By incorporating spatial location information, spatial-transcriptomics clustering yields more comprehensive insights into cell subpopulation identification. Despite recent progress, existing methods have at least two limitations: (i) topological learning typically considers only representations of individual cells or their interaction graphs; however, spatial transcriptomic profiles are often noisy, making these approaches vulnerable to low-quality topological signals, and (ii) insufficient modeling of spatial neighborhood information leads to low-quality spatial embeddings. To address these limitations, we propose SPHENIC, a novel Spatial Persistent Homology Enhanced Neighborhood Integrative Clustering method. Specifically, SPHENIC incorporates invariant topological features into the clustering network to achieve stable representation learning. Additionally, to construct high-quality spatial embeddings that reflect the true cellular distribution, we design the Spatial Constraint and Distribution Optimization Module (SCDOM). This module increases the similarity between a cell’s embedding and those of its spatial neighbors, decreases similarity with non-neighboring cells, and thereby produces clustering-friendly spatial embeddings. Extensive experiments on 14 benchmark spatial transcriptomic slices demonstrate that SPHENIC achieves superior performance on the spatial clustering task, outperforming the best existing state-of-the-art alternative by 3.31%-6.54%.

[394] Geospatial Diffusion for Land Cover Imperviousness Change Forecasting

Debvrat Varshney, Vibhas Vats, Bhartendu Pandey, Christa Brelsford, Philipe Dias

Main category: cs.LG

TL;DR: The paper proposes using Generative AI (GenAI) for forecasting land-use and land-cover change (LULC), addressing a gap in current Earth System models. It demonstrates feasibility with a diffusion model for imperviousness forecasting, outperforming a no-change baseline.

DetailsMotivation: Current Earth System models lack skill in forecasting LULC, a critical input for assessing risks and consequences in future climate scenarios. This gap motivates the use of GenAI for LULC forecasting.

Method: The paper frames LULC forecasting as a data synthesis problem using GenAI, specifically training a diffusion model for decadal imperviousness forecasting. Historical data for the conterminous U.S. is used.

Result: The diffusion model outperforms a no-change baseline in 12 metropolitan areas, showing lower MAE for resolutions ≥ 0.7×0.7 km², indicating its ability to capture spatiotemporal patterns.

Conclusion: The study validates GenAI’s potential for LULC forecasting and suggests future research to integrate auxiliary Earth data and scenario simulation via driver variables.

Abstract: Land cover, both present and future, has a significant effect on several important Earth system processes. For example, impervious surfaces heat up and speed up surface water runoff and reduce groundwater infiltration, with concomitant effects on regional hydrology and flood risk. While regional Earth System models have increasing skill at forecasting hydrologic and atmospheric processes at high resolution in future climate scenarios, our ability to forecast land-use and land-cover change (LULC), a critical input to risk and consequences assessment for these scenarios, has lagged behind. In this paper, we propose a new paradigm exploiting Generative AI (GenAI) for land cover change forecasting by framing LULC forecasting as a data synthesis problem conditioned on historical and auxiliary data-sources. We discuss desirable properties of generative models that underpin our research premise, and demonstrate the feasibility of our methodology through experiments on imperviousness forecasting using historical data covering the entire conterminous United States. Specifically, we train a diffusion model for decadal forecasting of imperviousness and compare its performance to a baseline that assumes no change at all. Evaluation across 12 metropolitan areas for a year held out during training indicates that for average resolutions $\geq 0.7\times0.7km^2$ our model yields MAE lower than such a baseline. This finding corroborates that such a generative model can capture spatiotemporal patterns from historical data that are significant for projecting future change. Finally, we discuss future research to incorporate auxiliary information on physical properties about the Earth, as well as supporting simulation of different scenarios by means of driver variables.

[395] Graph Learning via Logic-Based Weisfeiler-Leman Variants and Tabularization

Reijo Jaakkola, Tomi Janhunen, Antti Kuusisto, Magdalena Ortiz, Matias Selin, Mantas Šimkus

Main category: cs.LG

TL;DR: A novel graph classification method using Weisfeiler-Leman variants for tabularizing graph data, achieving competitive accuracy with improved efficiency.

DetailsMotivation: To develop a graph classification approach that combines the expressive power of Weisfeiler-Leman variants with the efficiency of tabular data methods.

Method: Tabularize graph data using Weisfeiler-Leman variants, analyze their expressive power, and test on benchmark datasets.

Result: Matches state-of-the-art accuracy with better time/memory efficiency; interpretable modal logic formulas can be extracted.

Conclusion: The approach is effective, efficient, and offers interpretability, making it a viable alternative to graph neural networks and kernels.

Abstract: We present a novel approach for graph classification based on tabularizing graph data via variants of the Weisfeiler-Leman algorithm and then applying methods for tabular data. We investigate a comprehensive class of Weisfeiler-Leman variants obtained by modifying the underlying logical framework and establish a precise theoretical characterization of their expressive power. We then test two selected variants on twelve benchmark datasets that span a range of different domains. The experiments demonstrate that our approach matches the accuracy of state-of-the-art graph neural networks and graph kernels while being more time or memory efficient, depending on the dataset. We also briefly discuss directly extracting interpretable modal logic formulas from graph datasets.
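
The tabularization idea can be grounded with vanilla 1-dimensional Weisfeiler-Leman color refinement, which turns each graph into a histogram of refined node colors usable by any tabular learner. The paper's logic-based variants modify the refinement rule; this sketch shows only the classic algorithm:

```python
# 1-WL color refinement producing a color-histogram feature per graph.
from collections import Counter

def wl_features(adj, rounds=3):
    """adj: {node: [neighbors]}. Returns a Counter of refined colors."""
    colors = {v: 0 for v in adj}                  # uniform initial coloring
    for _ in range(rounds):
        new = {}
        for v in adj:
            # Signature: own color plus sorted multiset of neighbor colors.
            new[v] = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
        # Compress signatures to small integer color ids.
        ids = {sig: i for i, sig in enumerate(sorted(set(new.values())))}
        colors = {v: ids[new[v]] for v in adj}
    return Counter(colors.values())               # color histogram = features

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl_features(triangle))  # all nodes share one color
print(wl_features(path))      # endpoints vs. middle node get different colors
```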

[396] MDNS: Masked Diffusion Neural Sampler via Stochastic Optimal Control

Yuchen Zhu, Wei Guo, Jaemoo Choi, Guan-Horng Liu, Yongxin Chen, Molei Tao

Main category: cs.LG

TL;DR: The paper introduces MDNS, a framework for training neural samplers to generate samples from complex discrete state spaces, outperforming other methods.

DetailsMotivation: The challenge of sampling from high-dimensional, multi-modal discrete distributions in fields like statistical physics and machine learning motivates the work.

Method: MDNS aligns path measures via learning objectives grounded in stochastic optimal control of Markov chains.

Result: MDNS efficiently samples from target distributions, even in high dimensions, and surpasses other baselines.

Conclusion: MDNS is a scalable and effective framework for discrete neural sampling, with demonstrated potential in various applications.

Abstract: We study the problem of learning a neural sampler to generate samples from discrete state spaces where the target probability mass function $\pi\propto\mathrm{e}^{-U}$ is known up to a normalizing constant, which is an important task in fields such as statistical physics, machine learning, combinatorial optimization, etc. To better address this challenging task when the state space has a large cardinality and the distribution is multi-modal, we propose the Masked Diffusion Neural Sampler (MDNS), a novel framework for training discrete neural samplers by aligning two path measures through a family of learning objectives, theoretically grounded in the stochastic optimal control of continuous-time Markov chains. We validate the efficiency and scalability of MDNS through extensive experiments on various distributions with distinct statistical properties, where MDNS learns to accurately sample from the target distributions despite the extremely high problem dimensions and outperforms other learning-based baselines by a large margin. A comprehensive study of ablations and extensions is also provided to demonstrate the efficacy and potential of the proposed framework.

[397] REFN: A Reinforcement-Learning-From-Network Framework against 1-day/n-day Exploitations

Tianlong Yu, Lihong Liu, Ziyi Zhou, Fudu Xing, Kailong Wang, Yang Yang

Main category: cs.LG

TL;DR: REFN is a novel framework using Reinforcement Learning (RL) and Large Language Models (LLMs) to autonomously generate network filters for preventing 1-day or n-day vulnerabilities, outperforming existing defenses in accuracy, speed, and scalability.

DetailsMotivation: Existing defenses like host-based patching and network filtering are inadequate due to scalability, compatibility, and deployment issues, especially for diverse or legacy systems.

Method: REFN employs RL driven by online network rewards, uses edge security gateways for unified deployment, and validates outputs with real traffic. It addresses LLM limitations via knowledge distillation, language-to-network translation, and online validation.

Result: REFN achieves 21.1% higher accuracy, reduces Mean Time To Patch to 3.65 hours, and scales to 10K devices.

Conclusion: REFN is a promising step toward using LLMs for rapid, large-scale prevention of 1-day or n-day exploits.

Abstract: The exploitation of 1-day or n-day vulnerabilities poses severe threats to networked devices due to massive deployment scales and delayed patching (the average Mean Time To Patch exceeds 60 days). Existing defenses, including host-based patching and network-based filtering, are inadequate due to limited scalability across diverse devices, compatibility issues especially with embedded or legacy systems, and error-prone deployment processes (manual patch validation). To address these issues, we introduce REFN (Reinforcement Learning From Network), a novel framework that trains Large Language Models (LLMs) to autonomously generate network filters that prevent 1-day or n-day exploitations. REFN ensures scalability by uniquely employing Reinforcement Learning (RL) driven by online network rewards instead of traditional Human Feedback (RLHF). REFN guarantees compatibility via unified deployment on edge security gateways (Amazon Eero). REFN provides robustness via online validation using real network traffic. Crucially, REFN addresses three core challenges in training LLMs for exploit prevention: 1) expanding current LLMs’ limited vulnerability-fixing expertise via Agentic-RAG-based Knowledge Distillation, 2) bridging current LLMs’ language-to-network gap through an RL-From-VNF Pipeline that translates language context (vulnerability descriptions) into network enforcement, and 3) addressing LLM hallucination and non-determinism via Online Agentic Validation that penalizes erroneous outputs. Evaluated across 22 families of 1-day or n-day exploits, REFN demonstrates effectiveness (21.1 percent higher accuracy than alternatives), efficiency (a Mean Time To Patch of 3.65 hours), and scalability (easily scaling to 10K devices). REFN serves as an initial step toward training LLMs to rapidly prevent massive-scale 1-day or n-day exploitations.

[398] Electromagnetic Simulations of Antennas on GPUs for Machine Learning Applications

Murat Temiz, Vemund Bakken

Main category: cs.LG

TL;DR: A GPU-powered antenna simulation framework is proposed for machine learning applications, outperforming CPUs and matching commercial EM software results.

DetailsMotivation: Addressing the data-hungry nature of machine learning in EM applications by leveraging GPUs for efficient large-scale antenna simulations.

Method: Utilizes open-source EM software (gprMax) with GPUs to simulate numerous antennas, comparing results with commercial software and evaluating ML/DL models.

Result: Entry-level GPUs outperform high-end CPUs, with gaming GPUs achieving 18x speedup; open-source software matches commercial results at fine resolutions.

Conclusion: GPU acceleration enables efficient large-scale antenna simulations for ML, with open-source tools providing viable alternatives to commercial software.

Abstract: This study proposes an antenna simulation framework powered by graphics processing units (GPUs) based on an open-source electromagnetic (EM) simulation software (gprMax) for machine learning applications of antenna design and optimization. Furthermore, it compares the simulation results with those obtained through commercial EM software. The proposed software framework for machine learning and surrogate model applications will produce antenna data sets consisting of a large number of antenna simulation results using GPUs. Although machine learning methods can attain the optimum solutions for many problems, they are known to be data-hungry and require a great deal of samples for the training stage of the algorithms. However, producing a sufficient number of training samples in EM applications within a limited time is challenging due to the high computational complexity of EM simulations. Therefore, GPUs are utilized in this study to simulate a large number of antennas with predefined or random antenna shape parameters to produce data sets. Moreover, this study also compares various machine learning and deep learning models in terms of antenna parameter estimation performance. This study demonstrates that an entry-level GPU substantially outperforms a high-end CPU in terms of computational performance, while a high-end gaming GPU can achieve around 18 times more computational performance compared to a high-end CPU. Moreover, it is shown that the open-source EM simulation software can deliver similar results to those obtained via commercial software in the simulation of microstrip antennas when the spatial resolution of the simulations is sufficiently fine.

[399] APFL: Analytic Personalized Federated Learning via Dual-Stream Least Squares

Kejia Fan, Jianheng Tang, Zhirui Yang, Feijiang Han, Jiaxu Li, Run He, Yajiang Huang, Anfeng Liu, Houbing Herbert Song, Yunhuai Liu, Huiping Zhuang

Main category: cs.LG

TL;DR: Proposes APFL, a dual-stream least squares method for Personalized Federated Learning (PFL), addressing non-IID data challenges with global and local streams for generalization and personalization.

DetailsMotivation: Existing PFL methods struggle with non-IID data, harming generalization and personalization.

Method: Uses a frozen backbone for feature extraction and dual-stream analytic models (shared primary for global generalization, refinement for local personalization).

Result: APFL achieves heterogeneity invariance and outperforms baselines by 1.10%-15.45% in accuracy.

Conclusion: APFL effectively addresses non-IID data issues in PFL, offering robust generalization and personalization.

Abstract: Personalized Federated Learning (PFL) has presented a significant challenge to deliver personalized models to individual clients through collaborative training. Existing PFL methods are often vulnerable to non-IID data, which severely hinders collective generalization and then compromises the subsequent personalization efforts. In this paper, to address this non-IID issue in PFL, we propose an Analytic Personalized Federated Learning (APFL) approach via dual-stream least squares. In our APFL, we use a foundation model as a frozen backbone for feature extraction. Subsequent to the feature extractor, we develop dual-stream analytic models to achieve both collective generalization and individual personalization. Specifically, our APFL incorporates a shared primary stream for global generalization across all clients, and a dedicated refinement stream for local personalization of each individual client. The analytical solutions of our APFL enable its ideal property of heterogeneity invariance, theoretically meaning that each personalized model remains identical regardless of how heterogeneous the data are distributed across all other clients. Empirical results across various datasets also validate the superiority of our APFL over state-of-the-art baselines, with advantages of at least 1.10%-15.45% in accuracy.
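
The analytic ingredient is closed-form regularized least squares on frozen-backbone features, solved in one shot rather than by gradient descent; APFL composes two such streams. A minimal single-stream sketch, where shapes and the regularizer lambda are illustrative:

```python
# Closed-form (analytic) least-squares head on frozen-backbone features.
import numpy as np

def analytic_head(features, targets, lam=1e-2):
    """W = (F^T F + lam*I)^(-1) F^T Y  -- ridge regression, no iteration."""
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)

rng = np.random.default_rng(0)
F = rng.normal(size=(200, 32))          # backbone features for one client
Y = np.eye(4)[rng.integers(0, 4, 200)]  # one-hot labels, 4 classes
W = analytic_head(F, Y)
print(W.shape)                          # (32, 4): solved in one shot
```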

[400] Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, Guang Shi

Main category: cs.LG

TL;DR: RLVR with Pass@1 rewards struggles with exploration-exploitation balance. Using Pass@k as reward improves exploration, and analytical solutions show mutual enhancement of exploration and exploitation. Advantage design for RLVR is promising.

DetailsMotivation: Address the issue of conservative policies and local optima in RLVR by exploring Pass@k as a reward metric and its impact on exploration.

Method: Train policy models using Pass@k as reward (Pass@k Training), derive analytical solutions for its advantage, and explore advantage design for RLVR.

Result: Pass@k Training enhances exploration, and exploration-exploitation can mutually benefit each other. Advantage design shows promising results.

Conclusion: Pass@k Training and advantage design offer a potential direction for improving RLVR by balancing exploration and exploitation.

Abstract: Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, has faced issues in balancing exploration and exploitation, causing policies to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Although Pass@k has been used in prior evaluations, its connection to LLM exploration ability in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training), and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k Training with analytical derivation essentially involves directly designing the advantage function. Inspired by this, we preliminarily explore advantage design for RLVR, showing promising results and highlighting a potential future direction.
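
The reward itself can be computed with the standard unbiased pass@k estimator over a group of n sampled rollouts with c verified successes (Chen et al., 2021); using that quantity as the training reward is the idea summarized above, while the paper's analytical advantage derivation is not reproduced here.

```python
# Unbiased pass@k estimator over n samples with c verified successes.
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k draws (without replacement)
    from n samples, of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=1, k=1))   # 0.0625 -- Pass@1 reward is sparse
print(pass_at_k(n=16, c=1, k=8))   # 0.5    -- Pass@k rewards exploration
```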

[401] Natively Trainable Sparse Attention for Hierarchical Point Cloud Datasets

Nicolas Lapautre, Maria Marchenko, Carlos Miguel Patiño, Xin Zhou

Main category: cs.LG

TL;DR: Combining Erwin architecture with Native Sparse Attention (NSA) improves transformer efficiency for large-scale physical systems, matching or exceeding original Erwin performance.

DetailsMotivation: Overcoming quadratic scaling of attention mechanisms in transformers for large physical systems.

Method: Adapt NSA for non-sequential data, implement Erwin NSA model, and evaluate on cosmology, molecular dynamics, and air pressure datasets.

Result: Achieves performance matching or exceeding original Erwin model; reproduces Erwin paper results for validation.

Conclusion: Erwin NSA model effectively addresses quadratic attention complexity, enhancing transformer efficiency for large-scale physical systems.

Abstract: Unlocking the potential of transformers on datasets of large physical systems depends on overcoming the quadratic scaling of the attention mechanism. This work explores combining the Erwin architecture with the Native Sparse Attention (NSA) mechanism to improve the efficiency and receptive field of transformer models for large-scale physical systems, addressing the challenge of quadratic attention complexity. We adapt the NSA mechanism for non-sequential data, implement the Erwin NSA model, and evaluate it on three datasets from the physical sciences – cosmology simulations, molecular dynamics, and air pressure modeling – achieving performance that matches or exceeds that of the original Erwin model. Additionally, we reproduce the experimental results from the Erwin paper to validate their implementation.

[402] IBEX: Information-Bottleneck-EXplored Coarse-to-Fine Molecular Generation under Limited Data

Dong Xu, Zhangfan Yang, Jenna Xinyi Yao, Shuangbao Song, Zexuan Zhu, Junkai Ji

Main category: cs.LG

TL;DR: IBEX is a coarse-to-fine pipeline using information-bottleneck theory to improve structure-based drug design, achieving better docking success, Vina scores, and molecular quality.

DetailsMotivation: Addressing the scarcity of protein-ligand complex data and overfitting issues in existing pipelines.

Method: Uses PAC-Bayesian information-bottleneck theory to analyze sample information density, retains TargetDiff architecture, and refines conformations with L-BFGS optimization.

Result: Improves docking success rate from 53% to 64%, enhances Vina scores, QED by 25%, and achieves state-of-the-art validity and diversity.

Conclusion: IBEX effectively tackles data scarcity and improves performance in structure-based drug design.

Abstract: Three-dimensional generative models increasingly drive structure-based drug discovery, yet the field remains constrained by the scarcity of publicly available protein-ligand complexes. Under such data scarcity, almost all existing pipelines struggle to learn transferable geometric priors and consequently overfit to training-set biases. As such, we present IBEX, an Information-Bottleneck-EXplored coarse-to-fine pipeline to tackle the chronic shortage of protein-ligand complex data in structure-based drug design. Specifically, we use PAC-Bayesian information-bottleneck theory to quantify the information density of each sample. This analysis reveals how different masking strategies affect generalization and indicates that, compared with conventional de novo generation, the constrained Scaffold Hopping task endows the model with greater effective capacity and improved transfer performance. IBEX retains the original TargetDiff architecture and hyperparameters for training to generate molecules compatible with the binding pocket; it then applies an L-BFGS optimization step to finely refine each conformation by optimizing five physics-based terms and adjusting six translational and rotational degrees of freedom in under one second. With only these modifications, IBEX raises the zero-shot docking success rate on CBGBench CrossDocked2020-based from 53% to 64%, improves the mean Vina score from $-7.41$ kcal mol$^{-1}$ to $-8.07$ kcal mol$^{-1}$, and achieves the best median Vina energy in 57 of 100 pockets versus 3 for the original TargetDiff. IBEX also increases the QED by 25%, achieves state-of-the-art validity and diversity, and markedly reduces extrapolation error.

[403] Enhancing Fairness in Autoencoders for Node-Level Graph Anomaly Detection

Shouju Wang, Yuchen Song, Sheng’en Li, Dongmian Zou

Main category: cs.LG

TL;DR: DECAF-GAD is a framework addressing fairness in autoencoder-based graph anomaly detection (GAD) by disentangling sensitive attributes and enhancing fairness without compromising performance.

DetailsMotivation: Fairness in GAD is underexplored, and GNN-based models can amplify biases. Existing fair GNNs focus on node classification, not autoencoder-based GAD.

Method: Proposes DECAF-GAD using a structural causal model (SCM) to disentangle sensitive attributes, a specialized autoencoder, and a fairness-guided loss function.

Result: DECAF-GAD achieves competitive GAD performance and significantly improves fairness metrics on synthetic and real-world datasets.

Conclusion: DECAF-GAD effectively balances fairness and performance in GAD, addressing a critical gap in the field.

Abstract: Graph anomaly detection (GAD) has become an increasingly important task across various domains. With the rapid development of graph neural networks (GNNs), GAD methods have achieved significant performance improvements. However, fairness considerations in GAD remain largely underexplored. Indeed, GNN-based GAD models can inherit and amplify biases present in training data, potentially leading to unfair outcomes. While existing efforts have focused on developing fair GNNs, most approaches target node classification tasks, where models often rely on simple layer architectures rather than autoencoder-based structures, which are the most widely used architectures for anomaly detection. To address fairness in autoencoder-based GAD models, we propose DisEntangled Counterfactual Adversarial Fair (DECAF)-GAD, a framework that alleviates bias while preserving GAD performance. Specifically, we introduce a structural causal model (SCM) to disentangle sensitive attributes from learned representations. Based on this causal framework, we formulate a specialized autoencoder architecture along with a fairness-guided loss function. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that DECAF-GAD not only achieves competitive anomaly detection performance but also significantly enhances fairness metrics compared to baseline GAD methods. Our code is available at https://github.com/Tlhey/decaf_code.

[404] Non-Stationary Restless Multi-Armed Bandits with Provable Guarantee

Yu-Heng Hung, Ping-Chun Hsieh, Kai Wang

Main category: cs.LG

TL;DR: The paper proposes an algorithm for non-stationary restless multi-armed bandits (RMABs) with bounded variation budgets, achieving a sublinear regret bound.

DetailsMotivation: Traditional RMAB algorithms assume stationary dynamics, which fail in real-world applications like healthcare and recommendation systems due to non-stationarity.

Method: The rmab algorithm combines sliding window reinforcement learning with an upper confidence bound (UCB) mechanism to learn and adapt to non-stationary transitions.

Result: The algorithm achieves a regret bound of $\widetilde{\mathcal{O}}(N^2 B^{1/4} T^{3/4})$, providing a theoretical foundation for non-stationary RMABs.

Conclusion: The work introduces a novel approach for non-stationary RMABs, offering both practical and theoretical advancements.

Abstract: Online restless multi-armed bandits (RMABs) typically assume that each arm follows a stationary Markov Decision Process (MDP) with fixed state transitions and rewards. However, in real-world applications like healthcare and recommendation systems, these assumptions often break due to non-stationary dynamics, posing significant challenges for traditional RMAB algorithms. In this work, we specifically consider $N$-armed RMABs with non-stationary transitions constrained by a bounded variation budget $B$. Our proposed algorithm integrates sliding window reinforcement learning (RL) with an upper confidence bound (UCB) mechanism to simultaneously learn transition dynamics and their variations. We further establish that it achieves a $\widetilde{\mathcal{O}}(N^2 B^{\frac{1}{4}} T^{\frac{3}{4}})$ regret bound by leveraging a relaxed definition of regret, providing a foundational theoretical framework for non-stationary RMAB problems for the first time.
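
The sliding-window ingredient can be illustrated generically: estimate a drifting quantity from only the most recent observations and add a confidence bonus. The window size, bonus constant, and scalar setting below are placeholders; the actual algorithm estimates transition dynamics, not scalar rewards.

```python
# Generic sliding-window UCB estimate of a drifting quantity.
import numpy as np

def sw_ucb(observations, t, window, delta=0.05):
    """Mean of the last `window` observations plus a confidence bonus."""
    recent = observations[max(0, t - window):t]
    n = len(recent)
    bonus = np.sqrt(np.log(1.0 / delta) / (2 * n))   # Hoeffding-style bonus
    return np.mean(recent) + bonus

obs = list(np.linspace(0.2, 0.8, 100))   # reward slowly drifts upward
print(sw_ucb(obs, t=100, window=20))     # tracks the recent (higher) mean
print(sw_ucb(obs, t=100, window=100))    # full history lags behind the drift
```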

[405] Comparison of Data Reduction Criteria for Online Gaussian Processes

Thore Wietzke, Knut Graichen

Main category: cs.LG

TL;DR: This paper compares reduction criteria for Online Gaussian Processes (GPs) to handle streaming data efficiently, evaluating their computational complexity and behavior, and provides guidelines for selecting criteria.

DetailsMotivation: The computational complexity of GPs limits their use in streaming scenarios with accumulating data. Online GPs address this by managing datapoint budgets, but criteria for reducing redundancy need evaluation.

Method: The study compares multiple reduction criteria for Online GPs, analyzing their computational complexity and reduction behavior. It tests these on benchmark functions and real-world datasets, including dynamic system identification.

Result: The evaluation provides insights into the performance of reduction criteria, and proposes acceptance criteria to further filter redundant datapoints.

Conclusion: The paper offers practical guidelines for selecting reduction criteria in Online GP algorithms, enhancing their applicability in streaming scenarios.

Abstract: Gaussian Processes (GPs) are widely used for regression and system identification due to their flexibility and ability to quantify uncertainty. However, their computational complexity limits their applicability to small datasets. Moreover, in a streaming scenario, more and more datapoints accumulate, which is intractable even for Sparse GPs. Online GPs aim to alleviate this problem by, e.g., defining a maximum budget of datapoints and removing redundant datapoints. This work provides a unified comparison of several reduction criteria, analyzing both their computational complexity and reduction behavior. The criteria are evaluated on benchmark functions and real-world datasets, including dynamic system identification tasks. Additionally, acceptance criteria are proposed to further filter out redundant datapoints. This work yields practical guidelines for choosing a suitable criterion for an online GP algorithm.

[406] Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Technical Solutions

Parsa Omidi, Xingshuai Huang, Axel Laborieux, Bahareh Nikpour, Tianyu Shi, Armaghan Eshaghi

Main category: cs.LG

TL;DR: The paper reviews Memory-Augmented Transformers, bridging neuroscience and AI to address limitations in long-range context retention, continual learning, and knowledge integration.

DetailsMotivation: To overcome critical limitations of Transformer architectures in memory-related tasks by integrating neuroscience principles with engineering advances.

Method: A unified framework organizing progress via functional objectives, memory representations, and integration mechanisms, analyzing core memory operations.

Result: Identifies a shift toward adaptive, test-time learning systems and highlights challenges like scalability and interference, with emerging solutions.

Conclusion: Provides a roadmap for developing cognitively-inspired, lifelong-learning Transformer architectures.

Abstract: Memory is fundamental to intelligence, enabling learning, reasoning, and adaptability across biological and artificial systems. While Transformer architectures excel at sequence modeling, they face critical limitations in long-range context retention, continual learning, and knowledge integration. This review presents a unified framework bridging neuroscience principles, including dynamic multi-timescale memory, selective attention, and consolidation, with engineering advances in Memory-Augmented Transformers. We organize recent progress through three taxonomic dimensions: functional objectives (context extension, reasoning, knowledge integration, adaptation), memory representations (parameter-encoded, state-based, explicit, hybrid), and integration mechanisms (attention fusion, gated control, associative retrieval). Our analysis of core memory operations (reading, writing, forgetting, and capacity management) reveals a shift from static caches toward adaptive, test-time learning systems. We identify persistent challenges in scalability and interference, alongside emerging solutions including hierarchical buffering and surprise-gated updates. This synthesis provides a roadmap toward cognitively-inspired, lifelong-learning Transformer architectures.

[407] SoK: Data Minimization in Machine Learning

Robin Staab, Nikola Jovanović, Kimberly Mai, Prakhar Ganesh, Martin Vechev, Ferdinando Fioretto, Matthew Jagielski

Main category: cs.LG

TL;DR: The paper introduces a framework for Data Minimization in Machine Learning (DMML) to unify and clarify existing research, aiding practitioners in applying DM principles effectively.

DetailsMotivation: Data minimization is a key principle in regulations like GDPR, but its application in ML is fragmented, causing confusion among practitioners.

Method: The authors propose a comprehensive DMML framework, including a unified data pipeline, adversaries, and minimization points, and review related literature.

Result: The framework provides a structured overview, enabling better understanding and adoption of DM strategies in AI/ML.

Conclusion: The work bridges gaps in DMML research, promoting unified understanding and practical implementation of data minimization in ML.

Abstract: Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, our work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization. This framework allows us to systematically review the literature on data minimization and DM-adjacent methodologies, for the first time presenting a structured overview designed to help practitioners and researchers effectively apply DM principles. Our work facilitates a unified DM-centric understanding and broader adoption of data minimization strategies in AI/ML.

[408] Efficiently Verifiable Proofs of Data Attribution

Ari Karchmer, Seth Neel, Martin Pawelczyk

Main category: cs.LG

TL;DR: The paper proposes an interactive verification protocol for data attribution to address trust issues, ensuring efficiency and correctness with formal guarantees.

DetailsMotivation: To solve the trust issue in data attribution methods, where computational costs limit accessibility and verification for resource-constrained parties.

Method: An interactive proof protocol between a Prover and Verifier, ensuring PAC verification with formal guarantees on completeness, soundness, and efficiency.

Result: The protocol ensures Verifier workload scales independently of dataset size, providing PAC guarantees for data attributions.

Conclusion: The proposed method offers a scalable and trustworthy solution for verifying data attributions, applicable to various tasks.

Abstract: Data attribution methods aim to answer useful counterfactual questions like “what would an ML model’s prediction be if it were trained on a different dataset?” However, estimation of data attribution models through techniques like empirical influence or “datamodeling” remains very computationally expensive. This causes a critical trust issue: if only a few computationally rich parties can obtain data attributions, how can resource-constrained parties trust that the provided attributions are indeed “good,” especially when they are used for important downstream applications (e.g., data pricing)? In this paper, we address this trust issue by proposing an interactive verification paradigm for data attribution. An untrusted and computationally powerful Prover learns data attributions, and then engages in an interactive proof with a resource-constrained Verifier. Our main result is a protocol that provides formal completeness, soundness, and efficiency guarantees in the sense of Probably-Approximately-Correct (PAC) verification. Specifically, if both Prover and Verifier follow the protocol, the Verifier accepts data attributions that are $\epsilon$-close to the optimal data attributions (in terms of the Mean Squared Error) with probability $1-\delta$. Conversely, if the Prover arbitrarily deviates from the protocol, even with infinite compute, then this is detected (or it still yields data attributions to the Verifier) except with probability $\delta$. Importantly, our protocol ensures the Verifier’s workload, measured by the number of independent model retrainings it must perform, scales only as $O(1/\epsilon)$; i.e., independently of the dataset size. At a technical level, our results apply to efficiently verifying any linear function over the boolean hypercube computed by the Prover, making them broadly applicable to various attribution tasks.
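
The paper's protocol is interactive; the toy sketch below only conveys the flavor of PAC verification for a linear datamodel: the Verifier performs a number of audit retrainings that depends on $\epsilon$ and $\delta$ but not on the dataset size, and accepts if the claimed attributions' linear predictions match the retrained models in MSE. The function names and the audit-count formula are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def spot_check_attributions(tau, train_fn, eval_fn, n, eps, delta, seed=0):
    """Toy PAC-style audit of a claimed linear datamodel `tau` (shape (n,)).
    train_fn(mask) retrains a model on the masked training subset;
    eval_fn(model) returns its output on the fixed target example."""
    rng = np.random.default_rng(seed)
    m = int(np.ceil(np.log(1.0 / delta) / eps))     # illustrative audit count, O(1/eps)
    errs = []
    for _ in range(m):
        mask = (rng.random(n) < 0.5).astype(float)  # random training subset
        y_true = eval_fn(train_fn(mask))            # counterfactual ground truth
        y_pred = tau @ mask                         # datamodel's linear prediction
        errs.append((y_true - y_pred) ** 2)
    return float(np.mean(errs)) <= eps              # accept iff empirical MSE is small
```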

[409] A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design

Haydn Thomas Jones, Natalie Maus, Josh Magnus Ludan, Maggie Ziyu Huan, Jiaming Liang, Marcelo Der Torossian Torres, Jiatao Liang, Zachary Ives, Yoseph Barash, Cesar de la Fuente-Nunez, Jacob R. Gardner, Mark Yatskar

Main category: cs.LG

TL;DR: The paper introduces Medex, a dataset of priors for AI-driven therapeutic design, extracted from the literature using LLM pipelines. It improves model performance and safety in molecule proposals.

DetailsMotivation: AI-driven discovery lacks experimental priors, leading to unsafe molecule proposals (e.g., mutagenic risks). The dataset addresses this gap by providing structured priors from literature.

Method: Constructed a dataset of 32.3M natural-language facts paired with entity representations using LLM pipelines. Trained LLM, CLIP, and LLaVA models to jointly reason about text and design targets.

Result: Models pretrained with Medex outperformed larger models (e.g., 2B TxGemma) on TDC tasks and improved safety in GuacaMol molecule proposals.

Conclusion: Medex enhances AI-driven therapeutic design by providing strong priors, improving performance and safety. The dataset is publicly available and will expand as the literature grows.

Abstract: AI-driven discovery can greatly reduce design time and enhance new therapeutics’ effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60% of molecules proposed had high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts and appropriate entity representations (i.e., SMILES or RefSeq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLaVA architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutics Data Commons (TDC). Medex is highly effective for creating models with strong priors: in supervised prediction problems that use our data as pretraining, our best models with 15M learnable parameters outperform the larger 2B TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with Medex can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset at https://huggingface.co/datasets/medexanon/Medex, and will provide expanded versions as available literature grows.
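
Since the abstract links the dataset on the Hugging Face Hub, a quick way to inspect it is to stream a few records with the datasets library. The split name and field layout are not specified in the abstract, so treat them as assumptions until inspected.

```python
from itertools import islice
from datasets import load_dataset

# Stream the Medex dataset named in the abstract without downloading it all.
# The "train" split name is an assumption.
ds = load_dataset("medexanon/Medex", split="train", streaming=True)
for record in islice(ds, 3):
    print(record)  # expected: a natural-language fact paired with an entity
                   # representation (e.g., a SMILES string or RefSeq ID)
```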

[410] An Explainable Transformer-based Model for Phishing Email Detection: A Large Language Model Approach

Mohammad Amaz Uddin, Md Mahiuddin, Iqbal H. Sarker

Main category: cs.LG

TL;DR: A fine-tuned DistilBERT model is proposed for phishing email detection, achieving high accuracy and using XAI techniques for interpretability.

DetailsMotivation: Phishing emails are a persistent cyber threat, and advanced detection methods are needed due to increasing sophistication of attacks.

Method: An optimized, fine-tuned DistilBERT model is used, with preprocessing for data cleaning and class imbalance. XAI techniques (LIME, Transformer Interpret) explain predictions.

Result: The model achieves high accuracy in detecting phishing emails.

Conclusion: The fine-tuned DistilBERT model, combined with XAI, offers an effective and interpretable solution for phishing email detection.

Abstract: Phishing email is a serious cyber threat that tries to deceive users by sending false emails with the intention of stealing confidential information or causing financial harm. Attackers, often posing as trustworthy entities, exploit technological advancements and sophistication to make detection and prevention of phishing more challenging. Despite extensive academic research, phishing detection remains an ongoing and formidable challenge in the cybersecurity landscape. Large Language Models (LLMs) and Masked Language Models (MLMs) possess immense potential to offer innovative solutions to address long-standing challenges. In this research paper, we present an optimized, fine-tuned transformer-based DistilBERT model designed for the detection of phishing emails. In the detection process, we work with a phishing email dataset and utilize the preprocessing techniques to clean and solve the imbalance class issues. Through our experiments, we found that our model effectively achieves high accuracy, demonstrating its capability to perform well. Finally, we demonstrate our fine-tuned model using Explainable-AI (XAI) techniques such as Local Interpretable Model-Agnostic Explanations (LIME) and Transformer Interpret to explain how our model makes predictions in the context of text classification for phishing emails.
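
A minimal sketch of the interpretability step, assuming a DistilBERT checkpoint already fine-tuned for phishing detection is available locally (the MODEL_DIR path and class names below are hypothetical): LIME perturbs the email text and fits a local surrogate over the classifier's probabilities.

```python
import numpy as np
from transformers import pipeline
from lime.lime_text import LimeTextExplainer

MODEL_DIR = "./distilbert-phishing"  # hypothetical fine-tuned checkpoint
clf = pipeline("text-classification", model=MODEL_DIR, return_all_scores=True)

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) array of class probabilities.
    return np.array([[s["score"] for s in out] for out in clf(list(texts))])

explainer = LimeTextExplainer(class_names=["legitimate", "phishing"])
email = "Your account is locked. Click here to verify your password."
exp = explainer.explain_instance(email, predict_proba, num_features=8)
print(exp.as_list())  # tokens ranked by contribution to the prediction
```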

[411] Neural Networks Generalize on Low Complexity Data

Sourav Chatterjee, Timothy Sudijono

Main category: cs.LG

TL;DR: Feedforward neural networks with ReLU activation generalize on low-complexity data, defined via a simple programming language and MDL. The interpolating MDL network achieves high accuracy on tasks like primality testing, even without explicit design for the task.

DetailsMotivation: To demonstrate that neural networks can generalize well on low-complexity data when trained using MDL principles, even for tasks they weren't explicitly designed for.

Method: Define a simple programming language and MDL for networks. Train interpolating MDL networks on i.i.d. data from this language and evaluate generalization, e.g., on primality testing.

Result: The MDL network generalizes with high probability; e.g., it correctly predicts primality with probability 1-O((ln N)/n).

Conclusion: MDL-based neural networks can generalize effectively on low-complexity tasks, even in noisy settings, suggesting tempered overfitting.

Abstract: We show that feedforward neural networks with ReLU activation generalize on low complexity data, suitably defined. Given i.i.d.~data generated from a simple programming language, the minimum description length (MDL) feedforward neural network which interpolates the data generalizes with high probability. We define this simple programming language, along with a notion of description length of such networks. We provide several examples on basic computational tasks, such as checking primality of a natural number. For primality testing, our theorem shows the following and more. Suppose that we draw an i.i.d.~sample of $n$ numbers uniformly at random from $1$ to $N$. For each number $x_i$, let $y_i = 1$ if $x_i$ is a prime and $0$ if it is not. Then, the interpolating MDL network accurately answers, with probability $1 - O((\ln N)/n)$, whether a newly drawn number between $1$ and $N$ is a prime or not. Note that the network is not designed to detect primes; minimum description learning discovers a network which does so. Extensions to noisy data are also discussed, suggesting that MDL neural network interpolators can demonstrate tempered overfitting.

[412] Diversifying Policy Behaviors with Extrinsic Behavioral Curiosity

Zhenglin Wan, Xingrui Yu, David Mark Bossens, Yueming Lyu, Qing Guo, Flint Xiaofeng Fan, Yew Soon Ong, Ivor Tsang

Main category: cs.LG

TL;DR: QD-IRL integrates quality-diversity optimization with IRL to learn diverse behaviors from limited demonstrations, enhanced by Extrinsic Behavioral Curiosity (EBC) for novelty-driven rewards, significantly improving performance in robot locomotion tasks.

DetailsMotivation: Overcome the limitation of single-policy learning in imitation learning by enabling diverse behavior acquisition for robustness in unpredictable scenarios.

Method: Introduces QD-IRL with EBC, which uses an external critic to reward novel behaviors, evaluated on robot locomotion tasks.

Result: EBC boosts QD-IRL performance by up to 185%, surpassing expert performance by 20% in Humanoid, and enhances QD-RL algorithms.

Conclusion: EBC is a generic, effective technique for learning diverse policies, validated across multiple environments and algorithms.

Abstract: Imitation learning (IL) has shown promise in various applications (e.g. robot locomotion) but is often limited to learning a single expert policy, constraining behavior diversity and robustness in unpredictable real-world scenarios. To address this, we introduce Quality Diversity Inverse Reinforcement Learning (QD-IRL), a novel framework that integrates quality-diversity optimization with IRL methods, enabling agents to learn diverse behaviors from limited demonstrations. This work introduces Extrinsic Behavioral Curiosity (EBC), which allows agents to receive additional curiosity rewards from an external critic based on how novel the behaviors are with respect to a large behavioral archive. To validate the effectiveness of EBC in exploring diverse locomotion behaviors, we evaluate our method on multiple robot locomotion tasks. EBC improves the performance of QD-IRL instances with GAIL, VAIL, and DiffAIL across all included environments by up to 185%, 42%, and 150%, even surpassing expert performance by 20% in Humanoid. Furthermore, we demonstrate that EBC is applicable to Gradient-Arborescence-based Quality Diversity Reinforcement Learning (QD-RL) algorithms, where it substantially improves performance and provides a generic technique for learning behavioral-diverse policies. The source code of this work is provided at https://github.com/vanzll/EBC.
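
The paper's external critic scores behaviors by novelty against a large behavioral archive. A common way to instantiate such a score, shown below as an assumption rather than EBC's exact critic, is the mean distance from a behavior descriptor to its k nearest archive entries.

```python
import numpy as np

def novelty_reward(behavior, archive, k=5):
    """Curiosity-style novelty score: mean Euclidean distance from `behavior`
    (a 1-D descriptor vector) to its k nearest neighbors in `archive`
    (an array of previously seen descriptors). Higher = more novel."""
    if len(archive) == 0:
        return 1.0  # everything is novel against an empty archive
    dists = np.linalg.norm(np.asarray(archive) - behavior, axis=1)
    k = min(k, len(dists))
    return float(np.sort(dists)[:k].mean())
```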

[413] DiRW: Path-Aware Digraph Learning for Heterophily

Daohan Su, Xunkai Li, Zhenjun Li, Yinping Liao, Rong-Hua Li, Guoren Wang

Main category: cs.LG

TL;DR: DiRW is a plug-and-play strategy for directed graph neural networks (DiGNNs), improving efficiency and performance by using a direction-aware path sampler and node-wise aggregator.

DetailsMotivation: Existing DiGNNs are limited by complex mechanisms and reliance on high-quality topology, leading to inefficiency and unstable performance.

Method: DiRW employs a direction-aware path sampler optimized for walk probability, length, and number, and a node-wise learnable path aggregator.

Result: DiRW enhances spatial-based DiGNNs and achieves state-of-the-art performance on 9 datasets.

Conclusion: DiRW offers a scalable and effective solution for learning from directed graphs, with demonstrated improvements in performance and efficiency.

Abstract: Recently, graph neural network (GNN) has emerged as a powerful representation learning tool for graph-structured data. However, most approaches are tailored for undirected graphs, neglecting the abundant information in the edges of directed graphs (digraphs). In fact, digraphs are widely applied in the real world and confirmed to address heterophily challenges. Despite recent advancements, existing spatial- and spectral-based DiGNNs have limitations due to their complex learning mechanisms and reliance on high-quality topology, resulting in low efficiency and unstable performance. To address these issues, we propose Directed Random Walk (DiRW), a plug-and-play strategy for most spatial-based DiGNNs and also an innovative model which offers a new digraph learning paradigm. Specifically, it utilizes a direction-aware path sampler optimized from the perspectives of walk probability, length, and number in a weight-free manner by considering node profiles and topologies. Building upon this, DiRW incorporates a node-wise learnable path aggregator for generalized node representations. Extensive experiments on 9 datasets demonstrate that DiRW: (1) enhances most spatial-based methods as a plug-and-play strategy; (2) achieves SOTA performance as a new digraph learning paradigm. The source code and data are available at https://github.com/dhsiuu/DiRW.
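
As a sketch of what a direction-aware walk can look like (DiRW additionally optimizes walk probability, length, and number, which this toy version fixes as constants), the sampler below follows out-edges with probability p_out and in-edges otherwise, so a path can traverse both edge directions of the digraph.

```python
import random

def direction_aware_walk(adj_out, adj_in, start, length=10, p_out=0.7):
    """Toy direction-aware random walk on a digraph.
    adj_out / adj_in map each node to lists of out- and in-neighbors."""
    path, node = [start], start
    for _ in range(length):
        use_out = random.random() < p_out
        nbrs = adj_out.get(node, []) if use_out else adj_in.get(node, [])
        if not nbrs:  # dead end in the chosen direction: try the other one
            nbrs = adj_in.get(node, []) if use_out else adj_out.get(node, [])
        if not nbrs:  # isolated node: stop the walk
            break
        node = random.choice(nbrs)
        path.append(node)
    return path
```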

[414] Multi-objective Optimization in CPU Design Space Exploration: Attention is All You Need

Runzhen Xue, Hao Wu, Mingyu Yan, Ziheng Xiao, Guangyu Sun, Xiaochun Ye, Dongrui Fan

Main category: cs.LG

TL;DR: AttentionDSE is an end-to-end DSE framework using attention-based neural architecture for performance prediction and design guidance, addressing scalability, efficiency, and interpretability challenges in CPU design.

DetailsMotivation: Current DSE frameworks suffer from poor scalability, inefficient acquisition, and limited interpretability in high-dimensional CPU design spaces.

Method: AttentionDSE integrates performance prediction and design guidance via an attention-based neural architecture, using Perception-Driven Attention and Attention-aware Bottleneck Analysis.

Result: Achieves 3.9% higher Pareto Hypervolume and 80% reduction in exploration time on SPEC CPU2017 benchmarks.

Conclusion: AttentionDSE offers a scalable, efficient, and interpretable solution for modern CPU design space exploration.

Abstract: Design Space Exploration (DSE) is essential to modern CPU design, yet current frameworks struggle to scale and generalize in high-dimensional architectural spaces. As the dimensionality of design spaces continues to grow, existing DSE frameworks face three fundamental challenges: (1) reduced accuracy and poor scalability of surrogate models in large design spaces; (2) inefficient acquisition guided by hand-crafted heuristics or exhaustive search; (3) limited interpretability, making it hard to pinpoint architectural bottlenecks. In this work, we present \textbf{AttentionDSE}, the first end-to-end DSE framework that \emph{natively integrates} performance prediction and design guidance through an attention-based neural architecture. Unlike traditional DSE workflows that separate surrogate modeling from acquisition and rely heavily on hand-crafted heuristics, AttentionDSE establishes a unified, learning-driven optimization loop, in which attention weights serve a dual role: enabling accurate performance estimation and simultaneously exposing the performance bottleneck. This paradigm shift elevates attention from a passive representation mechanism to an active, interpretable driver of design decision-making. Key innovations include: (1) a \textbf{Perception-Driven Attention} mechanism that exploits architectural hierarchy and locality, scaling attention complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ via sliding windows; (2) an \textbf{Attention-aware Bottleneck Analysis} that automatically surfaces critical parameters for targeted optimization, eliminating the need for domain-specific heuristics. Evaluated on high-dimensional CPU design space using the SPEC CPU2017 benchmark suite, AttentionDSE achieves up to \textbf{3.9% higher Pareto Hypervolume} and over \textbf{80% reduction in exploration time} compared to state-of-the-art baselines.
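
The complexity claim in the abstract, attention reduced from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ via sliding windows, can be illustrated in a few lines: each query attends only to a fixed-size local window of keys, so total work is linear in sequence length for a fixed window. This sketches the windowing idea only, not AttentionDSE's Perception-Driven Attention itself.

```python
import torch

def sliding_window_attention(q, k, v, window=8):
    """Each query attends to at most 2*window+1 keys, giving O(n * window)
    total cost instead of O(n^2). q, k, v: (n, d) tensors."""
    n, d = q.shape
    out = torch.empty_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / d ** 0.5
        out[i] = torch.softmax(scores, dim=-1) @ v[lo:hi]
    return out
```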

[415] Grouped Sequency-arranged Rotation: Optimizing Rotation Transformation for Quantization for Free

Euntae Choi, Sumin Song, Woosang Lim, Sungjoo Yoo

Main category: cs.LG

TL;DR: A training-free method improves rotation matrices for low-bit PTQ in LLMs, using Walsh-Hadamard transform and GSR to reduce quantization error and outlier impact, outperforming existing methods.

DetailsMotivation: Address the limitations of existing rotation-based PTQ methods for LLMs at very low bit-widths (e.g., 2-bit) without requiring training.

Method: Leverages Walsh-Hadamard transform with sequency ordering and introduces Grouped Sequency-arranged Rotation (GSR) using block-diagonal matrices to reduce quantization error and isolate outliers.

Result: Significantly improves performance on reasoning tasks and Perplexity (PPL) scores, even when applied over learned rotation techniques.

Conclusion: The proposed training-free approach effectively enhances PTQ for LLMs at low bit-widths, matching optimization-based methods without training.

Abstract: Large Language Models (LLMs) face deployment challenges due to high computational costs, and while Post-Training Quantization (PTQ) offers a solution, existing rotation-based methods struggle at very low bit-widths like 2-bit. We introduce a novel, training-free approach to construct an improved rotation matrix, addressing the limitations of current methods. The key contributions include leveraging the Walsh-Hadamard transform with sequency ordering, which clusters similar frequency components to reduce quantization error compared to standard Hadamard matrices, significantly improving performance. Furthermore, we propose a Grouped Sequency-arranged Rotation (GSR) using block-diagonal matrices with smaller Walsh blocks, effectively isolating outlier impacts and achieving performance comparable to optimization-based methods without requiring any training. Our method demonstrates robust performance on reasoning tasks and in Perplexity (PPL) on WikiText-2, and it enhances results even when applied on top of existing learned rotation techniques.
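
Sequency ordering is standard: sort the rows of a Sylvester Hadamard matrix by their number of sign changes. The sketch below builds such a Walsh matrix and then, as a hypothetical stand-in for GSR, composes smaller normalized Walsh blocks into a block-diagonal rotation; the exact block arrangement in the paper may differ.

```python
import numpy as np
from scipy.linalg import block_diag, hadamard

def sequency_ordered_hadamard(n):
    """Walsh matrix: rows of the Sylvester Hadamard matrix (n a power of 2)
    sorted by sequency, i.e. by their number of sign changes."""
    H = hadamard(n)
    sign_changes = (np.diff(H, axis=1) != 0).sum(axis=1)
    return H[np.argsort(sign_changes)]

def grouped_rotation(dim, block=32):
    """Hypothetical GSR-style rotation: a block-diagonal orthogonal matrix
    made of small sequency-ordered Walsh blocks (dim divisible by block)."""
    W = sequency_ordered_hadamard(block) / np.sqrt(block)  # orthonormal block
    return block_diag(*([W] * (dim // block)))
```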

[416] FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Guangda Liu, Chengwei Li, Zhenyu Ning, Minyi Guo, Jieru Zhao

Main category: cs.LG

TL;DR: FreeKV is a co-optimized framework improving KV retrieval efficiency in LLMs without accuracy loss, achieving 13x speedup.

DetailsMotivation: Long contexts in LLMs create deployment challenges due to KV cache size, with existing methods compromising accuracy or efficiency.

Method: FreeKV combines speculative retrieval and fine-grained correction algorithmically, and hybrid KV layouts with double-buffered recall system-wise.

Result: FreeKV achieves near-lossless accuracy and up to 13x speedup over state-of-the-art KV retrieval methods.

Conclusion: FreeKV effectively addresses KV cache challenges, balancing efficiency and accuracy for long-context LLM deployment.

Abstract: Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods are proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13$\times$ speedup compared to SOTA KV retrieval methods.

[417] Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations

Zaikang Lin, Sei Chang, Aaron Zweig, Minseo Kang, Elham Azizi, David A. Knowles

Main category: cs.LG

TL;DR: PerturbODE is a novel framework using neural ODEs to model cell state trajectories and infer causal gene regulatory networks from large-scale perturbation data, addressing limitations in expressivity, scalability, and dynamic biological processes.

DetailsMotivation: Existing models for inferring gene regulatory networks from perturbation data lack expressivity, scalability, and fail to capture dynamic biological processes like cellular differentiation.

Method: PerturbODE integrates biologically informative neural ODEs to model cell state trajectories under perturbations and derives the causal GRN from the ODE parameters.

Result: PerturbODE shows efficacy in trajectory prediction and GRN inference across simulated and real over-expression datasets.

Conclusion: PerturbODE advances GRN inference by addressing dynamic biological processes and outperforming existing models in scalability and expressivity.

Abstract: Modern high-throughput biological datasets with thousands of perturbations provide the opportunity for large-scale discovery of causal graphs that represent the regulatory interactions between genes. Differentiable causal graphical models have been proposed to infer a gene regulatory network (GRN) from large scale interventional datasets, capturing the causal gene regulatory relationships from genetic perturbations. However, existing models are limited in their expressivity and scalability while failing to address the dynamic nature of biological processes such as cellular differentiation. We propose PerturbODE, a novel framework that incorporates biologically informative neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the causal GRN from the neural ODE’s parameters. We demonstrate PerturbODE’s efficacy in trajectory prediction and GRN inference across simulated and real over-expression datasets.
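
A toy version of the core modeling idea, written with torchdiffeq and offered as a sketch rather than PerturbODE itself: parameterize dx/dt with a layer whose weight matrix can be read off as a candidate gene-gene interaction network after training. The specific functional form below is an assumption.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq

class ToyGRNODE(nn.Module):
    """dx/dt = tanh(x) @ W^T + b; after training, the sign/magnitude of W
    can be interpreted as a candidate regulatory network."""
    def __init__(self, n_genes):
        super().__init__()
        self.W = nn.Parameter(torch.zeros(n_genes, n_genes))
        self.b = nn.Parameter(torch.zeros(n_genes))

    def forward(self, t, x):
        return torch.tanh(x) @ self.W.T + self.b

n_genes = 20
model = ToyGRNODE(n_genes)
x0 = torch.randn(4, n_genes)            # a batch of initial cell states
ts = torch.linspace(0.0, 1.0, 10)
traj = odeint(model, x0, ts)            # (10, 4, n_genes) state trajectories
```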

[418] CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization

Yanxia Deng, Aozhong Zhang, Selcuk Gurses, Naigang Wang, Zi Yang, Penghang Yin

Main category: cs.LG

TL;DR: CLoQ introduces a calibrated LoRA initialization method for quantized LLMs, improving fine-tuning efficiency by minimizing layer-wise discrepancies and outperforming existing methods.

DetailsMotivation: Addressing challenges in applying LoRA to quantized LLMs due to reduced precision, aiming for efficient fine-tuning with limited resources.

Method: Uses a calibration dataset to quantize LLMs and determine optimal LoRA components per layer, supported by a novel theoretical result for closed-form construction.

Result: CLoQ outperforms existing LoRA fine-tuning methods for quantized LLMs, especially at ultra low-bit widths, across tasks like language generation and reasoning.

Conclusion: CLoQ provides a robust initialization strategy for fine-tuning quantized LLMs, enhancing performance and efficiency.

Abstract: Fine-tuning large language models (LLMs) using low-rank adaptation (LoRA) has become a highly efficient approach for downstream tasks, particularly in scenarios with limited computational resources. However, applying LoRA techniques to quantized LLMs poses unique challenges due to the reduced representational precision of quantized weights. In this paper, we introduce CLoQ (Calibrated LoRA initialization for Quantized LLMs), a simple initialization strategy designed to overcome these challenges. Our approach focuses on minimizing the layer-wise discrepancy between the original LLM and its quantized counterpart with LoRA components during initialization. By leveraging a small calibration dataset, CLoQ quantizes a pre-trained LLM and determines the optimal LoRA components for each layer, ensuring a strong foundation for subsequent fine-tuning. A key contribution of this work is a novel theoretical result that enables the accurate and closed-form construction of these optimal LoRA components. We validate the efficacy of CLoQ across multiple tasks such as language generation, arithmetic reasoning, and commonsense reasoning, demonstrating that it consistently outperforms existing LoRA fine-tuning methods for quantized LLMs, especially at ultra low-bit widths.
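
CLoQ's closed form fits the LoRA factors against calibration activations; ignoring that weighting, the simplest residual-based initialization picks A and B as the best rank-r approximation of the quantization error, sketched below as a simplified stand-in for the paper's construction.

```python
import torch

def residual_svd_init(W, W_q, rank):
    """Initialize LoRA factors from the quantization residual: choose A, B
    with A @ B the best rank-`rank` approximation of W - W_q, so that
    W_q + A @ B starts close to the full-precision W. (Unweighted variant;
    CLoQ additionally calibrates the fit with activation data.)"""
    U, S, Vh = torch.linalg.svd(W - W_q, full_matrices=False)
    sqrt_s = S[:rank].sqrt()
    A = U[:, :rank] * sqrt_s           # (out, rank)
    B = sqrt_s[:, None] * Vh[:rank]    # (rank, in)
    return A, B
```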

[419] Improved Regularization and Robustness for Fine-tuning in Neural Networks

Dongyue Li, Hongyang R. Zhang

Main category: cs.LG

TL;DR: The paper proposes a regularized self-labeling method to improve fine-tuning in transfer learning, addressing overfitting and noisy labels, and demonstrates its effectiveness across various datasets.

DetailsMotivation: Fine-tuning pre-trained models on small datasets can lead to overfitting and memorization of noisy labels, necessitating robust regularization methods.

Method: The authors analyze fine-tuning’s generalization properties, propose layer-wise regularization, self-label-correction, and label-reweighting, and validate these on image and text datasets.

Result: The method improves baselines by 1.76% for image classification and 0.75% for few-shot tasks, and by 3.56% in noisy label settings.

Conclusion: Regularized self-labeling effectively enhances fine-tuning robustness and performance, especially in noisy or small-data scenarios.

Abstract: A widely used algorithm for transfer learning is fine-tuning, where a pre-trained model is fine-tuned on a target task with a small amount of labeled data. When the capacity of the pre-trained model is significantly larger than the size of the target dataset, fine-tuning is prone to overfitting and memorizing the training labels. Hence, a crucial question is to regularize fine-tuning and ensure its robustness against noise. To address this question, we begin by analyzing the generalization properties of fine-tuning. We present a PAC-Bayes generalization bound that depends on the distance traveled in each layer during fine-tuning and the noise stability of the fine-tuned model. We empirically measure these quantities. Based on the analysis, we propose regularized self-labeling – the interpolation between regularization and self-labeling methods, including (i) layer-wise regularization to constrain the distance traveled in each layer; (ii) self-label-correction and label-reweighting to correct mislabeled data points (that the model is confident) and reweight less confident data points. We validate our approach on an extensive collection of image and text datasets using multiple pre-trained model architectures. Our approach improves baseline methods by 1.76% (on average) for seven image classification tasks and 0.75% for a few-shot classification task. When the target data set includes noisy labels, our approach outperforms baseline methods by an average of 3.56% in two noisy settings.
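
The layer-wise regularizer described in the abstract is straightforward to write down; the sketch below penalizes the squared distance each layer travels from its pre-trained weights, with an optional per-layer coefficient (variable names are assumptions).

```python
import torch

def layerwise_distance_penalty(model, pretrained_state, coeffs=None):
    """Sum over layers of c_l * ||theta_l - theta_l^pretrained||^2,
    constraining how far each layer moves during fine-tuning."""
    penalty = 0.0
    for name, p in model.named_parameters():
        c = 1.0 if coeffs is None else coeffs.get(name, 1.0)
        penalty = penalty + c * (p - pretrained_state[name]).pow(2).sum()
    return penalty

# inside a training step (hypothetical names):
# loss = task_loss + lam * layerwise_distance_penalty(model, pretrained_state)
```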

[420] Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava

Main category: cs.LG

TL;DR: The paper proposes a probabilistic inference-based approach for scaling LLMs at inference time, outperforming deterministic methods and achieving high accuracy with fewer rollouts.

DetailsMotivation: Diminishing returns from scaling model sizes/data motivate exploring inference-time scaling, but existing methods are prone to reward hacking.

Method: Adapts particle-based Monte Carlo methods to treat inference-time scaling as a probabilistic inference task, exploring the typical set of state distributions.

Result: Empirical results show 4-16x better scaling rates and surpassing GPT-4o accuracy with fewer rollouts.

Conclusion: The method effectively connects probabilistic inference with LLM scaling, offering robustness and efficiency for future work.

Abstract: Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate over our deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method to inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code, videos, and further information available at https://probabilistic-inference-scaling.github.io.
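
The particle-filtering recipe behind the method can be sketched in a few lines: maintain a population of partial rollouts, weight them by an approximate likelihood derived from a reward model, and resample. This is a generic sketch of the idea, with softmax weighting as an assumption rather than the paper's exact scheme.

```python
import numpy as np

def particle_step(particles, extend_fn, reward_fn, rng):
    """One filtering step over partial LLM rollouts.
    extend_fn: appends one reasoning step to a partial rollout.
    reward_fn: scores a partial rollout (an approximate likelihood)."""
    particles = [extend_fn(p) for p in particles]
    w = np.array([reward_fn(p) for p in particles], dtype=float)
    w = np.exp(w - w.max())
    w /= w.sum()                                   # softmax weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx]             # resampled population
```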

[421] Learning to Schedule in Parallel-Server Queues with Stochastic Bilinear Rewards

Jung-hun Kim, Milan Vojnovic

Main category: cs.LG

TL;DR: The paper addresses scheduling in multi-class, parallel-server queuing systems with uncertain rewards, aiming to minimize regret while ensuring system stability. It proposes an algorithm combining weighted proportional fairness and bandit techniques, achieving sub-linear regret and holding costs.

DetailsMotivation: The problem is motivated by resource allocation in network systems, where balancing reward maximization and fair allocation is crucial for stability.

Method: A scheduling algorithm based on weighted proportional fairness and marginal costs, incorporating a bandit algorithm for bilinear rewards.

Result: The algorithm achieves sub-linear regret ($\tilde{O}(\sqrt{T})$) and holding costs, ensuring system stability. Numerical experiments validate its efficiency.

Conclusion: The proposed algorithm effectively balances reward maximization and stability, with theoretical guarantees and practical validation.

Abstract: We consider the problem of scheduling in multi-class, parallel-server queuing systems with uncertain rewards from job-server assignments. In this scenario, jobs incur holding costs while awaiting completion, and job-server assignments yield observable stochastic rewards with unknown mean values. The mean rewards for job-server assignments are assumed to follow a bilinear model with respect to features that characterize jobs and servers. Our objective is to minimize regret by maximizing the cumulative reward of job-server assignments over a time horizon, while keeping the total job holding cost bounded to ensure the stability of the queueing system. This problem is motivated by applications requiring resource allocation in network systems. In this problem, it is essential to control the tradeoff between reward maximization and fair allocation for the stability of the underlying queuing system (i.e., maximizing network throughput). To address this problem, we propose a scheduling algorithm based on a weighted proportional fairness criterion augmented with marginal costs for reward maximization, incorporating a bandit algorithm tailored for bilinear rewards. Our algorithm achieves a sub-linear regret bound and a sub-linear mean holding cost (and queue length bound) of $\tilde{O}(\sqrt{T})$, respectively, with respect to the time horizon $T$, thus guaranteeing queuing system stability. Additionally, we establish stability conditions for distributed iterative algorithms for computing allocations, which are relevant to large-scale system applications. Finally, we demonstrate the efficiency of our algorithm through numerical experiments.

[422] Delayed Feedback Modeling with Influence Functions

Chenlu Ding, Jiancan Wu, Yancheng Yuan, Cunchun Li, Xiang Wang, Dingxian Wang, Frank Yang, Andrew Rabinovich

Main category: cs.LG

TL;DR: IF-DFM improves CVR prediction by addressing delayed feedback with influence functions, avoiding full retraining for efficiency.

DetailsMotivation: Delayed feedback in CPA advertising leads to biased training data; existing solutions are inefficient and inflexible.

Method: IF-DFM uses influence functions to estimate the impact of delayed conversions, reformulating the inverse Hessian-vector product for scalability.

Result: IF-DFM outperforms prior methods in accuracy and adaptability on benchmark datasets.

Conclusion: IF-DFM offers a scalable and effective solution for delayed feedback in CVR prediction.

Abstract: In online advertising under the cost-per-conversion (CPA) model, accurate conversion rate (CVR) prediction is crucial. A major challenge is delayed feedback, where conversions may occur long after user interactions, leading to incomplete recent data and biased model training. Existing solutions partially mitigate this issue but often rely on auxiliary models, making them computationally inefficient and less adaptive to user interest shifts. We propose IF-DFM, an \underline{I}nfluence \underline{F}unction-empowered framework for \underline{D}elayed \underline{F}eedback \underline{M}odeling, which estimates the impact of newly arrived and delayed conversions on model parameters, enabling efficient updates without full retraining. By reformulating the inverse Hessian-vector product as an optimization problem, IF-DFM achieves a favorable trade-off between scalability and effectiveness. Experiments on benchmark datasets show that IF-DFM outperforms prior methods in both accuracy and adaptability.
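
One standard way to recast the inverse Hessian-vector product as an optimization problem, shown here as an assumption since the paper's exact reformulation may differ, is to solve the damped linear system (H + λI)x = v with conjugate gradients, which only needs Hessian-vector products.

```python
import numpy as np

def ihvp_cg(hvp_fn, v, iters=50, damping=1e-3, tol=1e-8):
    """Approximate (H + damping*I)^{-1} @ v via conjugate gradients.
    hvp_fn(x) must return the Hessian-vector product H @ x."""
    x = np.zeros_like(v)
    r = v - (hvp_fn(x) + damping * x)  # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp_fn(p) + damping * p
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # converged
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```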

[423] Unifying Self-Supervised Clustering and Energy-Based Models

Emanuele Sansone, Robin Manhaeve

Main category: cs.LG

TL;DR: The paper connects self-supervised learning and generative models, proposing a unified framework that outperforms existing methods in clustering, generation, and out-of-distribution detection.

DetailsMotivation: To establish a principled connection between self-supervised learning and generative models, leveraging their complementary strengths.

Method: Analysis of self-supervised learning objectives, integration with likelihood-based generative models, and introduction of a lower bound for failure mode penalization.

Result: Outperforms existing methods on SVHN, CIFAR10, and CIFAR100 in clustering, generation, and out-of-distribution detection.

Conclusion: The unified framework successfully integrates discriminative and generative training, with potential applications in neuro-symbolic frameworks.

Abstract: Self-supervised learning excels at learning representations from large amounts of data. At the same time, generative models offer the complementary property of learning information about the underlying data generation process. In this study, we aim at establishing a principled connection between these two paradigms and highlight the benefits of their complementarity. In particular, we perform an analysis of self-supervised learning objectives, elucidating the underlying probabilistic graphical models and presenting a standardized methodology for their derivation from first principles. The analysis suggests a natural means of integrating self-supervised learning with likelihood-based generative models. We instantiate this concept within the realm of cluster-based self-supervised learning and energy models, introducing a lower bound proven to reliably penalize the most important failure modes and unlocking full unification. Our theoretical findings are substantiated through experiments on synthetic and real-world data, including SVHN, CIFAR10, and CIFAR100, demonstrating that our objective function allows to jointly train a backbone network in a discriminative and generative fashion, consequently outperforming existing self-supervised learning strategies in terms of clustering, generation and out-of-distribution detection performance by a wide margin. We also demonstrate that the solution can be integrated into a neuro-symbolic framework to tackle a simple yet non-trivial instantiation of the symbol grounding problem. The code is publicly available at https://github.com/emsansone/GEDI.

[424] Rhythmic sharing: A bio-inspired paradigm for zero-shot adaptive learning in neural networks

Hoony Kang, Wolfgang Losert

Main category: cs.LG

TL;DR: The paper introduces a learning paradigm inspired by neural oscillatory rhythms, enabling rapid adaptation and unsupervised learning in AI networks.

DetailsMotivation: To mimic the brain's ability to adapt quickly and learn from limited data, which current AI struggles with.

Method: Developed a learning paradigm using link strength oscillations, where learning is tied to the coordination of these oscillations.

Result: The network can rapidly adapt to contextual changes and predict dynamics of multiple contexts, including unseen ones.

Conclusion: The paradigm offers a novel approach for rapid adaptive learning in AI, applicable to various neural network models.

Abstract: The brain rapidly adapts to new contexts and learns from limited data, a coveted characteristic that artificial intelligence (AI) algorithms struggle to mimic. Inspired by the mechanical oscillatory rhythms of neural cells, we developed a learning paradigm utilizing link strength oscillations, where learning is associated with the coordination of these oscillations. Link oscillations can rapidly change coordination, allowing the network to sense and adapt to subtle contextual changes without supervision. The network becomes a generalist AI architecture, capable of predicting dynamics of multiple contexts including unseen ones. These results make our paradigm a powerful starting point for novel models of cognition. Because our paradigm is agnostic to specifics of the neural network, our study opens doors for introducing rapid adaptive learning into leading AI models.

[425] iFairy: the First 2-bit Complex LLM with All Parameters in $\{\pm 1, \pm i\}$

Feiyu Wang, Guoan Wang, Yihao Zhang, Shengfan Wang, Weitao Li, Bokai Huang, Shimao Chen, Zihan Jiang, Rui Xu, Tong Yang

Main category: cs.LG

TL;DR: Fairy±i introduces a novel 2-bit quantization framework for complex-valued LLMs, surpassing the accuracy ceiling of full-precision models by leveraging complex domain advantages.

DetailsMotivation: Current QAT research is limited by the accuracy ceiling of full-precision models, and no method has attempted to surpass it.

Method: The framework maps weights to the fourth roots of unity (±1, ±i), enabling multiplication-free inference with additions and swaps.

Result: Fairy±i outperforms existing 2-bit methods in PPL and downstream tasks while maintaining efficiency.

Conclusion: This work pioneers a new direction for highly accurate, low-bit LLMs.

Abstract: Quantization-Aware Training (QAT) integrates quantization into the training loop, enabling LLMs to learn robust low-bit representations, and is widely recognized as one of the most promising research directions. All current QAT research focuses on minimizing quantization error on full-precision models, where the full-precision accuracy acts as an upper bound (accuracy ceiling). No existing method has even attempted to surpass this ceiling. To break this ceiling, we propose a new paradigm: raising the ceiling (full-precision model), and then still quantizing it efficiently into 2 bits. We propose Fairy$\pm i$, the first 2-bit quantization framework for complex-valued LLMs. Specifically, our method leverages the representational advantages of the complex domain to boost full-precision accuracy. We map weights to the fourth roots of unity $\{\pm 1, \pm i\}$, forming a perfectly symmetric and information-theoretically optimal 2-bit representation. Importantly, each quantized weight has either a zero real or imaginary part, enabling multiplication-free inference using only additions and element swaps. Experimental results show that Fairy$\pm i$ outperforms the ceiling of existing 2-bit quantization approaches in terms of both PPL and downstream tasks, while maintaining strict storage and compute efficiency. This work opens a new direction for building highly accurate and practical LLMs under extremely low-bit constraints.
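
The multiplication-free claim follows from elementary complex arithmetic: multiplying a + bi by one of {+1, -1, +i, -i} only negates and/or swaps the real and imaginary parts. The 2-bit code assignment below is an assumption; the identities themselves are exact.

```python
def apply_fairy_weight(code, re, im):
    """Multiply the complex activation (re + i*im) by a weight in
    {+1, -1, +i, -i} using only negation and swapping.
    Code: 0 -> +1, 1 -> -1, 2 -> +i, 3 -> -i."""
    if code == 0:
        return re, im        # (a+bi) * +1
    if code == 1:
        return -re, -im      # (a+bi) * -1
    if code == 2:
        return -im, re       # (a+bi) * +i = -b + ai
    return im, -re           # (a+bi) * -i =  b - ai

def fairy_dot(codes, res, ims):
    """A dot product with such weights reduces to additions of
    swapped/negated activation components: no multiplications."""
    acc_re = acc_im = 0.0
    for c, a, b in zip(codes, res, ims):
        r, i = apply_fairy_weight(c, a, b)
        acc_re += r
        acc_im += i
    return acc_re, acc_im
```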

[426] Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed

Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov

Main category: cs.LG

TL;DR: The paper analyzes AdaGrad/Adam’s limitations in handling heavy-tailed noise and shows gradient clipping improves their high-probability convergence.

DetailsMotivation: Adaptive methods like AdaGrad/Adam are vital for training deep learning models, but their performance under heavy-tailed noise is poorly understood.

Method: The study proves AdaGrad/Adam’s poor high-probability convergence with heavy-tailed noise and introduces gradient clipping to fix it. Theoretical bounds are derived for clipped versions.

Result: Clipped AdaGrad/Adam achieves better high-probability convergence with polylogarithmic dependence on confidence levels, even with delayed stepsizes.

Conclusion: Gradient clipping enhances AdaGrad/Adam’s robustness against heavy-tailed noise, supported by empirical evidence.

Abstract: Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence under such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed version) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. We extend our results to the case of AdaGrad/Adam with delayed stepsizes. Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam in handling the heavy-tailed noise.
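
The clipped AdaGrad-Norm update the paper analyzes is easy to state: clip the stochastic gradient to a fixed norm, accumulate the squared (clipped) gradient norms, and scale a single global stepsize by the square root of the accumulator. The sketch below follows that common definition; constants are illustrative.

```python
import torch

def clipped_adagrad_norm_step(params, grads, state, lr=0.1, clip=1.0):
    """One AdaGrad-Norm step with clip-by-norm applied to the gradient first."""
    gnorm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
    scale = min(1.0, clip / (gnorm + 1e-12))             # clip-by-norm factor
    state["b"] = state.get("b", 0.0) + (scale * gnorm) ** 2
    step = lr / (state["b"] ** 0.5 + 1e-12)              # global adaptive stepsize
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=-step * scale)
```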

[427] Boosting Cross-problem Generalization in Diffusion-Based Neural Combinatorial Solver via Inference Time Adaptation

Haoyu Lei, Kaiwen Zhou, Yinchuan Li, Zhitang Chen, Farzan Farnia

Main category: cs.LG

TL;DR: DIFU-Ada enables zero-shot cross-problem transfer and cross-scale generalization for diffusion-based NCO solvers without additional training.

DetailsMotivation: Address challenges in cross-scale and cross-problem generalization and high training costs in existing NCO methods.

Method: Proposes DIFU-Ada, a training-free inference time adaptation framework leveraging pre-defined guidance functions.

Result: Achieves competitive zero-shot transfer performance across TSP variants like PCTSP and OP.

Conclusion: DIFU-Ada bridges the gap in combinatorial optimization by enabling efficient generalization without retraining.

Abstract: Diffusion-based Neural Combinatorial Optimization (NCO) has demonstrated effectiveness in solving NP-complete (NPC) problems by learning discrete diffusion models for solution generation, eliminating hand-crafted domain knowledge. Despite their success, existing NCO methods face significant challenges in both cross-scale and cross-problem generalization, and high training costs compared to traditional solvers. While recent studies on diffusion models have introduced training-free guidance approaches that leverage pre-defined guidance functions for conditional generation, such methodologies have not been extensively explored in combinatorial optimization. To bridge this gap, we propose a training-free inference time adaptation framework (DIFU-Ada) that enables both the zero-shot cross-problem transfer and cross-scale generalization capabilities of diffusion-based NCO solvers without requiring additional training. We provide a theoretical analysis that helps explain the cross-problem transfer capability. Our experimental results demonstrate that a diffusion solver, trained exclusively on the Traveling Salesman Problem (TSP), can achieve competitive zero-shot transfer performance across different problem scales on TSP variants, such as Prize Collecting TSP (PCTSP) and the Orienteering Problem (OP), through inference time adaptation.

[428] Sample-efficient LLM Optimization with Reset Replay

Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian

Main category: cs.LG

TL;DR: LoRR enhances LLM optimization by improving sample efficiency and reducing primacy bias through high-replay training and periodic resets.

DetailsMotivation: Address low sample efficiency and primacy bias in RL and preference optimization methods for LLMs.

Method: Introduces LoRR, a plugin combining high-replay training, periodic resets, and hybrid optimization (SFT + preference-based losses).

Result: LoRR boosts performance on reasoning benchmarks, matching or outperforming complex RL methods.

Conclusion: LoRR is a practical, efficient paradigm for LLM finetuning, maximizing performance with limited data.

Abstract: Recent advancements in post-training Large Language Models (LLMs), particularly through Reinforcement Learning (RL) and preference optimization methods, are key drivers for enhancing their reasoning capabilities. However, these methods are often plagued by low sample efficiency and a susceptibility to primacy bias, where overfitting to initial experiences degrades policy quality and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR's core mechanism enables training at a high replay number, maximizing the utility of each collected data batch. To counteract the risk of overfitting inherent in high-replay training, LoRR incorporates a periodic reset strategy with reusing initial data, which preserves network plasticity. Furthermore, it leverages a hybrid optimization objective, combining supervised fine-tuning (SFT) and preference-based losses to further bolster data exploitation. Our extensive experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO approach augmented with LoRR achieves comparable performance on challenging math tasks, outperforming some complex and computationally intensive RL-based algorithms. These findings highlight that LoRR offers a practical, sample-efficient, and highly effective paradigm for LLM finetuning, unlocking greater performance from limited data.

[429] Minimax Optimality in Contextual Dynamic Pricing with General Valuation Models

Xueping Gong, Wei You, Jiheng Zhang

Main category: cs.LG

TL;DR: The paper introduces a minimax-optimal algorithm for contextual dynamic pricing, leveraging Lipschitz continuity and discretization to achieve tight confidence bounds without needing the Lipschitz constant. It extends to general function classes and matches minimax lower bounds, with empirical validation.

DetailsMotivation: To address the challenge of personalized pricing with unknown noise distributions and bounded valuations, ensuring robust performance without prior knowledge of the noise or Lipschitz constant.

Method: Proposes a layered data partitioning approach with discretized candidate prices, using offline regression oracles for general function classes and tight confidence bounds.

Result: Achieves a regret upper bound matching the minimax lower bound up to logarithmic factors, with improved guarantees under additional structures like linear models or smoothness.

Conclusion: The method outperforms benchmarks, validated by numerical experiments, and generalizes well across various settings, offering a robust solution for contextual dynamic pricing.

Abstract: We study contextual dynamic pricing, where a decision maker posts personalized prices based on observable contexts and receives binary purchase feedback indicating whether the customer’s valuation exceeds the price. Each valuation is modeled as an unknown latent function of the context, corrupted by independent and identically distributed market noise from an unknown distribution. Relying only on Lipschitz continuity of the noise distribution and bounded valuations, we propose a minimax-optimal algorithm. To accommodate the unknown distribution, our method discretizes the relevant noise range to form a finite set of candidate prices, then applies layered data partitioning to obtain confidence bounds substantially tighter than those derived via the elliptical-potential lemma. A key advantage is that estimation bias in the valuation function cancels when comparing upper confidence bounds, eliminating the need to know the Lipschitz constant. The framework extends beyond linear models to general function classes through offline regression oracles. Our regret analysis depends solely on the oracle’s estimation error, typically governed by the statistical complexity of the class. These techniques yield a regret upper bound matching the minimax lower bound up to logarithmic factors. Furthermore, we refine these guarantees under additional structures – e.g., linear valuation models, second-order smoothness, sparsity, and known noise distribution or observable valuations – and compare our bounds and assumptions with prior dynamic-pricing methods. Finally, numerical experiments corroborate the theory and show clear improvements over benchmark methods.

[430] Tuning-Free Online Robust Principal Component Analysis through Implicit Regularization

Lakshmi Jayalal, Gokularam Muthukrishnan, Sheetal Kalyani

Main category: cs.LG

TL;DR: The paper proposes a tuning-free OR-PCA method using implicit regularization via modified gradient descents, eliminating dataset-sensitive parameter tuning.

DetailsMotivation: The standard OR-PCA relies on explicit regularizers requiring dataset-sensitive tuning, which limits scalability. The goal is to remove this dependency.

Method: Three modified gradient descent versions are used to implicitly encourage sparsity and low-rank structures, making OR-PCA tuning-free.

Result: The method performs comparably or better than tuned OR-PCA on simulated and real-world datasets.

Conclusion: Tuning-free OR-PCA enhances scalability for large datasets by eliminating the need for dataset-dependent parameter tuning.

Abstract: The performance of the standard Online Robust Principal Component Analysis (OR-PCA) technique depends on the optimal tuning of explicit regularizers, and this tuning is dataset sensitive. We aim to remove the dependency on these tuning parameters by using implicit regularization. We propose to use the implicit regularization effect of various modified gradient descents to make OR-PCA tuning free. Our method incorporates three different versions of modified gradient descent that separately but naturally encourage sparsity and low-rank structures in the data. The proposed method performs comparably to, or better than, the tuned OR-PCA for both simulated and real-world datasets. Tuning-free OR-PCA makes it more scalable for large datasets since we do not require dataset-dependent parameter tuning.

[431] HGAurban: Heterogeneous Graph Autoencoding for Urban Spatial-Temporal Learning

Qianru Zhang, Xinyi Gao, Haixin Wang, Dong Huang, Siu-Ming Yiu, Hongzhi Yin

Main category: cs.LG

TL;DR: HGAurban is a novel spatial-temporal graph masked autoencoder for robust urban data representation, addressing noise and sparsity challenges.

DetailsMotivation: Existing neural networks struggle with noisy and sparse spatial-temporal data, limiting meaningful region representation learning.

Method: HGAurban uses a heterogeneous graph encoder and masked autoencoder to learn diverse spatial relationships and dynamic temporal correlations.

Result: The framework outperforms state-of-the-art methods in spatiotemporal tasks and handles real-world data challenges effectively.

Conclusion: HGAurban provides a robust solution for urban data representation, improving performance in spatial-temporal applications.

Abstract: Spatial-temporal graph representations play a crucial role in urban sensing applications, including traffic analysis, human mobility behavior modeling, and citywide crime prediction. However, a key challenge lies in the noisy and sparse nature of spatial-temporal data, which limits existing neural networks’ ability to learn meaningful region representations in the spatial-temporal graph. To overcome these limitations, we propose HGAurban, a novel heterogeneous spatial-temporal graph masked autoencoder that leverages generative self-supervised learning for robust urban data representation. Our framework introduces a spatial-temporal heterogeneous graph encoder that extracts region-wise dependencies from multi-source data, enabling comprehensive modeling of diverse spatial relationships. Within our self-supervised learning paradigm, we implement a masked autoencoder that jointly processes node features and graph structure. This approach automatically learns heterogeneous spatial-temporal patterns across regions, significantly improving the representation of dynamic temporal correlations. Comprehensive experiments across multiple spatiotemporal mining tasks demonstrate that our framework outperforms state-of-the-art methods and robustly handles real-world urban data challenges, including noise and sparsity in both spatial and temporal dimensions.

[432] VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models

Suhas G Hegde, Shilpy Kaur, Aruna Tiwari

Main category: cs.LG

TL;DR: VectorFit introduces adaptive training of singular vectors and biases in pre-trained weights, outperforming PEFT methods with 9x fewer parameters.

DetailsMotivation: Address the performance gap between PEFT methods and full fine-tuning by leveraging existing knowledge in pre-trained weights.

Method: Adaptively trains singular vectors and biases of pre-trained weights to create high-rank incremental weight matrices.

Result: Achieves superior performance with 9x fewer trainable parameters than leading PEFT methods across 19 datasets.

Conclusion: VectorFit is a highly parameter-efficient method that bridges the gap to full fine-tuning.

Abstract: Popular PEFT methods reduce trainable parameter count for fine-tuning by parameterizing new low-rank or sparse trainable weights in parallel to the frozen pre-trained weights $W$. However, these weights are trained from scratch, and there exists a performance gap between these methods and full fine-tuning, especially in low-budget settings. We introduce VectorFit, a new way of parameterization that efficiently utilizes the existing knowledge embedded in $W$ by adaptively training their singular vectors and biases. We show that utilizing the structural and transformational properties of $W$ in this way can lead to high-rank incremental weight matrices $\Delta W$, comparable to that of full fine-tuning. VectorFit delivers superior results with 9$\boldsymbol\times$ fewer trainable parameters than the leading PEFT methods. Through comprehensive experiments across 19 datasets covering a wide range of language and vision tasks such as natural language understanding and generation, question answering, image classification, and image generation, we demonstrate that VectorFit surpasses baselines in terms of performance as a function of parameter-efficiency.
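
One way to read "adaptively training singular vectors and biases" is to decompose each pre-trained weight once and expose only a small number of vectors to the optimizer. A hedged PyTorch sketch under that reading (training the singular-value vector while freezing the SVD basis is an assumption about the method, and the class name is invented):

```python
import torch
import torch.nn as nn

class VectorFitLinear(nn.Module):
    """Hypothetical sketch: freeze U and V from the SVD of a pre-trained
    linear weight, and train only the singular-value vector and the bias
    (assumes the wrapped layer has a bias term)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        U, s, Vh = torch.linalg.svd(linear.weight.detach(), full_matrices=False)
        self.register_buffer("U", U)     # frozen left singular vectors
        self.register_buffer("Vh", Vh)   # frozen right singular vectors
        self.s = nn.Parameter(s)         # trainable spectrum (a single vector)
        self.bias = nn.Parameter(linear.bias.detach().clone())

    def forward(self, x):
        W = self.U @ torch.diag(self.s) @ self.Vh  # re-assembled weight
        return x @ W.T + self.bias
```

Because the frozen basis is full-size, changing the spectrum perturbs the whole weight, which is consistent with the paper's claim of high-rank incremental updates $\Delta W$ from very few trainable parameters.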

[433] Federated Time Series Generation on Feature and Temporally Misaligned Data

Zhi Wen Soi, Chenrui Fan, Aditya Shankar, Abele Mălan, Lydia Y. Chen

Main category: cs.LG

TL;DR: FedTDD is a federated time series diffusion model addressing misaligned timesteps and features by exchanging synthetic outputs instead of model parameters, improving local imputations and achieving significant performance gains.

DetailsMotivation: Existing federated time series models assume perfect alignment, which is unrealistic. FedTDD addresses the challenge of distributed time series data with misaligned features and timesteps.

Method: FedTDD uses a data distillation and aggregation framework, exchanging synthetic outputs to learn correlations across clients. A global distiller network iteratively improves by leveraging shared synthetic data.

Result: FedTDD is competitive with centralized training and achieves 79.4% and 62.8% improvements over local training in Context-FID and Correlational scores, respectively.

Conclusion: FedTDD effectively handles misaligned time series data in federated learning by sharing synthetic outputs, enhancing local imputations and overall performance.

Abstract: Distributed time series data presents a challenge for federated learning, as clients often possess different feature sets and have misaligned time steps. Existing federated time series models are limited by the assumption of perfect temporal or feature alignment across clients. In this paper, we propose FedTDD, a novel federated time series diffusion model that jointly learns a synthesizer across clients. At the core of FedTDD is a novel data distillation and aggregation framework that reconciles the differences between clients by imputing the misaligned timesteps and features. In contrast to traditional federated learning, FedTDD learns the correlation across clients’ time series through the exchange of local synthetic outputs instead of model parameters. A coordinator iteratively improves a global distiller network by leveraging shared knowledge from clients through the exchange of synthetic data. As the distiller becomes more refined over time, it subsequently enhances the quality of the clients’ local feature estimates, allowing each client to then improve its local imputations for missing data using the latest, more accurate distiller. Experimental results on five datasets demonstrate FedTDD’s effectiveness compared to centralized training, and the effectiveness of sharing synthetic outputs to transfer knowledge of local time series. Notably, FedTDD achieves 79.4% and 62.8% improvement over local training in Context-FID and Correlational scores.

[434] FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

Philip Zmushko, Aleksandr Beznosikov, Martin Takáč, Samuel Horváth

Main category: cs.LG

TL;DR: FRUGAL is a memory-efficient optimization framework that splits gradients to perform low-dimensional updates with advanced algorithms while using state-free methods for remaining directions, outperforming existing methods in pre-training and fine-tuning.

DetailsMotivation: Address the high GPU memory demand caused by optimizer states in large language models and mitigate information loss from low-rank updates.

Method: FRUGAL uses gradient splitting for low-dimensional updates (e.g., Adam) and state-free methods (e.g., SGD) for remaining directions, integrating with low-rank selection techniques like GaLore and BAdam.

Result: Achieves state-of-the-art performance in pre-training and fine-tuning under fixed memory budgets, with theoretical convergence guarantees.

Conclusion: FRUGAL balances memory efficiency and performance, offering a superior alternative to existing low-rank update methods.

Abstract: With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al., 2024)), and blockwise optimization (BAdam (Luo et al., 2024)) have been proposed. However, in all these algorithms, the $\textit{effective rank of the weight updates remains low}$, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce $\texttt{FRUGAL}$ ($\textbf{F}$ull-$\textbf{R}$ank $\textbf{U}$pdates with $\textbf{G}$r$\textbf{A}$dient sp$\textbf{L}$itting), a new memory-efficient optimization framework. $\texttt{FRUGAL}$ leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD (Bernstein et al., 2018). Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches across various fixed memory budgets, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
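
The gradient-splitting idea reduces to: run a stateful optimizer on a low-dimensional slice of the gradient and a state-free rule everywhere else, so optimizer state is only kept for the slice. A hedged sketch (top-k-by-magnitude selection and the omitted Adam bias correction are simplifications; FRUGAL pairs with projection-based selectors such as GaLore or BAdam):

```python
import torch

def frugal_style_step(p, grad, state, k, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One illustrative gradient-splitting step: Adam-like update on the k
    selected coordinates, plain SGD on the rest."""
    flat = grad.flatten()
    idx = torch.topk(flat.abs(), k).indices   # stand-in for the low-dim subspace
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True

    # Stateful (Adam-like) update only where state is kept.
    state["m"][mask] = betas[0] * state["m"][mask] + (1 - betas[0]) * flat[mask]
    state["v"][mask] = betas[1] * state["v"][mask] + (1 - betas[1]) * flat[mask] ** 2
    update = torch.zeros_like(flat)
    update[mask] = state["m"][mask] / (state["v"][mask].sqrt() + eps)
    # State-free SGD along the remaining directions.
    update[~mask] = flat[~mask]
    p.data -= lr * update.view_as(p)

# Initialize once per parameter before the first step:
# state = {"m": torch.zeros(p.numel()), "v": torch.zeros(p.numel())}
```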

[435] Efficient Distributed Optimization under Heavy-Tailed Noise

Su Hyeong Lee, Manzil Zaheer, Tian Li

Main category: cs.LG

TL;DR: TailOPT is a framework addressing heavy-tailed noise in distributed optimization, featuring efficient variants like $Bi^2Clip$ for adaptive performance without extra costs.

DetailsMotivation: Heavy-tailed stochastic gradient noise in distributed optimization, especially in attention-based models, hinders effective training.

Method: TailOPT uses adaptive optimization or clipping techniques, with $Bi^2Clip$ performing coordinate-wise clipping at inner and outer optimizers.

Result: TailOPT achieves convergence under heavy-tailed noise and outperforms state-of-the-art methods in language tasks.

Conclusion: TailOPT, particularly $Bi^2Clip$, offers an efficient solution for heavy-tailed noise in distributed optimization.

Abstract: Distributed optimization has become the default training paradigm in modern machine learning due to the growing scale of models and datasets. To mitigate communication overhead, local updates are often applied before global aggregation, resulting in a nested optimization approach with inner and outer steps. However, heavy-tailed stochastic gradient noise remains a significant challenge, particularly in attention-based models, hindering effective training. In this work, we propose TailOPT, an efficient framework designed to address heavy-tailed noise by leveraging adaptive optimization or clipping techniques. We establish convergence guarantees for the TailOPT framework under heavy-tailed noise with potentially unbounded gradient variance and local updates. Among its variants, we highlight a memory- and communication-efficient instantiation, which we call $Bi^2Clip$, which performs coordinate-wise clipping at both the inner and outer optimizers, achieving adaptive-like performance (e.g., Adam) without the cost of maintaining or transmitting additional gradient statistics. Empirically, TailOPT, including $Bi^2Clip$, demonstrates superior performance on several language tasks and models, outperforming state-of-the-art methods.
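
The $Bi^2Clip$ idea is easy to sketch: entry-wise clipping applied twice, once to client gradients (inner) and once to the aggregated server delta (outer), with no extra optimizer statistics kept or transmitted. Thresholds and learning rates below are illustrative:

```python
import torch

def coord_clip(g, tau):
    # Coordinate-wise (entry-wise) clipping to [-tau, tau].
    return torch.clamp(g, -tau, tau)

# Inner optimizer (client side): state-free SGD on clipped gradients.
def local_step(param, grad, lr_in, tau_in):
    param -= lr_in * coord_clip(grad, tau_in)

# Outer optimizer (server side): clip the averaged client delta the same way.
def server_step(param, client_deltas, lr_out, tau_out):
    avg_delta = torch.stack(client_deltas).mean(dim=0)
    param += lr_out * coord_clip(avg_delta, tau_out)
```

Clipping every coordinate separately dampens the heavy-tailed entries, which is what buys Adam-like behavior without Adam's per-parameter second-moment state.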

[436] A Market for Accuracy: Classification under Competition

Ohad Einav, Nir Rosenfeld

Main category: cs.LG

TL;DR: The paper studies machine learning in competitive markets, proposing a method for classification that maximizes market share while benefiting providers and consumers, and ensuring market stability.

DetailsMotivation: Traditional learning approaches ignore competition among providers, which affects providers, consumers, and the market. The work aims to address this gap.

Method: Proposes a classification method for competitive markets, focusing on market share maximization and considering timing of market entry and model updates.

Result: The approach benefits providers and consumers, ensures market stability, and converges quickly to equilibrium across various domains.

Conclusion: The method effectively addresses learning in competitive markets, balancing provider and consumer interests while maintaining market stability.

Abstract: Machine learning models play a key role for service providers looking to gain market share in consumer markets. However, traditional learning approaches do not take into account the existence of additional providers, who compete with each other for consumers. Our work aims to study learning in this market setting, as it affects providers, consumers, and the market itself. We begin by analyzing such markets through the lens of the learning objective, and show that accuracy cannot be the only consideration. We then propose a method for classification under competition, so that a learner can maximize market share in the presence of competitors. We show that our approach benefits the providers as well as the consumers, and find that the timing of market entry and model updates can be crucial. We display the effectiveness of our approach across a range of domains, from simple distributions to noisy datasets, and show that the market as a whole remains stable by converging quickly to an equilibrium.

[437] Goal-Oriented Time-Series Forecasting: Foundation Framework Design

Luca-Andrei Fechete, Mohamed Sana, Fadhel Ayed, Nicola Piovesan, Wenjie Li, Antonio De Domenico, Tareq Si Salem

Main category: cs.LG

TL;DR: A training method for time-series forecasting adapts focus to application-specific regions of interest without retraining, improving accuracy and downstream performance.

DetailsMotivation: Current methods minimize overall error but ignore varying importance of forecast ranges in applications.

Method: Partitions prediction space into segments during training, dynamically reweighting and aggregating them to emphasize target ranges.

Result: Improves forecast accuracy in regions of interest and enhances downstream task performance.

Conclusion: Enables better integration of predictive modeling with real-world decision-making.

Abstract: Conventional time-series forecasting methods typically aim to minimize overall prediction error, without accounting for the varying importance of different forecast ranges in downstream applications. We propose a training methodology that enables forecasting models to adapt their focus to application-specific regions of interest at inference time, without retraining. The approach partitions the prediction space into fine-grained segments during training, which are dynamically reweighted and aggregated to emphasize the target range specified by the application. Unlike prior methods that predefine these ranges, our framework supports flexible, on-demand adjustments. Experiments on standard benchmarks and a newly collected wireless communication dataset demonstrate that our method not only improves forecast accuracy within regions of interest but also yields measurable gains in downstream task performance. These results highlight the potential for closer integration between predictive modeling and decision-making in real-world systems.
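
The segment idea can be pictured as a per-bin weighting of the error over the prediction space. In the paper the segments are learned during training and reweighted on demand at inference; the static loss below only illustrates the weighting itself, with placeholder bin edges and weights:

```python
import numpy as np

def range_weighted_mse(y_true, y_pred, bin_edges, bin_weights):
    """Illustrative range-weighted error: each target is assigned to a bin of
    the prediction space and its squared error is scaled by that bin's weight,
    so emphasis shifts to an application's region of interest.
    bin_edges and bin_weights are NumPy arrays."""
    bins = np.clip(np.digitize(y_true, bin_edges) - 1, 0, len(bin_weights) - 1)
    return float(np.mean(bin_weights[bins] * (y_true - y_pred) ** 2))
```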

[438] Learning Classifiers That Induce Markets

Yonatan Sommer, Ivri Hikri, Lotan Amit, Nir Rosenfeld

Main category: cs.LG

TL;DR: The paper explores how deploying classifiers can create markets for features, influencing user behavior and costs, and proposes a framework to study and compute these dynamics.

DetailsMotivation: To challenge the assumption that costs in strategic classification are fixed, showing they can emerge from classifier deployment and market formation.

Method: Extends strategic classification to include market-induced costs, analyzes the learning task, devises an algorithm for market prices, and proposes a differentiable learning framework.

Result: Demonstrates how classifiers can induce feature markets, with experiments validating the novel setting and approach.

Conclusion: Classifier deployment can dynamically shape feature costs through market mechanisms, requiring new learning frameworks to account for these effects.

Abstract: When learning is used to inform decisions about humans, such as for loans, hiring, or admissions, this can incentivize users to strategically modify their features, at a cost, to obtain positive predictions. The common assumption is that the function governing costs is exogenous, fixed, and predetermined. We challenge this assumption, and assert that costs can emerge as a result of deploying a classifier. Our idea is simple: when users seek positive predictions, this creates demand for important features; and if features are available for purchase, then a market will form, and competition will give rise to prices. We extend the strategic classification framework to support this notion, and study learning in a setting where a classifier can induce a market for features. We present an analysis of the learning task, devise an algorithm for computing market prices, propose a differentiable learning framework, and conduct experiments to explore our novel setting and approach.

[439] From Actions to Words: Towards Abstractive-Textual Policy Summarization in RL

Sahar Admoni, Assaf Hallak, Yftah Ziser, Omer Ben-Porat, Ofra Amir

Main category: cs.LG

TL;DR: SySLLM uses LLMs to generate abstractive-textual summaries of RL policies, improving user understanding over traditional demonstration-based methods.

DetailsMotivation: RL policies are hard to explain due to their complexity, undermining user trust. Current methods (e.g., videos) are limited and place interpretation burden on users.

Method: SySLLM leverages LLMs to synthesize state-action trajectories into structured textual summaries without prior training.

Result: SySLLM captures expert-identified insights (e.g., goal preferences) and is preferred by 75.5% of users over demonstration-based summaries.

Conclusion: SySLLM offers a scalable, user-friendly alternative for explaining RL policies, enhancing trust and comprehension.

Abstract: Policies generated by Reinforcement Learning (RL) algorithms are difficult to explain to users, as they emerge from the interaction of complex reward structures and neural network representations. Consequently, analyzing and predicting agent behavior can be challenging, undermining user trust in real-world applications. To facilitate user understanding, current methods for global policy summarization typically rely on videos that demonstrate agent behavior in a subset of world states. However, users can only watch a limited number of demonstrations, constraining their understanding. Moreover, these methods place the burden of interpretation on users by presenting raw behaviors rather than synthesizing them into coherent patterns. To resolve these issues, we introduce SySLLM (Synthesized Summary using Large Language Models), advocating for a new paradigm of abstractive-textual policy explanations. By leveraging Large Language Models (LLMs), which possess extensive world knowledge and pattern synthesis capabilities, SySLLM generates textual summaries that provide structured and comprehensible explanations of agent policies. SySLLM demonstrates that LLMs can interpret spatio-temporally structured descriptions of state-action trajectories from an RL agent and generate valuable policy insights in a zero-shot setting, without any prior knowledge or fine-tuning. Our evaluation shows that SySLLM captures key insights, such as goal preferences and exploration strategies, that were also identified by human experts. Furthermore, in a large-scale user study (with 200 participants), SySLLM summaries were preferred over demonstration-based summaries (HIGHLIGHTS) by a clear majority (75.5%) of participants.

[440] Adaptive Budgeted Multi-Armed Bandits for IoT with Dynamic Resource Constraints

Shubham Vaishnav, Praveen Kumar Donta, Sindri Magnússon

Main category: cs.LG

TL;DR: A Budgeted Multi-Armed Bandit framework for IoT with dynamic constraints, using a decaying violation budget and Budgeted UCB algorithm, achieves sublinear regret and better constraint satisfaction.

DetailsMotivation: Current IoT systems struggle with dynamic resource constraints, needing adaptive solutions for real-time performance.

Method: Proposes Budgeted UCB algorithm with a decaying violation budget to balance performance and constraint compliance.

Result: Theoretical guarantees show sublinear regret and logarithmic violations; simulations confirm faster adaptation and better constraint satisfaction.

Conclusion: The framework enables adaptive, resource-aware IoT systems, outperforming standard methods.

Abstract: Internet of Things (IoT) systems increasingly operate in environments where devices must respond in real time while managing fluctuating resource constraints, including energy and bandwidth. Yet, current approaches often fall short in addressing scenarios where operational constraints evolve over time. To address these limitations, we propose a novel Budgeted Multi-Armed Bandit framework tailored for IoT applications with dynamic operational limits. Our model introduces a decaying violation budget, which permits limited constraint violations early in the learning process and gradually enforces stricter compliance over time. We present the Budgeted Upper Confidence Bound (UCB) algorithm, which adaptively balances performance optimization and compliance with time-varying constraints. We provide theoretical guarantees showing that Budgeted UCB achieves sublinear regret and logarithmic constraint violations over the learning horizon. Extensive simulations in a wireless communication setting show that our approach achieves faster adaptation and better constraint satisfaction than standard online learning methods. These results highlight the framework’s potential for building adaptive, resource-aware IoT systems.
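
A toy version of the decaying-violation-budget idea: tolerate excess cost up to a budget that shrinks over rounds, and play the reward-optimistic arm among those plausibly compliant. The $b_0/\sqrt{t}$ schedule and the confidence bounds are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def budgeted_ucb_choice(t, pulls, reward_sum, cost_sum, cost_limit, b0=1.0):
    """Toy arm selection with a decaying violation budget. Assumes each arm has
    been pulled at least once (a standard warm-up round)."""
    budget_t = b0 / np.sqrt(t)                  # decaying violation budget
    mean_r = reward_sum / np.maximum(pulls, 1)
    mean_c = cost_sum / np.maximum(pulls, 1)
    bonus = np.sqrt(2 * np.log(t) / np.maximum(pulls, 1))
    # Optimistic feasibility: an arm is allowed if its cost lower bound fits
    # within the limit plus the (shrinking) tolerated violation.
    feasible = (mean_c - bonus) <= cost_limit + budget_t
    ucb = np.where(feasible, mean_r + bonus, -np.inf)
    return int(np.argmax(ucb))
```

Early on the budget term dominates, so exploration may overshoot the cost limit; as $t$ grows the feasible set tightens toward strict compliance, matching the paper's sublinear-regret, logarithmic-violation story.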

[441] MIAT: Maneuver-Intention-Aware Transformer for Spatio-Temporal Trajectory Prediction

Chandra Raskoti, Iftekharul Islam, Xuan Wang, Weizi Li

Main category: cs.LG

TL;DR: The paper introduces MIAT, a Transformer-based model for vehicle trajectory prediction, improving accuracy by integrating maneuver intention awareness and spatiotemporal interaction modeling.

DetailsMotivation: Accurate trajectory prediction is crucial for autonomous driving in mixed traffic, but uncertainties from driving behaviors make it challenging.

Method: MIAT combines maneuver intention awareness with spatiotemporal interaction modeling, tested on the NGSIM dataset against transformer- and LSTM-based methods.

Result: MIAT improves short-horizon predictions by 4.7% and long-horizon by 1.6%, with an 11.1% boost in long-horizon performance using intention awareness.

Conclusion: MIAT enhances trajectory prediction accuracy, especially for long-horizon scenarios, with publicly available code and datasets.

Abstract: Accurate vehicle trajectory prediction is critical for safe and efficient autonomous driving, especially in mixed traffic environments when both human-driven and autonomous vehicles co-exist. However, uncertainties introduced by inherent driving behaviors – such as acceleration, deceleration, and left and right maneuvers – pose significant challenges for reliable trajectory prediction. We introduce a Maneuver-Intention-Aware Transformer (MIAT) architecture, which integrates a maneuver intention awareness control mechanism with spatiotemporal interaction modeling to enhance long-horizon trajectory predictions. We systematically investigate the impact of varying awareness of maneuver intention on both short- and long-horizon trajectory predictions. Evaluated on the real-world NGSIM dataset and benchmarked against various transformer- and LSTM-based methods, our approach achieves an improvement of up to 4.7% in short-horizon predictions and 1.6% in long-horizon predictions compared to other intention-aware benchmark methods. Moreover, by leveraging the intention-awareness control mechanism, MIAT realizes an 11.1% performance boost in long-horizon predictions, with a modest drop in short-horizon performance. The source code and datasets are available at https://github.com/cpraskoti/MIAT.

[442] Rethinking Client-oriented Federated Graph Learning

Zekai Chen, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang

Main category: cs.LG

TL;DR: FedC4 is a novel Federated Graph Learning (FGL) method that improves Client-Client (C-C) collaboration by using graph condensation to reduce communication costs and privacy risks, outperforming existing baselines.

DetailsMotivation: Existing C-C methods in FGL suffer from high communication costs and privacy risks due to redundant node representation broadcasting.

Method: FedC4 combines graph condensation with C-C collaboration, refining knowledge into synthetic embeddings and tailoring node representations to target clients.

Result: FedC4 outperforms state-of-the-art baselines in task performance and communication cost on eight public datasets.

Conclusion: FedC4 demonstrates superior efficiency and privacy preservation in FGL, offering a promising direction for future research.

Abstract: As a new distributed graph learning paradigm, Federated Graph Learning (FGL) facilitates collaborative model training across local systems while preserving data privacy. We review existing FGL approaches and categorize their optimization mechanisms into: (1) Server-Client (S-C), where clients upload local model parameters for server-side aggregation and global updates; (2) Client-Client (C-C), which allows direct exchange of information between clients and customizing their local training process. We reveal that C-C shows superior potential due to its refined communication structure. However, existing C-C methods broadcast redundant node representations, incurring high communication costs and privacy risks at the node level. To this end, we propose FedC4, which combines graph Condensation with C-C Collaboration optimization. Specifically, FedC4 employs graph condensation technique to refine the knowledge of each client’s graph into a few synthetic embeddings instead of transmitting node-level knowledge. Moreover, FedC4 introduces three novel modules that allow the source client to send distinct node representations tailored to the target client’s graph properties. Experiments on eight public real-world datasets show that FedC4 outperforms state-of-the-art baselines in both task performance and communication cost. Our code is now available on https://github.com/Ereshkigal1/FedC4.

[443] Responsible Machine Learning via Mixed-Integer Optimization

Nathan Justin, Qingshi Sun, Andrés Gómez, Phebe Vayanos

Main category: cs.LG

TL;DR: The paper introduces mixed-integer optimization (MIO) as a framework for embedding responsible ML principles like fairness and transparency into machine learning models while maintaining performance.

DetailsMotivation: The increasing deployment of ML in critical and sensitive areas raises concerns about fairness, transparency, and robustness, necessitating responsible ML methods.

Method: The paper uses mixed-integer optimization (MIO) to integrate responsible ML considerations directly into the learning process, enabling transparent models with domain-specific constraints.

Result: MIO provides practical strategies and tools for building responsible ML models, illustrated through examples and mathematical formulations.

Conclusion: The paper highlights MIO’s utility for responsible ML, discusses current limitations, and suggests future research directions.

Abstract: In the last few decades, Machine Learning (ML) has achieved significant success across domains ranging from healthcare, sustainability, and the social sciences, to criminal justice and finance. But its deployment in increasingly sophisticated, critical, and sensitive areas affecting individuals, the groups they belong to, and society as a whole raises critical concerns around fairness, transparency and robustness, among others. As the complexity and scale of ML systems and of the settings in which they are deployed grow, so does the need for responsible ML methods that address these challenges while providing guaranteed performance in deployment. Mixed-integer optimization (MIO) offers a powerful framework for embedding responsible ML considerations directly into the learning process while maintaining performance. For example, it enables learning of inherently transparent models that can conveniently incorporate fairness or other domain-specific constraints. This tutorial paper provides an accessible and comprehensive introduction to this topic, discussing both theoretical and practical aspects. It outlines some of the core principles of responsible ML, their importance in applications, and the practical utility of MIO for building ML models that align with these principles. Through examples and mathematical formulations, it illustrates practical strategies and available tools for efficiently solving MIO problems for responsible ML. It concludes with a discussion on current limitations and open research questions, providing suggestions for future work.
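
As a concrete illustration of the kind of formulation the tutorial covers, a sparse (hence inherently transparent) linear model can be written as an MIO with binary support indicators and a big-M link. This is a textbook example, not a formulation quoted from the paper:

```latex
% Sparse linear learning as a mixed-integer program: z_j indicates whether
% feature j is used; M bounds the weights; k caps model complexity.
\min_{w,\, z} \;\; \sum_{i=1}^{n} \ell\bigl(y_i,\; w^{\top} x_i\bigr)
\quad \text{s.t.} \quad
-M z_j \le w_j \le M z_j, \qquad
\sum_{j=1}^{d} z_j \le k, \qquad
z_j \in \{0,1\}.
```

Fairness or other domain-specific requirements then enter as additional linear constraints over group-wise predictions, which is what makes the MIO framing attractive for responsible ML.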

[444] Identifying Causal Direction via Variational Bayesian Compression

Quang-Duy Tran, Bao Duong, Phuoc Nguyen, Thin Nguyen

Main category: cs.LG

TL;DR: The paper proposes using variational Bayesian learning of neural networks to improve cause-effect identification by balancing model fitness and computational complexity, outperforming previous methods.

DetailsMotivation: The challenge of distinguishing cause and effect from observational data using the algorithmic Markov condition, where current methods compromise between model fitness and computational simplicity.

Method: Leveraging variational Bayesian learning of neural networks to interpret codelengths, improving model fitness without the high computational cost of Gaussian processes.

Result: The method shows promising performance enhancements on synthetic and real-world benchmarks in cause-effect identification.

Conclusion: The proposed approach effectively addresses limitations of previous methods, offering better performance in identifying causal relationships.

Abstract: Telling apart the cause and effect between two random variables with purely observational data is a challenging problem that finds applications in various scientific disciplines. A key principle utilized in this task is the algorithmic Markov condition, which postulates that the joint distribution, when factorized according to the causal direction, yields a more succinct codelength compared to the anti-causal direction. Previous approaches approximate these codelengths by relying on simple functions or Gaussian processes (GPs) with easily evaluable complexity, compromising between model fitness and computational complexity. To address these limitations, we propose leveraging the variational Bayesian learning of neural networks as an interpretation of the codelengths. This allows the improvement of model fitness, while maintaining the succinctness of the codelengths, and the avoidance of the significant computational complexity of the GP-based approaches. Extensive experiments on both synthetic and real-world benchmarks in cause-effect identification demonstrate the effectiveness of our proposed method, showing promising performance enhancements on several datasets in comparison to most related methods.

[445] Reinforcement Learning with Random Time Horizons

Enric Ribera Borrell, Lorenz Richter, Christof Schütte

Main category: cs.LG

TL;DR: The paper extends reinforcement learning to random time horizons, deriving policy gradient formulas for stochastic and deterministic policies, and shows improved convergence in experiments.

DetailsMotivation: Real-world applications often involve random stopping times, which are not addressed by classical reinforcement learning frameworks.

Method: The work derives policy gradient formulas for random time horizons, using trajectory and state-space perspectives, and connects to optimal control theory.

Result: Numerical experiments show the proposed formulas significantly improve optimization convergence over traditional methods.

Conclusion: The extension to random time horizons and derived gradient formulas enhance reinforcement learning for practical applications.

Abstract: We extend the standard reinforcement learning framework to random time horizons. While the classical setting typically assumes finite and deterministic or infinite runtimes of trajectories, we argue that multiple real-world applications naturally exhibit random (potentially trajectory-dependent) stopping times. Since those stopping times typically depend on the policy, their randomness has an effect on policy gradient formulas, which we (mostly for the first time) derive rigorously in this work both for stochastic and deterministic policies. We present two complementary perspectives, trajectory or state-space based, and establish connections to optimal control theory. Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches.

[446] Optimistic critics can empower small actors

Olya Mastikhina, Dhruv Sreenivas, Pablo Samuel Castro

Main category: cs.LG

TL;DR: Smaller actors in actor-critic methods lead to performance degradation and overfit critics, primarily due to value underestimation. Techniques to mitigate this issue are explored.

DetailsMotivation: To understand the implications of asymmetric actor-critic setups, particularly the use of smaller actors, and address performance degradation.

Method: Broad empirical investigations and analyses of asymmetric actor-critic architectures, focusing on the effects of smaller actors.

Result: Smaller actors cause performance degradation and overfit critics, linked to poor data collection from value underestimation.

Conclusion: Mitigating value underestimation is crucial for advancing asymmetric actor-critic methods, with the critic playing a key role.

Abstract: Actor-critic methods have been central to many of the recent advances in deep reinforcement learning. The most common approach is to use symmetric architectures, whereby both actor and critic have the same network topology and number of parameters. However, recent works have argued for the advantages of asymmetric setups, specifically with the use of smaller actors. We perform broad empirical investigations and analyses to better understand the implications of this and find that, in general, smaller actors result in performance degradation and overfit critics. Our analyses suggest poor data collection, due to value underestimation, as one of the main causes for this behavior, and further highlight the crucial role the critic can play in alleviating this pathology. We explore techniques to mitigate the observed value underestimation, which enables further research in asymmetric actor-critic methods.

[447] 15,500 Seconds: Lean UAV Classification Using EfficientNet and Lightweight Fine-Tuning

Andrew P. Berg, Qian Zhang, Mia Y. Wang

Main category: cs.LG

TL;DR: The paper proposes using pre-trained deep learning models, PEFT, and data augmentation to improve UAV audio classification on limited datasets, achieving 95.95% accuracy with EfficientNet-B0.

DetailsMotivation: Addressing data scarcity in UAV audio classification for reliable modality-specific systems.

Method: Integration of pre-trained models, PEFT, and data augmentation; evaluation of transformer-based and CNN architectures with cross-validation.

Result: EfficientNet-B0 with full fine-tuning and three augmentations achieved the highest accuracy (95.95%).

Conclusion: Lightweight architectures with PEFT and augmentations are effective for UAV audio classification; future work includes multimodal extension.

Abstract: As unmanned aerial vehicles (UAVs) become increasingly prevalent in both consumer and defense applications, the need for reliable, modality-specific classification systems grows in urgency. This paper addresses the challenge of data scarcity in UAV audio classification by expanding on prior work through the integration of pre-trained deep learning models, parameter-efficient fine-tuning (PEFT) strategies, and targeted data augmentation techniques. Using a custom dataset of 3,100 UAV audio clips (15,500 seconds) spanning 31 distinct drone types, we evaluate the performance of transformer-based and convolutional neural network (CNN) architectures under various fine-tuning configurations. Experiments were conducted with five-fold cross-validation, assessing accuracy, training efficiency, and robustness. Results show that full fine-tuning of the EfficientNet-B0 model with three augmentations achieved the highest validation accuracy (95.95%), outperforming both the custom CNN and transformer-based models like AST. These findings suggest that combining lightweight architectures with PEFT and well-chosen augmentations provides an effective strategy for UAV audio classification on limited datasets. Future work will extend this framework to multimodal UAV classification using visual and radar telemetry.
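
For orientation, the winning configuration amounts to ordinary full fine-tuning of a small pretrained backbone with a resized head. A minimal torchvision sketch, where the pretrained weights, the optimizer, and the audio-to-spectrogram input pipeline are assumptions; only the 31-way head mirrors the paper's 31 drone types:

```python
import torch
import torchvision

# EfficientNet-B0 with ImageNet weights; spectrogram-style inputs assumed.
model = torchvision.models.efficientnet_b0(weights="IMAGENET1K_V1")
# Replace the 1000-class ImageNet head with a 31-class drone-type head.
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 31)
# Full fine-tuning: every parameter stays trainable (lr is illustrative).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```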

[448] PromptTSS: A Prompting-Based Approach for Interactive Multi-Granularity Time Series Segmentation

Ching Chang, Ming-Chih Lo, Wen-Chih Peng, Tien-Fu Chen

Main category: cs.LG

TL;DR: PromptTSS is a novel framework for segmenting multivariate time series data across multiple granularities, addressing limitations of existing methods by using a unified model with a prompting mechanism.

DetailsMotivation: Existing time series segmentation methods struggle with handling multiple granularities and adapting to evolving patterns, which are critical for tasks like predictive maintenance.

Method: PromptTSS employs a unified model with a prompting mechanism that uses label and boundary information to guide segmentation, capturing both coarse- and fine-grained patterns dynamically.

Result: PromptTSS improves accuracy by 24.49% in multi-granularity segmentation, 17.88% in single-granularity segmentation, and up to 599.24% in transfer learning.

Conclusion: PromptTSS effectively addresses the challenges of multi-granularity segmentation and adaptability, demonstrating significant performance improvements.

Abstract: Multivariate time series data, collected across various fields such as manufacturing and wearable technology, exhibit states at multiple levels of granularity, from coarse-grained system behaviors to fine-grained, detailed events. Effectively segmenting and integrating states across these different granularities is crucial for tasks like predictive maintenance and performance optimization. However, existing time series segmentation methods face two key challenges: (1) the inability to handle multiple levels of granularity within a unified model, and (2) limited adaptability to new, evolving patterns in dynamic environments. To address these challenges, we propose PromptTSS, a novel framework for time series segmentation with multi-granularity states. PromptTSS uses a unified model with a prompting mechanism that leverages label and boundary information to guide segmentation, capturing both coarse- and fine-grained patterns while adapting dynamically to unseen patterns. Experiments show PromptTSS improves accuracy by 24.49% in multi-granularity segmentation, 17.88% in single-granularity segmentation, and up to 599.24% in transfer learning, demonstrating its adaptability to hierarchical states and evolving time series dynamics. Our code is available at https://github.com/blacksnail789521/PromptTSS.

[449] Fast Convergence for High-Order ODE Solvers in Diffusion Probabilistic Models

Daniel Zhengyu Huang, Jiaoyang Huang, Zhengjiang Lin

Main category: cs.LG

TL;DR: The paper analyzes the convergence of deterministic samplers derived from probability flow ODEs in diffusion models, proving bounds on the total variation distance between generated and target distributions.

DetailsMotivation: To rigorously analyze the convergence of deterministic samplers in diffusion models, addressing the interaction of numerical integration errors and score function approximation quality.

Method: Develops and analyzes $p$-th order Runge-Kutta schemes for probability flow ODEs, assuming bounded derivatives of the learned score function.

Result: Proves a bound on the total variation distance, dependent on score function error, data dimension, and solver step size. Numerical experiments confirm bounded score function derivatives.

Conclusion: The analysis provides theoretical guarantees for deterministic samplers in diffusion models, validated by empirical results.

Abstract: Diffusion probabilistic models generate samples by learning to reverse a noise-injection process that transforms data into noise. A key development is the reformulation of the reverse sampling process as a deterministic probability flow ordinary differential equation (ODE), which allows for efficient sampling using high-order numerical solvers. Unlike traditional time integrator analysis, the accuracy of this sampling procedure depends not only on numerical integration errors but also on the approximation quality and regularity of the learned score function, as well as their interaction. In this work, we present a rigorous convergence analysis of deterministic samplers derived from probability flow ODEs for general forward processes with arbitrary variance schedules. Specifically, we develop and analyze $p$-th order (exponential) Runge-Kutta schemes, under the practical assumption that the first and second derivatives of the learned score function are bounded. We prove that the total variation distance between the generated and target distributions can be bounded as \begin{align*} O\bigl(d^{\frac{7}{4}}\varepsilon_{\text{score}}^{\frac{1}{2}} +d(dH_{\max})^p\bigr), \end{align*} where $\varepsilon^2_{\text{score}}$ denotes the $L^2$ error in the score function approximation, $d$ is the data dimension, and $H_{\max}$ represents the maximum solver step size. Numerical experiments on benchmark datasets further confirm that the derivatives of the learned score function are bounded in practice.
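
For concreteness, a second-order instance of the schemes analyzed here is Heun's method applied to a VP-type probability-flow ODE. The drift below is a standard special case used for illustration; the paper treats general forward processes, arbitrary variance schedules, and $p$-th order (exponential) Runge-Kutta schemes:

```python
def heun_probability_flow(x, score_fn, beta, t_grid):
    """Second-order (Heun) Runge-Kutta integration of the VP probability-flow
    ODE dx/dt = -0.5 * beta(t) * (x + score(x, t)). `t_grid` runs from the
    terminal noise level down toward 0; `score_fn(x, t)` is the learned score."""
    def drift(x, t):
        return -0.5 * beta(t) * (x + score_fn(x, t))

    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = t1 - t0                    # negative when integrating backward
        k1 = drift(x, t0)
        k2 = drift(x + h * k1, t1)     # Euler predictor evaluated at t1
        x = x + 0.5 * h * (k1 + k2)    # Heun (trapezoidal) corrector
    return x
```

In the paper's bound, this corresponds to $p=2$: halving the maximum step size $H_{\max}$ cuts the discretization term by a factor of four, while the score-error term is untouched.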

[450] Discrepancy-Aware Graph Mask Auto-Encoder

Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Weigang Lu

Main category: cs.LG

TL;DR: The paper introduces DGMAE, a Discrepancy-Aware Graph Mask Auto-Encoder, to improve graph representation learning by reconstructing discrepancy information in heterophilic graphs.

DetailsMotivation: Existing methods fail in heterophilic graphs due to their focus on neighborhood similarity, ignoring node discrepancies, leading to indistinguishable representations.

Method: DGMAE reconstructs discrepancy information between neighboring nodes during masking to enhance representation distinguishability.

Result: Experiments on 17 datasets show DGMAE preserves node discrepancies and outperforms state-of-the-art methods in node classification, clustering, and graph classification.

Conclusion: DGMAE effectively addresses the limitations of existing methods in heterophilic graphs, demonstrating superior performance in graph self-supervised learning.

Abstract: Masked Graph Auto-Encoder, a powerful graph self-supervised training paradigm, has recently shown superior performance in graph representation learning. Existing works typically rely on node contextual information to recover the masked information. However, they fail to generalize well to heterophilic graphs where connected nodes may be not similar, because they focus only on capturing the neighborhood information and ignoring the discrepancy information between different nodes, resulting in indistinguishable node representations. In this paper, to address this issue, we propose a Discrepancy-Aware Graph Mask Auto-Encoder (DGMAE). It obtains more distinguishable node representations by reconstructing the discrepancy information of neighboring nodes during the masking process. We conduct extensive experiments on 17 widely-used benchmark datasets. The results show that our DGMAE can effectively preserve the discrepancies of nodes in low-dimensional space. Moreover, DGMAE significantly outperforms state-of-the-art graph self-supervised learning methods on three graph analytics tasks, including node classification, node clustering, and graph classification, demonstrating its remarkable superiority. The code of DGMAE is available at https://github.com/zhengziyu77/DGMAE.

[451] Transferable Parasitic Estimation via Graph Contrastive Learning and Label Rebalancing in AMS Circuits

Shan Shen, Shenglu Hua, Jiajun Zou, Jiawei Liu, Jianwang Zhai, Chuan Shi, Wenjian Yu

Main category: cs.LG

TL;DR: CircuitGCL is a graph contrastive learning framework for AMS circuits, addressing data scarcity and label imbalance with self-supervised learning and balanced loss functions, outperforming SOTA methods.

DetailsMotivation: The scarcity of design data, unbalanced label distribution, and diverse circuit implementations hinder robust and transferable circuit representations for tasks like parasitic estimation.

Method: CircuitGCL uses self-supervised learning with hyperspherical representation scattering and balanced loss functions (BMSE, BSCE) to enhance transferability and mitigate label disparities.

Result: CircuitGCL achieves significant improvements: 33.64% ~ 44.20% higher R² for edge regression and 0.9× ~ 2.1× F1-score gain for node classification.

Conclusion: CircuitGCL effectively addresses data and label challenges in AMS circuits, offering superior performance for parasitic estimation tasks.

Abstract: Graph representation learning on Analog-Mixed Signal (AMS) circuits is crucial for various downstream tasks, e.g., parasitic estimation. However, the scarcity of design data, the unbalanced distribution of labels, and the inherent diversity of circuit implementations pose significant challenges to learning robust and transferable circuit representations. To address these limitations, we propose CircuitGCL, a novel graph contrastive learning framework that integrates representation scattering and label rebalancing to enhance transferability across heterogeneous circuit graphs. CircuitGCL employs a self-supervised strategy to learn topology-invariant node embeddings through hyperspherical representation scattering, eliminating dependency on large-scale data. Simultaneously, balanced mean squared error (BMSE) and balanced softmax cross-entropy (BSCE) losses are introduced to mitigate label distribution disparities between circuits, enabling robust and transferable parasitic estimation. Evaluated on parasitic capacitance estimation (edge-level task) and ground capacitance classification (node-level task) across TSMC 28nm AMS designs, CircuitGCL outperforms all state-of-the-art (SOTA) methods, with the $R^2$ improvement of $33.64\% \sim 44.20\%$ for edge regression and F1-score gain of $0.9\times \sim 2.1\times$ for node classification. Our code is available at https://github.com/ShenShan123/CircuitGCL.
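
The BSCE component admits a compact sketch if it follows the standard balanced-softmax construction, which shifts each class logit by the log of its training frequency; that equivalence is an assumption based on the name:

```python
import torch
import torch.nn.functional as F

def balanced_softmax_ce(logits, targets, class_counts):
    """Balanced softmax cross-entropy: adding log class frequencies to the
    logits before the softmax keeps rare classes from being swamped by
    frequent ones. class_counts is an integer tensor of per-class totals."""
    adjusted = logits + torch.log(class_counts.float().clamp(min=1))
    return F.cross_entropy(adjusted, targets)
```

BMSE for the edge-regression task can plausibly be read the same way: squared errors reweighted against the label distribution, though the summary does not spell out its exact form.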

[452] Class-Proportional Coreset Selection for Difficulty-Separable Data

Elisa Tsai, Haizhong Zheng, Atul Prakash

Main category: cs.LG

TL;DR: The paper introduces class-difficulty separability and the CDSC measure to improve coreset selection, showing better performance with class-proportional methods in high-stakes domains.

DetailsMotivation: Existing coreset methods assume class-wise homogeneity in data difficulty, ignoring variations across classes, which can degrade performance in critical applications.

Method: The authors propose the Class Difficulty Separability Coefficient (CDSC) and class-proportional variants of sampling strategies to address class-difficulty variations.

Result: Class-proportional methods outperform class-agnostic ones, e.g., CCS-CP shows minimal performance drops (e.g., 2.58% accuracy loss) at 99% pruning.

Conclusion: Modeling class-difficulty separability leads to more effective and robust data pruning, especially in noisy or imbalanced datasets.

Abstract: High-quality training data is essential for building reliable and efficient machine learning systems. One-shot coreset selection addresses this by pruning the dataset while maintaining or even improving model performance, often relying on training-dynamics-based data difficulty scores. However, most existing methods implicitly assume class-wise homogeneity in data difficulty, overlooking variation in data difficulty across different classes. In this work, we challenge this assumption by showing that, in domains such as network intrusion detection and medical imaging, data difficulty often clusters by class. We formalize this as class-difficulty separability and introduce the Class Difficulty Separability Coefficient (CDSC) as a quantitative measure. We demonstrate that high CDSC values correlate with performance degradation in class-agnostic coreset methods, which tend to overrepresent easy majority classes while neglecting rare but informative ones. To address this, we introduce class-proportional variants of multiple sampling strategies. Evaluated on five diverse datasets spanning security and medical domains, our methods consistently achieve state-of-the-art performance. For instance, on CTU-13, at an extreme 99% pruning rate, a class-proportional variant of Coverage-centric Coreset Selection (CCS-CP) shows remarkable stability, with accuracy dropping only 2.58%, precision 0.49%, and recall 0.19%. In contrast, the class-agnostic CCS baseline, the next best method, suffers sharper declines of 7.59% in accuracy, 4.57% in precision, and 4.11% in recall. We further show that aggressive pruning enhances generalization in noisy, imbalanced, and large-scale datasets. Our results underscore that explicitly modeling class-difficulty separability leads to more effective, robust, and generalizable data pruning, particularly in high-stakes scenarios.
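
The class-proportional fix is simple to state: prune within each class at the global rate rather than pooling all difficulty scores across classes. A minimal sketch, where the scoring rule is whatever the base strategy (e.g., CCS) provides:

```python
import numpy as np

def class_proportional_coreset(scores, labels, keep_frac):
    """Illustrative class-proportional selection: apply the same keep fraction
    inside every class, ranking by the base strategy's difficulty score. This
    avoids the class-agnostic failure mode of overrepresenting easy majority
    classes when difficulty clusters by class (high CDSC)."""
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        k = max(1, int(round(keep_frac * len(idx))))
        top = idx[np.argsort(scores[idx])[-k:]]  # k highest-scored per class
        keep.extend(top.tolist())
    return np.array(sorted(keep))
```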

[453] MAP Estimation with Denoisers: Convergence Rates and Guarantees

Scott Pesme, Giacomo Meanti, Michael Arbel, Julien Mairal

Main category: cs.LG

TL;DR: The paper provides a theoretical justification for using pretrained denoisers as surrogates for the proximal operator in MAP optimization, proving convergence under log-concavity assumptions.

DetailsMotivation: Existing methods use denoisers heuristically without theoretical backing. This work aims to justify such practices by proving convergence under specific conditions.

Method: The authors propose a simple algorithm related to practical methods, interpreting it as gradient descent on smoothed proximal objectives.

Result: The algorithm provably converges to the proximal operator under a log-concavity assumption on the prior.

Conclusion: The analysis offers a theoretical foundation for empirically successful but previously heuristic denoising methods in MAP optimization.

Abstract: Denoiser models have become powerful tools for inverse problems, enabling the use of pretrained networks to approximate the score of a smoothed prior distribution. These models are often used in heuristic iterative schemes aimed at solving Maximum a Posteriori (MAP) optimization problems, where the proximal operator of the negative log-prior plays a central role. In practice, this operator is intractable, and practitioners plug in a pretrained denoiser as a surrogate, despite the lack of general theoretical justification for this substitution. In this work, we show that a simple algorithm, closely related to several used in practice, provably converges to the proximal operator under a log-concavity assumption on the prior $p$. We show that this algorithm can be interpreted as a gradient descent on smoothed proximal objectives. Our analysis thus provides a theoretical foundation for a class of empirically successful but previously heuristic methods.
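
A hedged sketch of the kind of algorithm analyzed here: by Tweedie's identity, a denoiser at noise level $\delta$ yields a score estimate of the $\delta$-smoothed prior, so gradient descent on the smoothed proximal objective $\|x-y\|^2/(2\tau) - \log p_\delta(x)$ only needs denoiser evaluations. The denoiser signature, step size, and iteration count are illustrative:

```python
import numpy as np

def map_with_denoiser(y, denoise, tau, delta, eta=0.1, n_iter=200):
    """Approximate prox of -tau * log p at y using a pretrained denoiser.
    (denoise(x, delta) - x) / delta**2 is the Tweedie estimate of the score
    of the delta-smoothed prior; we descend the smoothed proximal objective."""
    x = y.copy()
    for _ in range(n_iter):
        score = (denoise(x, delta) - x) / delta**2  # Tweedie score estimate
        grad = (x - y) / tau - score                # gradient of the objective
        x = x - eta * grad
    return x
```

Under the paper's log-concavity assumption on the prior, iterations of this form provably converge to the true proximal point, which is the theoretical footing such plug-and-play schemes previously lacked.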

[454] Self-Questioning Language Models

Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, Deepak Pathak

Main category: cs.LG

TL;DR: Large language models can self-improve by generating and solving their own questions without external data, using an asymmetric self-play framework called SQLM.

DetailsMotivation: To explore if language models can enhance reasoning skills autonomously by generating and solving their own questions, eliminating the need for curated datasets.

Method: Proposes SQLM: an asymmetric self-play framework with a proposer generating questions and a solver answering them, both trained via reinforcement learning. Rewards are based on question difficulty and correctness (via majority voting or unit tests for coding).

Result: Tested on three benchmarks (multiplication, algebra, programming), showing improvement without external data.

Conclusion: Language models can autonomously improve reasoning skills through self-generated problems, reducing reliance on curated datasets.

Abstract: Can large language models improve without external data – by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets.
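
The two reward signals are straightforward to sketch: majority voting stands in for correctness on the solver side, and the proposer is paid only for questions of intermediate difficulty. The difficulty band below is an illustrative choice, not the paper's tuned thresholds:

```python
from collections import Counter

def solver_rewards(answers):
    """Majority-vote pseudo-reward: answers agreeing with the plurality
    response get 1, everything else 0 (a proxy for correctness in the
    absence of ground truth; ties fall to whichever answer Counter lists
    first)."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def proposer_reward(rewards, lo=0.2, hi=0.8):
    """Reward the proposer when the question is neither too easy nor too
    hard: the solver success rate must land inside the band (lo, hi)."""
    rate = sum(rewards) / len(rewards)
    return 1.0 if lo < rate < hi else 0.0
```

For coding tasks, the paper replaces majority voting with proposer-generated unit tests, which makes the solver reward verifiable rather than consensus-based.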

[455] Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle

Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai

Main category: cs.LG

TL;DR: Shuffle-R1 improves RL fine-tuning efficiency for MLLMs by addressing Advantage Collapsing and Rollout Silencing through dynamic trajectory sampling and batch restructuring.

DetailsMotivation: Current RL pipelines for MLLMs suffer from inefficiencies due to Advantage Collapsing and Rollout Silencing, leading to suboptimal learning.

Method: Proposes Pairwise Trajectory Sampling and Advantage-based Trajectory Shuffle to enhance gradient signal and rollout exposure.

Result: Outperforms RL baselines on reasoning benchmarks with minimal overhead.

Conclusion: Data-centric adaptations like Shuffle-R1 are crucial for efficient RL training in MLLMs.

Abstract: Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address these issues, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLM.
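
One reading of Pairwise Trajectory Sampling, sketched below: pair the highest-advantage rollouts with the lowest-advantage ones so every retained pair carries a large advantage gap, counteracting Advantage Collapsing. The exact pairing rule is an assumption; the paper's construction may differ:

```python
import numpy as np

def pairwise_trajectory_sample(advantages, n_pairs):
    """Select (high, low) rollout index pairs with large advantage contrast,
    so sampled batches carry strong rather than near-zero gradient signal."""
    order = np.argsort(advantages)
    return list(zip(order[::-1][:n_pairs], order[:n_pairs]))
```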

[456] A Graph-Based Framework for Exploring Mathematical Patterns in Physics: A Proof of Concept

Massimiliano Romiti

Main category: cs.LG

TL;DR: A graph-based framework using neural networks and symbolic analysis explores and validates mathematical patterns in physics equations, achieving high accuracy in link prediction and generating cross-domain hypotheses.

DetailsMotivation: Traditional methods fail to fully explore the implicit network of mathematical relationships in physics equations, necessitating a systematic approach to discover and validate patterns.

Method: The study uses a weighted knowledge graph and Graph Attention Network to analyze 400 advanced physics equations, resolving notational ambiguities and predicting links.

Result: The framework achieved 97.4% AUC in link prediction, generated cross-domain hypotheses, and verified theoretical consistencies, uncovering new connections like analog gravity.

Conclusion: The system effectively transforms complex mathematical relationships into interpretable patterns, serving as both a hypothesis generator and knowledge auditor.

Abstract: The vast corpus of physics equations forms an implicit network of mathematical relationships that traditional analysis cannot fully explore. This work introduces a graph-based framework combining neural networks with symbolic analysis to systematically discover and validate mathematical patterns across physics domains. Starting from 659 equations, we performed rigorous semantic disambiguation to resolve notational polysemy affecting 213 equations, then focused on 400 advanced physics equations by excluding elementary mechanics to emphasize inter-branch connections of modern physics. This corpus was represented as a weighted knowledge graph where a Graph Attention Network achieved 97.4% AUC in link prediction, significantly outperforming classical baselines. The framework’s primary value emerges from its dual capability: generating hypotheses and auditing knowledge. First, it functions as a hypothesis generator, producing hundreds of candidate cross-domain connections, from blackbody radiation coupled with Navier-Stokes equations to radioactive decay linked with electromagnetic induction. Second, through symbolic analysis of 30 equation clusters, it serves as a computational auditor that verified established theory consistencies, synthesized the Magnetic Reynolds Number from electromagnetic-fluid coupling, and revealed how even parsing errors could potentially point toward legitimate research like analog gravity. This proof-of-concept intentionally over-generates candidates to ensure comprehensive exploration of mathematical possibility space. Even tautologies and errors serve scientific purposes: redundancy identification and knowledge base quality assessment. The system transforms the intractable combinatorial space into a filtered stream of mathematical patterns for human interpretation.

[457] LinguaFluid: Language Guided Fluid Control via Semantic Rewards in Reinforcement Learning

Aoming Liang, Chi Cheng, Dashuai Chen, Boai Sun, Dixia Fan

Main category: cs.LG

TL;DR: The paper introduces a semantically aligned RL method using SBERT to compute rewards based on textual goal descriptions, eliminating the need for manual reward engineering.

DetailsMotivation: Existing RL reward functions rely on heuristics or manual tuning, which is challenging in environments with hard-to-specify goals.

Method: Rewards are computed as cosine similarity between goal and episode descriptions using SBERT, replacing manual reward functions.

Result: The method achieves competitive control behavior without hand-crafted rewards and shows a correlation between language embeddings and Euclidean space.

Conclusion: This approach aligns agent behavior with natural language goals and paves the way for integrating LLMs into control applications.

Abstract: In the domain of scientific machine learning, designing effective reward functions remains a challenge in reinforcement learning (RL), particularly in environments where task goals are difficult to specify numerically. Reward functions in existing work are predominantly based on heuristics, manual engineering, or task-specific tuning. In this work, we introduce a semantically aligned reinforcement learning method where rewards are computed by aligning the current state with a target semantic instruction using Sentence-BERT (SBERT). Instead of relying on manually defined reward functions, the policy receives feedback computed as the cosine similarity between the textual description of the goal and the description of the state in the episode. We evaluated our approach in several environments and showed that the semantic reward can guide learning to achieve competitive control behavior, even in the absence of hand-crafted reward functions. Our study demonstrates a correlation between the language embedding space and the conventional Euclidean space. This framework opens new horizons for aligning agent behavior with natural language goals and lays the groundwork for a more seamless integration of large language models (LLMs) into fluid control applications.
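
A minimal sketch of the reward computation with the sentence-transformers library (the checkpoint name and the goal/state texts are illustrative assumptions, not taken from the paper):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical checkpoint; the paper uses an SBERT model but does not fix this one.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_reward(goal_text: str, state_text: str) -> float:
    """Reward = cosine similarity between goal and state descriptions."""
    goal_emb, state_emb = encoder.encode([goal_text, state_text], convert_to_tensor=True)
    return util.cos_sim(goal_emb, state_emb).item()

# Toy usage with made-up descriptions of a fluid-control episode.
r = semantic_reward(
    "suppress vortex shedding behind the cylinder",
    "the wake behind the cylinder is nearly steady",
)
print(f"semantic reward: {r:.3f}")
```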

[458] Hardness-Aware Dynamic Curriculum Learning for Robust Multimodal Emotion Recognition with Missing Modalities

Rui Liu, Haolin Zuo, Zheng Lian, Hongyu Yuan, Qi Fan

Main category: cs.LG

TL;DR: The paper proposes HARDY-MER, a framework for multimodal emotion recognition that dynamically adjusts training to focus on hard samples by evaluating their reconstruction difficulty and balancing learning between easy and hard instances.

DetailsMotivation: Existing methods for missing modalities in MER fail to account for varying reconstruction difficulties, limiting performance on hard samples.

Method: HARDY-MER uses a Multi-view Hardness Evaluation to quantify sample hardness (Direct and Indirect Hardness) and a Retrieval-based Dynamic Curriculum Learning strategy to adjust training focus.

Result: HARDY-MER outperforms existing methods in missing-modality scenarios on benchmark datasets.

Conclusion: The proposed framework effectively addresses the challenge of hard samples in MER, improving performance in missing-modality cases.

Abstract: Missing modalities have recently emerged as a critical research direction in multimodal emotion recognition (MER). Conventional approaches typically address this issue through missing modality reconstruction. However, these methods fail to account for variations in reconstruction difficulty across different samples, consequently limiting the model’s ability to handle hard samples effectively. To overcome this limitation, we propose a novel Hardness-Aware Dynamic Curriculum Learning framework, termed HARDY-MER. Our framework operates in two key stages: first, it estimates the hardness level of each sample, and second, it strategically emphasizes hard samples during training to enhance model performance on these challenging instances. Specifically, we first introduce a Multi-view Hardness Evaluation mechanism that quantifies reconstruction difficulty by considering both Direct Hardness (modality reconstruction errors) and Indirect Hardness (cross-modal mutual information). Meanwhile, we introduce a Retrieval-based Dynamic Curriculum Learning strategy that dynamically adjusts the training curriculum by retrieving samples with similar semantic information and balancing the learning focus between easy and hard instances. Extensive experiments on benchmark datasets demonstrate that HARDY-MER consistently outperforms existing methods in missing-modality scenarios. Our code will be made publicly available at https://github.com/HARDY-MER/HARDY-MER.
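
A minimal sketch of hardness-weighted sampling (the hardness scores and the mixing weight `lam` below are placeholders; the paper's Multi-view Hardness Evaluation combines modality reconstruction errors with a cross-modal mutual-information term):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder hardness scores for 1,000 training samples.
direct_hardness = torch.rand(1000)    # stand-in for modality reconstruction error
indirect_hardness = torch.rand(1000)  # stand-in for cross-modal mutual information
lam = 0.5                             # assumed mixing weight

hardness = direct_hardness + lam * indirect_hardness
weights = hardness / hardness.sum()   # harder samples are drawn more often

dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 4, (1000,)))
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for features, labels in loader:
    break  # one hardness-biased minibatch
```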

[459] Semantic-Enhanced Time-Series Forecasting via Large Language Models

Hao Liu, Chun Yang, Zhang xiaoxing, Xiaobin Zhu

Main category: cs.LG

TL;DR: SE-LLM enhances time series forecasting by embedding semantic knowledge of periodicity and anomalies into LLMs, improving interpretability and performance.

DetailsMotivation: Existing LLM-based time series forecasting lacks semantic alignment between linguistic knowledge and time series patterns, limiting performance.

Method: Proposes SE-LLM, which embeds time series characteristics into token embeddings and introduces a plugin module for long/short-term dependency modeling.

Result: SE-LLM outperforms SOTA methods, reducing computational costs while improving accuracy.

Conclusion: SE-LLM effectively bridges the modality gap, enhancing LLMs for time series forecasting.

Abstract: Time series forecasting plays a significant role in finance, energy, meteorology, and IoT applications. Recent studies have leveraged the generalization capabilities of large language models (LLMs) to adapt to time series forecasting, achieving promising performance. However, existing studies focus on token-level modal alignment, instead of bridging the intrinsic modality gap between linguistic knowledge structures and time series data patterns, greatly limiting the semantic representation. To address this issue, we propose a novel Semantic-Enhanced LLM (SE-LLM) that explores the inherent periodicity and anomalous characteristics of time series and embeds them into the semantic space to enhance the token embedding. This process enhances the interpretability of tokens for LLMs, thereby activating the potential of LLMs for temporal sequence analysis. Moreover, existing Transformer-based LLMs excel at capturing long-range dependencies but are weak at modeling short-term anomalies in time-series data. Hence, we propose a plugin module embedded within self-attention that models long-term and short-term dependencies to effectively adapt LLMs to time-series analysis. Our approach freezes the LLM and reduces the sequence dimensionality of tokens, greatly reducing computational consumption. Experiments demonstrate the superior performance of our SE-LLM against the state-of-the-art (SOTA) methods.
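
As a rough illustration of extracting the periodicity and anomaly cues that SE-LLM embeds into the semantic space (the 3-sigma threshold and FFT-based period estimate are generic choices we assume, not the paper's design):

```python
import numpy as np

def series_descriptors(x: np.ndarray, fs: float = 1.0):
    """Return the dominant period and indices of point anomalies in a 1-D series."""
    x = np.asarray(x, dtype=float)
    spec = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    k = spec[1:].argmax() + 1                    # skip the DC bin
    period = 1.0 / freqs[k]
    z = (x - x.mean()) / (x.std() + 1e-8)
    anomalies = np.flatnonzero(np.abs(z) > 3.0)  # simple 3-sigma rule
    return period, anomalies

t = np.arange(512)
series = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(512)
series[100] += 5.0                               # inject an anomaly
print(series_descriptors(series))                # period near 24, anomaly near index 100
```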

[460] WeChat-YATT: A Simple, Scalable and Balanced RLHF Trainer

Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Tingfeng Xian, Haoqiang Hong, Boqi Chen, Haotao Tian, Tao Yang, Yunsheng Shi, Feng Lin, Ting Yao

Main category: cs.LG

TL;DR: WeChat-YATT is a scalable RLHF training framework addressing controller scalability and dynamic workload challenges, improving throughput and GPU utilization.

DetailsMotivation: Current RLHF frameworks face scalability and efficiency issues in complex multimodal workflows and dynamic workloads.

Method: Introduces WeChat-YATT with a parallel controller model and dynamic resource placement schema.

Result: Achieves higher throughput and better GPU utilization, successfully deployed in WeChat.

Conclusion: WeChat-YATT effectively addresses RLHF training challenges and is robust in real-world applications.

Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent paradigm for training large language models and multimodal systems. Despite notable advances enabled by existing RLHF training frameworks, significant challenges remain in scaling to complex multimodal workflows and adapting to dynamic workloads. In particular, current systems often encounter limitations related to controller scalability when managing large models, as well as inefficiencies in orchestrating intricate RLHF pipelines, especially in scenarios that require dynamic sampling and resource allocation. In this paper, we introduce WeChat-YATT (Yet Another Transformer Trainer in WeChat), a simple, scalable, and balanced RLHF training framework specifically designed to address these challenges. WeChat-YATT features a parallel controller programming model that enables flexible and efficient orchestration of complex RLHF workflows, effectively mitigating the bottlenecks associated with centralized controller architectures and facilitating scalability in large-scale data scenarios. In addition, we propose a dynamic placement schema that adaptively partitions computational resources and schedules workloads, thereby significantly reducing hardware idle time and improving GPU utilization under variable training conditions. We evaluate WeChat-YATT across a range of experimental scenarios, demonstrating that it achieves substantial improvements in throughput compared to state-of-the-art RLHF training frameworks. Furthermore, WeChat-YATT has been successfully deployed to train models supporting WeChat product features for a large-scale user base, underscoring its effectiveness and robustness in real-world applications. We have open-sourced WeChat-YATT at https://www.github.com/tencent/WeChat-YATT.

[461] On Understanding of the Dynamics of Model Capacity in Continual Learning

Supriyo Chakraborty, Krishnan Raghavan

Main category: cs.LG

TL;DR: The paper introduces CLEMC to model the non-stationary stability-plasticity balance in continual learning, showing NN’s task representation ability diminishes with differing task distributions.

DetailsMotivation: Address the stability-plasticity dilemma in continual learning by characterizing the dynamic balance point.

Method: Develop a difference equation to model NN, task data, and optimization interplay, validated through experiments on various architectures.

Result: Effective capacity and stability-plasticity balance are non-stationary; NN performance drops with differing task distributions.

Conclusion: CLEMC provides insights into continual learning dynamics, applicable across diverse NN architectures.

Abstract: The stability-plasticity dilemma, closely related to a neural network’s (NN) capacity, i.e., its ability to represent tasks, is a fundamental challenge in continual learning (CL). Within this context, we introduce CL’s effective model capacity (CLEMC) that characterizes the dynamic behavior of the stability-plasticity balance point. We develop a difference equation to model the evolution of the interplay between the NN, task data, and optimization procedure. We then leverage CLEMC to demonstrate that the effective capacity, and by extension the stability-plasticity balance point, is inherently non-stationary. We show that regardless of the NN architecture or optimization method, a NN’s ability to represent new tasks diminishes when incoming task distributions differ from previous ones. We conduct extensive experiments to support our theoretical findings, spanning a range of architectures, from small feedforward and convolutional networks to medium-sized graph neural networks and transformer-based large language models with millions of parameters.

[462] M3-Net: A Cost-Effective Graph-Free MLP-Based Model for Traffic Prediction

Guangyin Jin, Sicong Lai, Xiaoshuai Hao, Mingtao Zhang, Jinlei Zhang

Main category: cs.LG

TL;DR: Proposes M3-Net, a cost-effective MLP-based model for traffic prediction, addressing limitations of existing methods by using time series/spatio-temporal embeddings and a novel MLP-Mixer with MoE.

DetailsMotivation: Existing deep learning methods for traffic prediction rely on complex models or complete network structures, limiting efficiency and scalability.

Method: Introduces M3-Net, a graph-free MLP model with time series/spatio-temporal embeddings and an MLP-Mixer with MoE.

Result: Outperforms existing methods in prediction performance and lightweight deployment on real datasets.

Conclusion: M3-Net offers a scalable, efficient solution for traffic prediction without relying on complex architectures or complete network data.

Abstract: Achieving accurate traffic prediction is a fundamental but challenging task in the development of current intelligent transportation systems. Most of the mainstream methods that have made breakthroughs in traffic prediction rely on spatio-temporal graph neural networks, spatio-temporal attention mechanisms, etc. The main challenges of the existing deep learning approaches are that they either depend on a complete traffic network structure or require intricate model designs to capture complex spatio-temporal dependencies. These limitations pose significant challenges for the efficient deployment and operation of deep learning models on large-scale datasets. To address these challenges, we propose a cost-effective graph-free Multilayer Perceptron (MLP) based model M3-Net for traffic prediction. Our proposed model not only employs time series and spatio-temporal embeddings for efficient feature processing but is also the first to introduce a novel MLP-Mixer architecture with a mixture-of-experts (MoE) mechanism. Extensive experiments conducted on multiple real datasets demonstrate the superiority of the proposed model in terms of prediction performance and lightweight deployment.
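
A compact sketch of an MLP-Mixer block whose channel MLP is replaced by a softly gated mixture of experts (layer sizes, the number of experts, and the soft gating are illustrative assumptions; the paper's exact design is not given here):

```python
import torch
import torch.nn as nn

class MoEChannelMLP(nn.Module):
    def __init__(self, dim, hidden, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (B, tokens, dim)
        w = torch.softmax(self.gate(x), dim=-1)              # (B, T, E) gating weights
        out = torch.stack([e(x) for e in self.experts], -1)  # (B, T, D, E)
        return (out * w.unsqueeze(2)).sum(-1)

class MixerBlock(nn.Module):
    def __init__(self, tokens, dim, token_hidden=64, chan_hidden=256):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, tokens))
        self.chan_mlp = MoEChannelMLP(dim, chan_hidden)

    def forward(self, x):                                    # (B, T, D)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))

y = MixerBlock(tokens=12, dim=64)(torch.randn(8, 12, 64))   # 12 sensors, 64-d embeddings
```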

[463] Interpretable Reward Model via Sparse Autoencoder

Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang

Main category: cs.LG

TL;DR: SARM integrates a Sparse Autoencoder into a reward model to improve interpretability and adaptability in aligning LLMs with human preferences.

DetailsMotivation: Traditional reward models lack interpretability and flexibility, hindering effective alignment of LLM behaviors with human values.

Method: SARM uses a pretrained Sparse Autoencoder to map LLM activations into an interpretable feature space, enabling transparent reward scoring.

Result: SARM provides feature-level attribution, adapts to preference shifts, and outperforms conventional reward models in alignment tasks.

Conclusion: SARM offers a scalable and interpretable solution for aligning LLMs with human preferences, addressing key limitations of existing reward models.

Abstract: Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.
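
A minimal sketch of the SARM idea, scoring rewards from sparse SAE features (the dimensions, ReLU encoder, and frozen-encoder assumption are ours; see the authors' repository for the actual architecture):

```python
import torch
import torch.nn as nn

class SAERewardHead(nn.Module):
    def __init__(self, d_model: int = 4096, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)     # pretrained SAE encoder, frozen
        self.head = nn.Linear(d_features, 1, bias=False)  # scalar reward from features
        self.encoder.requires_grad_(False)

    def forward(self, hidden: torch.Tensor):
        feats = torch.relu(self.encoder(hidden))  # sparse, monosemantic activations
        reward = self.head(feats).squeeze(-1)
        return reward, feats                      # feats enable feature-level attribution

head = SAERewardHead()
reward, feats = head(torch.randn(2, 4096))        # e.g. last-token RM hidden states
top_features = feats.topk(5, dim=-1).indices      # most active, interpretable features
```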

[464] To Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA

Shugang Hao, Hongbo Li, Lingjie Duan

Main category: cs.LG

TL;DR: The paper proposes an LLM transformer-based in-context learning (ICL) approach to optimize WiFi 7 channel access, addressing throughput issues in dynamic environments.

DetailsMotivation: Existing model-based backoff strategies perform poorly due to inaccurate node density estimation, leading to throughput loss.

Method: A transformer-based ICL optimizer is designed to predict contention window thresholds (CWT) using pre-collected collision data and a query case. An efficient training algorithm ensures near-optimal predictions, even with imperfect data.

Result: The approach achieves minimal prediction deviations and near-optimal throughput, outperforming model-based and DRL-based methods in NS-3 experiments.

Conclusion: The transformer-based ICL optimizer effectively addresses dynamic channel challenges, offering fast convergence and superior throughput performance.

Abstract: The binary exponential backoff scheme is widely used in WiFi 7 and still incurs poor throughput performance under dynamic channel environments. Recent model-based approaches (e.g., non-persistent and $p$-persistent CSMA) simply optimize backoff strategies under a known and fixed node density, still leading to a large throughput loss due to inaccurate node density estimation. This paper is the first to propose an LLM transformer-based in-context learning (ICL) theory for optimizing channel access. We design a transformer-based ICL optimizer that pre-collects collision-threshold data examples and a query collision case. These are assembled into a prompt that serves as the input for the transformer to learn the pattern, which then generates a predicted contention window threshold (CWT). To train the transformer for effective ICL, we develop an efficient algorithm and guarantee a near-optimal CWT prediction within limited training steps. As it may be hard to gather perfect data examples for ICL in practice, we further extend the approach to allow erroneous data input in the prompt. We prove that our optimizer maintains minimal prediction and throughput deviations from the optimal values. Experimental results on NS-3 further demonstrate our approach’s fast convergence and near-optimal throughput over existing model-based and DRL-based approaches under unknown node densities.
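
To make the prompt construction concrete, here is a toy encoding of collision-threshold examples plus a query case as a token sequence (the two-column [value, type-flag] layout is purely our assumption; the paper's prompt format may differ):

```python
import numpy as np

def build_icl_prompt(examples, query_collision):
    """examples: list of (collision_prob, optimal_cwt) pairs observed offline.
    Returns a (2k+1, 2) float array of [value, type-flag] tokens; the transformer
    is expected to complete the missing CWT for the final query observation."""
    tokens = []
    for collision, cwt in examples:
        tokens.append([collision, 0.0])  # flag 0: collision observation
        tokens.append([cwt, 1.0])        # flag 1: contention-window threshold
    tokens.append([query_collision, 0.0])
    return np.asarray(tokens, dtype=np.float32)

prompt = build_icl_prompt([(0.42, 64.0), (0.18, 16.0), (0.31, 32.0)], query_collision=0.25)
print(prompt.shape)  # (7, 2)
```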

[465] EvaDrive: Evolutionary Adversarial Policy Optimization for End-to-End Autonomous Driving

Siwen Jiao, Kangan Qian, Hao Ye, Yang Zhong, Ziang Luo, Sicong Jiang, Zilin Huang, Yangyi Fang, Jinyu Miao, Zheng Fu, Yunlong Wang, Kun Jiang, Diange Yang, Rui Fan, Baoyun Peng

Main category: cs.LG

TL;DR: EvaDrive introduces a multi-objective reinforcement learning framework for autonomous driving, enabling iterative trajectory refinement via adversarial optimization, outperforming existing methods.

DetailsMotivation: Current methods isolate trajectory generation from evaluation or collapse preferences into scalar rewards, limiting iterative refinement and obscuring trade-offs.

Method: EvaDrive uses adversarial optimization with a hierarchical generator (autoregressive intent modeling + diffusion-based refinement) and a multi-objective critic, guided by Pareto frontier selection.

Result: Achieves 94.9 PDMS on NAVSIM v1 and 64.96 Driving Score on Bench2Drive, surpassing competitors.

Conclusion: EvaDrive offers a scalarization-free, human-like iterative decision-making framework for autonomous driving.

Abstract: Autonomous driving faces significant challenges in achieving human-like iterative decision-making, which continuously generates, evaluates, and refines trajectory proposals. Current generation-evaluation frameworks isolate trajectory generation from quality assessment, preventing iterative refinement essential for planning, while reinforcement learning methods collapse multi-dimensional preferences into scalar rewards, obscuring critical trade-offs and yielding scalarization bias. To overcome these issues, we present EvaDrive, a novel multi-objective reinforcement learning framework that establishes genuine closed-loop co-evolution between trajectory generation and evaluation via adversarial optimization. EvaDrive frames trajectory planning as a multi-round adversarial game. In this game, a hierarchical generator continuously proposes candidate paths by combining autoregressive intent modeling for temporal causality with diffusion-based refinement for spatial flexibility. These proposals are then rigorously assessed by a trainable multi-objective critic that explicitly preserves diverse preference structures without collapsing them into a single scalarization bias. This adversarial interplay, guided by a Pareto frontier selection mechanism, enables iterative multi-round refinement, effectively escaping local optima while preserving trajectory diversity. Extensive experiments on NAVSIM and Bench2Drive benchmarks demonstrate SOTA performance, achieving 94.9 PDMS on NAVSIM v1 (surpassing DiffusionDrive by 6.8, DriveSuprim by 5.0, and TrajHF by 0.9) and 64.96 Driving Score on Bench2Drive. EvaDrive generates diverse driving styles via dynamic weighting without external preference data, introducing a closed-loop adversarial framework for human-like iterative decision-making, offering a novel scalarization-free trajectory optimization approach.

[466] Decentralized Weather Forecasting via Distributed Machine Learning and Blockchain-Based Model Validation

Rilwan Umar, Aydin Abadi, Basil Aldali, Benito Vincent, Elliot A. J. Hurley, Hotoon Aljazaeri, Jamie Hedley-Cook, Jamie-Lee Bell, Lambert Uwuigbusun, Mujeeb Ahmed, Shishir Nagaraja, Suleiman Sabo, Weaam Alrbeiqi

Main category: cs.LG

TL;DR: A decentralized weather forecasting framework combining Federated Learning and blockchain improves accuracy, resilience, and scalability while ensuring privacy and security.

DetailsMotivation: Current centralized forecasting systems face security vulnerabilities, scalability issues, and single points of failure.

Method: Integrates Federated Learning (FL) for privacy-preserving collaborative training and blockchain for transparent verification. Uses a reputation-based voting mechanism and IPFS for off-chain storage.

Result: Improves forecasting accuracy, system resilience, and scalability.

Conclusion: The proposed framework is viable for real-world, security-critical environments.

Abstract: Weather forecasting plays a vital role in disaster preparedness, agriculture, and resource management, yet current centralized forecasting systems are increasingly strained by security vulnerabilities, limited scalability, and susceptibility to single points of failure. To address these challenges, we propose a decentralized weather forecasting framework that integrates Federated Learning (FL) with blockchain technology. FL enables collaborative model training without exposing sensitive local data; this approach enhances privacy and reduces data transfer overhead. Meanwhile, the Ethereum blockchain ensures transparent and dependable verification of model updates. To further enhance the system’s security, we introduce a reputation-based voting mechanism that assesses the trustworthiness of submitted models while utilizing the Interplanetary File System (IPFS) for efficient off-chain storage. Experimental results demonstrate that our approach not only improves forecasting accuracy but also enhances system resilience and scalability, making it a viable candidate for deployment in real-world, security-critical environments.
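
A toy sketch of the aggregation step under reputation-based voting (the linear weighting is an assumption; the paper's mechanism also involves on-chain verification and IPFS storage, which are omitted here):

```python
import numpy as np

def reputation_weighted_average(updates, reputations):
    """updates: list of flattened model-parameter vectors from participating nodes.
    reputations: nonnegative trust scores produced by the voting mechanism."""
    w = np.asarray(reputations, dtype=float)
    w = w / w.sum()                               # normalize trust into mixing weights
    return np.tensordot(w, np.stack(updates), axes=1)

nodes = [np.random.randn(10) for _ in range(4)]   # toy local updates
global_update = reputation_weighted_average(nodes, reputations=[0.9, 0.7, 0.2, 0.8])
```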

cs.MA

[467] REALISM: A Regulatory Framework for Coordinated Scheduling in Multi-Operator Shared Micromobility Services

Heng Tan, Hua Yan, Yukun Yuan, Guang Wang, Yu Yang

Main category: cs.MA

TL;DR: REALISM is a regulatory framework for multi-operator shared micromobility services, using Shapley value to assign scores based on city goals and operator contributions, improving equity and demand satisfaction.

DetailsMotivation: Shared micromobility's growth causes social problems like road overload and inequity, exacerbated by non-cooperative multi-operator systems. Existing intrusive regulatory frameworks are impractical.

Method: Design REALISM with fairness-aware score assignment using Shapley value, optimizing via an alternating procedure between operators and regulators.

Result: Achieves 39.93% gain in equity and 1.82% in demand satisfaction using Chicago e-scooter data.

Conclusion: REALISM effectively balances city goals and operator interests, improving shared micromobility regulation.

Abstract: Shared micromobility (e.g., shared bikes and electric scooters) is an emerging form of urban transportation that has become increasingly popular worldwide. However, the blooming of shared micromobility vehicles brings social problems to cities (e.g., overloaded vehicles on roads and inequity in vehicle deployment), which deviate from the city regulator’s expectations for the shared micromobility system. In addition, the multi-operator shared micromobility system in a city complicates the problem because the operators pursue their own interests non-cooperatively. Existing regulatory frameworks for multi-operator vehicle rebalancing generally assume intrusive control of vehicle rebalancing across all operators, which is not practical in the real world. To address this limitation, we design REALISM, a regulatory framework for coordinated scheduling in multi-operator shared micromobility services that incorporates the city regulator’s regulations in the form of assigning a score to each operator according to the city goal achievements and each operator’s individual contribution to achieving the city goal, measured by the Shapley value. To realize fairness-aware score assignment, we measure the fairness of assigned scores and use it as one of the components to optimize the score assignment model. To optimize the whole framework, we develop an alternating procedure that lets operators and the city regulator interact with each other until convergence. We evaluate our framework based on real-world e-scooter usage data in Chicago. Our experiment results show that our method achieves a performance gain of at least 39.93% in the equity of vehicle usage and 1.82% in the average demand satisfaction of the whole city.
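
For a handful of operators, each operator's contribution to the city goal can be computed as an exact Shapley value; a sketch follows (the toy goal function is ours; the paper measures real goal achievements):

```python
import math
from itertools import permutations

def shapley_scores(operators, value):
    """Exact Shapley value of each operator; `value` maps a frozenset of
    operators to the achieved city-goal level (a float). Exponential in the
    number of operators, so only suitable for small operator sets."""
    phi = dict.fromkeys(operators, 0.0)
    for order in permutations(operators):
        coalition, prev = set(), value(frozenset())
        for op in order:
            coalition.add(op)
            cur = value(frozenset(coalition))
            phi[op] += cur - prev                 # marginal contribution in this order
            prev = cur
    n_fact = math.factorial(len(operators))
    return {op: v / n_fact for op, v in phi.items()}

# Toy goal: diminishing returns as more operators serve the city.
goal = lambda s: 1.0 - 0.5 ** len(s)
print(shapley_scores(["A", "B", "C"], goal))
```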

[468] Smooth Games of Configuration in the Linear-Quadratic Setting

Jesse Milzman, Jeffrey Mao, Giuseppe Loianno

Main category: cs.MA

TL;DR: The paper introduces a ‘game of configuration’ framework for strategic fine-tuning in differential games, providing a method to compute and optimize configuration parameters in multi-agent scenarios.

DetailsMotivation: Existing literature lacks exploration of optimal configuration in dynamic games from a strategic perspective where agents' choices influence each other.

Method: A two-stage game framework is proposed: first, players choose configuration parameters; second, these parameters impact dynamics and costs. Subgame perfect solutions and gradient-based methods are introduced.

Result: The approach is demonstrated in example affine-quadratic systems, showing effectiveness in both zero-sum and general-sum scenarios.

Conclusion: The framework successfully addresses strategic configuration in dynamic games, offering practical solutions for multi-agent interactions.

Abstract: Dynamic game theory offers a toolbox for formalizing and solving for both cooperative and non-cooperative strategies in multi-agent scenarios. However, the optimal configuration of such games remains largely unexplored. While there is existing literature on the parametrization of dynamic games, little research examines this parametrization from a strategic perspective where each agent’s configuration choice is influenced by the decisions of others. In this work, we introduce the concept of a game of configuration, providing a framework for the strategic fine-tuning of differential games. We define a game of configuration as a two-stage game within the setting of finite-horizon affine-quadratic (AQ) differential games. In the first stage, each player chooses their corresponding configuration parameter, which will impact their dynamics and costs in the second stage. We provide the subgame perfect solution concept and a method for computing first-stage cost gradients over the configuration space. This then allows us to formulate a gradient-based method for searching for local solutions to the configuration game, as well as provide necessary conditions for equilibrium configurations over their downstream (second stage) trajectories. We conclude by demonstrating the effectiveness of our approach in example AQ systems, both zero-sum and general-sum.
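
In symbols, and purely as a sketch of the two-stage structure described above (the specific parameter dependencies are our notation, not the paper's):

```latex
% Stage 1: each player i commits to a configuration parameter \theta_i.
% Stage 2: given \theta = (\theta_1,\dots,\theta_N), players solve the AQ game
\dot{x}(t) = A(\theta)\,x(t) + \textstyle\sum_{i} B_i(\theta_i)\,u_i(t) + c(\theta), \qquad
J_i(\theta,u) = \int_0^T \big( x^\top Q_i(\theta_i)\,x + u_i^\top R_i\,u_i \big)\,dt.
% Subgame perfection: u^*(\theta) is an equilibrium of the stage-2 game for every
% fixed \theta, and each \theta_i^* locally minimizes J_i(\theta, u^*(\theta)),
% found by descending \nabla_{\theta_i} J_i along the downstream trajectories.
```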

cs.MM

[469] Ensembling Synchronisation-based and Face-Voice Association Paradigms for Robust Active Speaker Detection in Egocentric Recordings

Jason Clarke, Yoshihiko Gotoh, Stefan Goetze

Main category: cs.MM

TL;DR: A novel ensemble approach combines synchronisation-dependent and synchronisation-agnostic models for robust audiovisual active speaker detection in egocentric recordings, achieving improved performance.

DetailsMotivation: Traditional methods fail under challenging conditions like occlusions and audio interference in egocentric recordings, necessitating a hybrid solution.

Method: The proposed method fuses outputs of synchronisation-based and face-voice association models via weighted averaging, with a refined preprocessing pipeline for the latter.

Result: The ensemble achieves 70.2% and 66.7% mAP on the Ego4D-AVD validation set with TalkNet and Light-ASD backbones, respectively.

Conclusion: The ensemble leverages complementary strengths of both models, demonstrating effectiveness in diverse conditions.

Abstract: Audiovisual active speaker detection (ASD) in egocentric recordings is challenged by frequent occlusions, motion blur, and audio interference, which undermine the discernability of temporal synchrony between lip movement and speech. Traditional synchronisation-based systems perform well under clean conditions but degrade sharply in first-person recordings. Conversely, face-voice association (FVA)-based methods forgo synchronisation modelling in favour of cross-modal biometric matching, exhibiting robustness to transient visual corruption but suffering when overlapping speech or front-end segmentation errors occur. In this paper, a simple yet effective ensemble approach is proposed to fuse synchronisation-dependent and synchronisation-agnostic model outputs via weighted averaging, thereby harnessing complementary cues without introducing complex fusion architectures. A refined preprocessing pipeline for the FVA-based component is also introduced to optimise ensemble integration. Experiments on the Ego4D-AVD validation set demonstrate that the ensemble attains 70.2% and 66.7% mean Average Precision (mAP) with TalkNet and Light-ASD backbones, respectively. A qualitative analysis stratified by face image quality and utterance masking prevalence further substantiates the complementary strengths of each component.
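
The fusion itself is a convex combination of the two systems' scores; a sketch with the weight chosen on a validation split follows (the weight grid and the binary average-precision proxy are our assumptions):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def fuse_scores(sync_scores, fva_scores, labels, grid=np.linspace(0.0, 1.0, 21)):
    """Weighted averaging of synchronisation-based and FVA-based outputs."""
    best_a = max(
        grid,
        key=lambda a: average_precision_score(
            labels, a * sync_scores + (1 - a) * fva_scores),
    )
    return best_a, best_a * sync_scores + (1 - best_a) * fva_scores

labels = np.random.randint(0, 2, 500)            # synthetic validation data
sync = labels + 0.8 * np.random.randn(500)
fva = labels + 1.0 * np.random.randn(500)
alpha, fused = fuse_scores(sync, fva, labels)
```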

[470] MMRAG-DocQA: A Multi-Modal Retrieval-Augmented Generation Method for Document Question-Answering with Hierarchical Index and Multi-Granularity Retrieval

Ziyu Gong, Yihua Huang, Chengcheng Mai

Main category: cs.MM

TL;DR: MMRAG-DocQA is a novel multi-modal RAG model addressing hallucinations and disconnection in long-context document QA by leveraging hierarchical indexing and multi-granularity retrieval.

DetailsMotivation: Existing LVLM-based methods suffer from hallucinations, while RAG-based methods struggle with inter-modal disconnection and cross-page fragmentation.

Method: Proposes MMRAG-DocQA with hierarchical indexing (flattened in-page and topological cross-page chunks) and multi-granularity semantic retrieval (page-level and document-level).

Result: Outperforms on MMLongBench-Doc and LongDocURL datasets, excelling in modality-rich, multi-page document QA.

Conclusion: MMRAG-DocQA effectively integrates multi-modal evidence across long-range pages, improving accuracy in document QA.

Abstract: The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidences (such as texts, tables, charts, images, and layouts) distributed across multiple pages, for question understanding and answer generation. The existing methods can be categorized into Large Vision-Language Model (LVLM)-based and Retrieval-Augmented Generation (RAG)-based methods. However, the former were susceptible to hallucinations, while the latter struggled for inter-modal disconnection and cross-page fragmentation. To address these challenges, a novel multi-modal RAG model, named MMRAG-DocQA, was proposed, leveraging both textual and visual information across long-range pages to facilitate accurate question answering. A hierarchical indexing method with the integration of flattened in-page chunks and topological cross-page chunks was designed to jointly establish in-page multi-modal associations and long-distance cross-page dependencies. By means of joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method, including the page-level parent page retrieval and document-level summary retrieval, was proposed to foster multi-modal evidence connection and long-distance evidence integration and reasoning. Experimental results performed on public datasets, MMLongBench-Doc and LongDocURL, demonstrated the superiority of our MMRAG-DocQA method in understanding and answering modality-rich and multi-page documents.

eess.AS

[471] Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children’s Speech

Abhijit Sinha, Harishankar Kumar, Mohit Joshi, Hemant Kumar Kathania, Shrikanth Narayanan, Sudarsana Reddy Kadiri

Main category: eess.AS

TL;DR: The paper analyzes how Wav2Vec2 variants encode age and gender traits in children’s speech, finding early layers (1-7) are more effective for speaker-specific cues, while deeper layers focus on linguistic information. PCA improves classification, with top models achieving high accuracy.

DetailsMotivation: Children's speech variability complicates age and gender classification, and SSL models' effectiveness for this task is underexplored.

Method: Layer-wise analysis of four Wav2Vec2 variants using PFSTAR and CMU Kids datasets, with PCA applied to improve classification.

Result: Wav2Vec2-large-lv60 achieves 97.14% (age) and 98.20% (gender) on CMU Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. Early layers (1-7) outperform deeper layers for speaker traits.

Conclusion: The study reveals how speaker traits are structured across SSL model depth, supporting adaptive strategies for child-aware speech interfaces.

Abstract: Children’s speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These results reveal how speaker traits are structured across SSL model depth and support more targeted, adaptive strategies for child-aware speech interfaces.
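
A sketch of the layer-wise probing recipe (the checkpoint, layer index, PCA size, and mean-pooling are illustrative choices, not the paper's exact configuration):

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-100h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-100h").eval()

@torch.no_grad()
def layer_embedding(waveform, layer: int = 5):
    """Mean-pooled utterance representation from a chosen encoder layer."""
    inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    hidden = model(**inputs, output_hidden_states=True).hidden_states
    return hidden[layer].mean(dim=1).squeeze(0).numpy()   # (768,) for the base model

emb = layer_embedding(np.random.randn(16000))             # stand-in for one utterance
# With utterance embeddings X and age/gender labels y:
# clf = LogisticRegression(max_iter=1000).fit(PCA(n_components=100).fit_transform(X), y)
```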

[472] Towards Frame-level Quality Predictions of Synthetic Speech

Michael Kuhlmann, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach

Main category: eess.AS

TL;DR: The paper explores the feasibility of frame-level automatic speech quality assessment, addressing issues in existing predictors and proposing criteria and a chunk-based method to improve localization performance.

DetailsMotivation: To enhance explainability in speech synthesis quality assessment by enabling frame-level predictions, addressing gaps in current methods.

Method: Identifies issues in existing predictors, defines criteria for frame-level prediction, suggests chunk-based processing, and tests predictors with artificial distortions.

Result: Frame-level quality predictors outperform human annotations in detecting localized distortions.

Conclusion: Frame-level assessment is viable and can improve explainability and accuracy in speech quality evaluation.

Abstract: While automatic subjective speech quality assessment has witnessed much progress, an open question is whether an automatic quality assessment at frame resolution is possible. This would be highly desirable, as it adds explainability to the assessment of speech synthesis systems. Here, we take first steps towards this goal by identifying issues of existing quality predictors that prevent sensible frame-level prediction. Further, we define criteria that a frame-level predictor should fulfill. We also suggest a chunk-based processing that avoids the impact of a localized distortion on the score of neighboring frames. Finally, we measure in experiments with localized artificial distortions the localization performance of a set of frame-level quality predictors and show that they can outperform detection performance of human annotations obtained from a crowd-sourced perception experiment.

[473] Exploring Cross-Utterance Speech Contexts for Conformer-Transducer Speech Recognition Systems

Mingyu Cui, Mengzhe Geng, Jiajun Deng, Chengxi Deng, Jiawen Kang, Shujie Hu, Guinan Li, Tianzi Wang, Zhaoqing Li, Xie Chen, Xunying Liu

Main category: eess.AS

TL;DR: The paper explores four cross-utterance speech context modeling methods for C-T ASR systems, proposing an efficient batch-training scheme. Experiments show significant WER/CER reductions, outperforming baselines and competing models.

DetailsMotivation: To improve ASR performance by incorporating cross-utterance speech contexts into C-T models, addressing synchronization overhead and sequential order preservation.

Method: Four modeling approaches: feature concatenation, embedding concatenation, embedding pooling projection, and a novel chunk-based method. An efficient batch-training scheme is introduced.

Result: Best systems achieved significant WER/CER reductions (up to 0.9%-1.1% absolute) across four datasets, outperforming baselines and competing models like Wav2vec2.0-Conformer.

Conclusion: Incorporating cross-utterance contexts enhances ASR performance, demonstrating potential for integration into speech foundation models.

Abstract: This paper investigates four types of cross-utterance speech contexts modeling approaches for streaming and non-streaming Conformer-Transformer (C-T) ASR systems: i) input audio feature concatenation; ii) cross-utterance Encoder embedding concatenation; iii) cross-utterance Encoder embedding pooling projection; or iv) a novel chunk-based approach applied to C-T models for the first time. An efficient batch-training scheme is proposed for contextual C-Ts that uses spliced speech utterances within each minibatch to minimize the synchronization overhead while preserving the sequential order of cross-utterance speech contexts. Experiments are conducted on four benchmark speech datasets across three languages: the English GigaSpeech and Mandarin Wenetspeech corpora used in contextual C-T models pre-training; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets used in domain fine-tuning. The best performing contextual C-T systems consistently outperform their respective baselines using no cross-utterance speech contexts in pre-training and fine-tuning stages with statistically significant average word error rate (WER) or character error rate (CER) reductions up to 0.9%, 1.1%, 0.51%, and 0.98% absolute (6.0%, 5.4%, 2.0%, and 3.4% relative) on the four tasks respectively. Their performance competitiveness against Wav2vec2.0-Conformer, XLSR-128, and Whisper models highlights the potential benefit of incorporating cross-utterance speech contexts into current speech foundation models.

[474] Evaluation of Speech Foundation Models for ASR on Child-Adult Conversations in Autism Diagnostic Sessions

Aditya Ashvin, Rimita Lahiri, Aditya Kommineni, Somer Bishop, Catherine Lord, Sudarsana Reddy Kadiri, Shrikanth Narayanan

Main category: eess.AS

TL;DR: The paper evaluates ASR performance on child-adult interactions in autism diagnostics, finding a 15-20% WER drop for child speech. Fine-tuning Whisper-large with LoRA improves WER by 8-13%.

DetailsMotivation: To address the underexplored performance of speech foundation models in child-adult conversational settings, crucial for diagnosing developmental disorders like Autism.

Method: Comprehensive evaluation of ASR models (Whisper, Wav2Vec2, HuBERT, WavLM) on child-adult interactions, followed by fine-tuning Whisper-large with LoRA.

Result: Speech foundation models show a 15-20% WER drop for child speech. Fine-tuning improves WER by 8% (child) and 13% (adult).

Conclusion: Fine-tuning foundation models like Whisper-large can significantly improve ASR performance for child-adult interactions in clinical settings.

Abstract: Reliable transcription of child-adult conversations in clinical settings is crucial for diagnosing developmental disorders like Autism. Recent advances in deep learning and availability of large scale transcribed data has led to development of speech foundation models that have shown dramatic improvements in ASR performance. However, their performance on conversational child-adult interactions remains underexplored. In this work, we provide a comprehensive evaluation of ASR performance on a dataset containing child-adult interactions from autism diagnostic sessions, using Whisper, Wav2Vec2, HuBERT, and WavLM. We find that speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting. Then, we fine-tune the best-performing zero-shot model (Whisper-large) using LoRA in a low-resource setting, yielding 8% and 13% absolute WER improvements for child and adult speech, respectively.
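
A sketch of the LoRA setup with Hugging Face peft (the checkpoint name, rank, alpha, and target modules are typical defaults we assume, not the paper's reported hyperparameters):

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

lora_cfg = LoraConfig(
    r=8,                                   # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in Whisper
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # only a small fraction is trainable
# Fine-tune with the usual seq2seq trainer on the child-adult transcripts.
```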

[475] Navigating PESQ: Up-to-Date Versions and Open Implementations

Matteo Torcoli, Mhd Modar Halimeh, Emanuël A. P. Habets

Main category: eess.AS

TL;DR: This paper provides guidance on the different versions and implementations of PESQ, highlighting significant differences between versions and stressing the importance of specifying exact details for accurate comparisons. It also offers a repository implementing the latest corrections.

DetailsMotivation: The withdrawal of PESQ by ITU and the proliferation of versions and implementations have created confusion, especially for new users, necessitating clear guidance.

Method: The work reviews PESQ versions and implementations, analyzes their differences, and provides a repository with the latest corrections (Corrigendum 2).

Result: Differences between PESQ versions are significant, and specifying exact details is crucial for result interpretation and cross-study comparisons.

Conclusion: Clear documentation of PESQ versions and implementations, along with the provided repository, will improve consistency and comparability in speech quality evaluation.

Abstract: Perceptual Evaluation of Speech Quality (PESQ) is an objective quality measure that remains widely used despite its withdrawal by the International Telecommunication Union (ITU). PESQ has evolved over two decades, with multiple versions and publicly available implementations emerging during this time. Different versions and their updates can be overwhelming, especially for new PESQ users. This work provides practical guidance on the different versions and implementations of PESQ. We show that differences can be significant, especially between PESQ versions. We stress the importance of specifying the exact version and implementation that is used to compute PESQ, and possibly to detail how multi-channel signals are handled. These practices would facilitate the interpretation of results and allow comparisons of PESQ scores between different studies. We also provide a repository that implements the latest corrections to PESQ, i.e., Corrigendum 2, which is not implemented by any other openly available distribution: https://github.com/audiolabs/PESQ.
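
For reference, this is how PESQ is typically computed with the widely used PyPI package (a different implementation from the authors' repository above, which adds the Corrigendum 2 fixes); the signals here are random stand-ins for real speech:

```python
import numpy as np
from pesq import pesq  # pip install pesq

fs = 16000                               # PESQ supports 8 kHz (nb) and 16 kHz (wb)
ref = np.random.randn(fs * 3).astype(np.float32)
deg = ref + 0.01 * np.random.randn(fs * 3).astype(np.float32)

score = pesq(fs, ref, deg, "wb")         # wide-band PESQ (ITU-T P.862.2)
print(f"PESQ: {score:.2f}")              # report package name and version with the score
```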

[476] LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness

Zongli Ye, Jiachen Lian, Akshaj Gupta, Xuanru Zhou, Haodong Li, Krish Patel, Hwi Joo Park, Dingkun Zhou, Chenxu Guo, Shuhe Li, Sam Wang, Iris Zhou, Cheol Jun Cho, Zoe Ezzes, Jet M. J. Vonk, Brittany T. Morin, Rian Bogley, Lisa Wauters, Zachary A. Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli

Main category: eess.AS

TL;DR: LCS-CTC improves phoneme-level speech recognition by combining similarity-aware local alignment with constrained CTC training, outperforming vanilla CTC.

DetailsMotivation: CTC often underperforms in unclear or non-fluent speech, necessitating a more robust method for phoneme-level recognition.

Method: LCS-CTC uses a two-stage framework: predicting frame-phoneme cost matrices and applying a modified LCS algorithm to constrain CTC decoding paths.

Result: LCS-CTC outperforms vanilla CTC on LibriSpeech and PPA datasets, showing better generalization and robustness.

Conclusion: LCS-CTC unifies phoneme modeling for fluent and non-fluent speech, offering improved recognition and alignment.

Abstract: Phonetic speech transcription is crucial for fine-grained linguistic analysis and downstream speech applications. While Connectionist Temporal Classification (CTC) is a widely used approach for such tasks due to its efficiency, it often falls short in recognition performance, especially under unclear and nonfluent speech. In this work, we propose LCS-CTC, a two-stage framework for phoneme-level speech recognition that combines a similarity-aware local alignment algorithm with a constrained CTC training objective. By predicting fine-grained frame-phoneme cost matrices and applying a modified Longest Common Subsequence (LCS) algorithm, our method identifies high-confidence alignment zones which are used to constrain the CTC decoding path space, thereby reducing overfitting and improving generalization ability, which enables both robust recognition and text-free forced alignment. Experiments on both LibriSpeech and PPA demonstrate that LCS-CTC consistently outperforms vanilla CTC baselines, suggesting its potential to unify phoneme modeling across fluent and non-fluent speech.
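
A sketch of the alignment stage: an LCS-style dynamic program over the frame-phoneme cost matrix that keeps only high-confidence matches (the match threshold is an assumption; the paper's modified LCS may differ in detail):

```python
import numpy as np

def lcs_align(cost: np.ndarray, thr: float = 0.5):
    """cost: (frames, phonemes) matrix; (t, p) counts as a match when cost < thr.
    Returns the longest monotonic chain of matched (frame, phoneme) pairs."""
    T, P = cost.shape
    dp = np.zeros((T + 1, P + 1), dtype=int)
    for t in range(1, T + 1):
        for p in range(1, P + 1):
            if cost[t - 1, p - 1] < thr:
                dp[t, p] = dp[t - 1, p - 1] + 1
            else:
                dp[t, p] = max(dp[t - 1, p], dp[t, p - 1])
    pairs, t, p = [], T, P                 # backtrace the matched zone
    while t > 0 and p > 0:
        if cost[t - 1, p - 1] < thr and dp[t, p] == dp[t - 1, p - 1] + 1:
            pairs.append((t - 1, p - 1)); t -= 1; p -= 1
        elif dp[t - 1, p] >= dp[t, p - 1]:
            t -= 1
        else:
            p -= 1
    return pairs[::-1]                     # high-confidence zones constrain CTC paths

pairs = lcs_align(np.random.rand(20, 6))
```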

[477] Speech Enhancement based on cascaded two flow

Seonggyu Lee, Sein Cheong, Sangwook Han, Kihyuk Kim, Jong Won Shin

Main category: eess.AS

TL;DR: A method using flow matching for speech enhancement (SE) achieves competitive performance with fewer function evaluations (NFE) by cascading generative models without needing a separate predictive model.

DetailsMotivation: To improve SE efficiency by reducing NFEs while maintaining performance, avoiding the need for additional predictive models.

Method: Uses flow matching for SE and generating enhanced speech as an initial point and conditioning variable, cascading two generative models.

Result: Achieves equivalent or better performance than baselines with the same or fewer NFEs.

Conclusion: The proposed method is efficient and effective, eliminating the need for separate predictive models while maintaining performance.

Abstract: Speech enhancement (SE) based on diffusion probabilistic models has exhibited impressive performance, while requiring a relatively high number of function evaluations (NFE). Recently, SE based on flow matching has been proposed, which showed competitive performance with a small NFE. Early approaches adopted the noisy speech as the only conditioning variable. Other approaches utilize speech enhanced by a predictive model both as an additional conditioning variable and to sample an initial value, but they require a separate predictive model on top of the generative SE model. In this work, we propose to employ an identical flow-matching model both for SE and for generating the enhanced speech that serves as the initial starting point and conditioning variable. Experimental results showed that the proposed method required the same or fewer NFEs, even with two cascaded generative stages, while achieving performance equivalent to or better than the previous baselines.

eess.IV

[478] Explainable AI Technique in Lung Cancer Detection Using Convolutional Neural Networks

Nishan Rai, Sujan Khatri, Devendra Risal

Main category: eess.IV

TL;DR: A deep learning framework for automated lung cancer screening using CT images, evaluated with custom and transfer learning CNNs, achieving high accuracy and explainability.

DetailsMotivation: Early detection of lung cancer is crucial for improving survival outcomes, necessitating automated and interpretable screening methods.

Method: Uses a custom CNN and fine-tuned transfer learning models (DenseNet121, ResNet152, VGG19) with cost-sensitive learning on the IQ-OTH/NCCD dataset. Evaluated via accuracy, precision, recall, F1-score, and ROC-AUC, with SHAP for explainability.

Result: ResNet152 achieved the highest accuracy (97.3%), while DenseNet121 balanced precision, recall, and F1 best (up to 92%, 90%, 91%). SHAP improved interpretability.

Conclusion: CNN-based approaches with explainability offer fast, accurate, and interpretable lung cancer screening, especially in resource-limited settings.

Abstract: Early detection of lung cancer is critical to improving survival outcomes. We present a deep learning framework for automated lung cancer screening from chest computed tomography (CT) images with integrated explainability. Using the IQ-OTH/NCCD dataset (1,197 scans across Normal, Benign, and Malignant classes), we evaluate a custom convolutional neural network (CNN) and three fine-tuned transfer learning backbones: DenseNet121, ResNet152, and VGG19. Models are trained with cost-sensitive learning to mitigate class imbalance and evaluated via accuracy, precision, recall, F1-score, and ROC-AUC. While ResNet152 achieved the highest accuracy (97.3%), DenseNet121 provided the best overall balance of precision, recall, and F1 (up to 92%, 90%, and 91%, respectively). We further apply Shapley Additive Explanations (SHAP) to visualize the evidence contributing to predictions, improving clinical transparency. Results indicate that CNN-based approaches augmented with explainability can provide fast, accurate, and interpretable support for lung cancer screening, particularly in resource-limited settings.
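
A sketch of generating SHAP attributions for a trained CNN classifier (the network, image sizes, and background batch below are placeholders for illustration; the paper applies this to CT scans):

```python
import torch
import shap  # pip install shap

# Stand-in network for illustration only; in practice, a trained DenseNet/ResNet.
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3), torch.nn.ReLU(), torch.nn.Flatten(),
    torch.nn.Linear(8 * 62 * 62, 3),   # 64x64 input, 3x3 'valid' conv -> 62x62
)
background = torch.randn(16, 1, 64, 64)  # reference distribution for the explainer
x = torch.randn(4, 1, 64, 64)            # scans to explain

explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(x)   # per-class attribution maps
# (exact output shape/format depends on the shap version)
```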

[479] Data-Efficient Learning for Generalizable Surgical Video Understanding

Sahar Nasirihaghighi

Main category: eess.IV

TL;DR: This doctoral research advances surgical video analysis by addressing annotation scarcity, spatiotemporal complexity, and domain gaps. It benchmarks neural networks, proposes novel architectures, and develops semi-supervised frameworks to improve performance with minimal labeled data. Two large datasets are released to support reproducibility.

DetailsMotivation: To bridge the gap between deep learning-based surgical video analysis in research and real-world clinical deployment, addressing challenges like annotation scarcity and domain gaps.

Method: Benchmarked state-of-the-art neural networks, proposed novel architectures, and developed semi-supervised frameworks (DIST, SemiVT-Surge, ENCORE) leveraging unlabeled data. Released datasets GynSurg and Cataract-1K.

Result: Achieved state-of-the-art results on surgical datasets by reducing reliance on labeled data and enhancing model training.

Conclusion: The work contributes to robust, data-efficient, and clinically scalable solutions for surgical video analysis, paving the way for impactful AI systems in surgical care and training.

Abstract: Advances in surgical video analysis are transforming operating rooms into intelligent, data-driven environments. Computer-assisted systems support the full surgical workflow, from preoperative planning to intraoperative guidance and postoperative assessment. However, developing robust and generalizable models for surgical video understanding remains challenging due to (I) annotation scarcity, (II) spatiotemporal complexity, and (III) the domain gap across procedures and institutions. This doctoral research aims to bridge the gap between deep learning-based surgical video analysis in research and its real-world clinical deployment. To address the core challenge of recognizing surgical phases, actions, and events, all critical for analysis, I benchmarked state-of-the-art neural network architectures to identify the most effective designs for each task. I further improved performance by proposing novel architectures and integrating advanced modules. Given the high cost of expert annotations and the domain gap across surgical video sources, I focused on reducing reliance on labeled data. We developed novel semi-supervised frameworks, including DIST, SemiVT-Surge, and ENCORE, that improve model performance across tasks by leveraging large amounts of unlabeled surgical video, achieving state-of-the-art results on challenging surgical datasets with minimal labeled data and dynamic pseudo-labeling. To support reproducibility and advance the field, we released two multi-task datasets: GynSurg, the largest gynecologic laparoscopy dataset, and Cataract-1K, the largest cataract surgery video dataset. Together, this work contributes to robust, data-efficient, and clinically scalable solutions for surgical video analysis, laying the foundation for generalizable AI systems that can meaningfully impact surgical care and training.

[480] DIVA-VQA: Detecting Inter-frame Variations in UGC Video Quality

Xinyi Wang, Angeliki Katsenou, David Bull

Main category: eess.IV

TL;DR: A novel NR-VQA model using spatio-temporal fragmentation and inter-frame variations achieves top performance on UGC datasets with low runtime complexity.

DetailsMotivation: The rise of UGC demands efficient NR-VQA for quality monitoring without pristine references.

Method: The model analyzes quality-sensitive regions at multiple levels (frames, patches, fragmented frames) using 2D/3D features from inter-frame variations.

Result: Ranked top 2 on five UGC datasets (DIVA-VQA-L: 0.898, DIVA-VQA-B: 0.886) with low runtime complexity.

Conclusion: The proposed NR-VQA model is effective and efficient for large-scale video quality assessment in UGC applications.

Abstract: The rapid growth of user-generated (video) content (UGC) has driven increased demand for research on no-reference (NR) perceptual video quality assessment (VQA). NR-VQA is a key component for large-scale video quality monitoring in social media and streaming applications where a pristine reference is not available. This paper proposes a novel NR-VQA model based on spatio-temporal fragmentation driven by inter-frame variations. By leveraging these inter-frame differences, the model progressively analyses quality-sensitive regions at multiple levels: frames, patches, and fragmented frames. It integrates frames, fragmented residuals, and fragmented frames aligned with residuals to effectively capture global and local information. The model extracts both 2D and 3D features in order to characterize these spatio-temporal variations. Experiments conducted on five UGC datasets and against state-of-the-art models ranked our proposed method among the top 2 in terms of average rank correlation (DIVA-VQA-L: 0.898 and DIVA-VQA-B: 0.886). The improved performance is offered at a low runtime complexity, with DIVA-VQA-B ranked top and DIVA-VQA-L third on average compared to the fastest existing NR-VQA method. Code and models are publicly available at: https://github.com/xinyiW915/DIVA-VQA.
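
A rough sketch of the residual-driven fragmentation (the patch size and top-k are assumptions; the paper's multi-level integration of frames, residuals, and aligned fragments is not reproduced here):

```python
import numpy as np

def fragment_indices(frames: np.ndarray, patch: int = 32, k: int = 16):
    """frames: (T, H, W) grayscale video. For each consecutive frame pair, rank
    patches by residual variance and keep the k most quality-sensitive ones."""
    residuals = np.abs(np.diff(frames.astype(np.float32), axis=0))
    _, H, W = residuals.shape
    Hc, Wc = H // patch * patch, W // patch * patch      # crop to whole patches
    selected = []
    for r in residuals:
        tiles = r[:Hc, :Wc].reshape(Hc // patch, patch, Wc // patch, patch)
        energy = tiles.swapaxes(1, 2).reshape(-1, patch, patch).var(axis=(1, 2))
        selected.append(np.argsort(energy)[-k:])          # row-major patch indices
    return selected

frags = fragment_indices(np.random.rand(10, 256, 256))
```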

[481] DINOMotion: advanced robust tissue motion tracking with DINOv2 in 2D-Cine MRI-guided radiotherapy

Soorena Salari, Catherine Spino, Laurie-Anne Pharand, Fabienne Lathuiliere, Hassan Rivaz, Silvain Beriault, Yiming Xiao

Main category: eess.IV

TL;DR: DINOMotion, a deep learning framework using DINOv2 and LoRA layers, improves motion tracking in 2D-Cine MRI-guided radiotherapy by offering robustness, efficiency, and interpretability.

DetailsMotivation: Existing methods for tissue motion tracking in 2D-Cine MRI-guided radiotherapy struggle with large misalignments and lack interpretability.

Method: DINOMotion employs DINOv2 with LoRA layers for efficient training and robust feature representation, detecting landmarks for interpretable image registration.

Result: Achieves high Dice scores (92.07% kidney, 90.90% liver, 95.23% lung) and low Hausdorff distances (5.47 mm, 8.31 mm, 6.72 mm), processing scans in ~30 ms.

Conclusion: DINOMotion is a promising, efficient, and interpretable solution for real-time motion tracking in 2D-Cine MRI-guided radiotherapy.

Abstract: Accurate tissue motion tracking is critical for ensuring treatment accuracy and safety in 2D-Cine MRI-guided radiotherapy. This is typically achieved by registration of sequential images, but existing methods often face challenges with large misalignments and lack of interpretability. In this paper, we introduce DINOMotion, a novel deep learning framework based on DINOv2 with Low-Rank Adaptation (LoRA) layers for robust, efficient, and interpretable motion tracking. DINOMotion automatically detects corresponding landmarks to derive optimal image registration, enhancing interpretability by providing explicit visual correspondences between sequential images. The integration of LoRA layers reduces the number of trainable parameters, improving training efficiency, while DINOv2’s powerful feature representations offer robustness against large misalignments. Unlike iterative optimization-based methods, DINOMotion directly computes image registration at test time. Our experiments on volunteer and patient datasets demonstrate its effectiveness in estimating both linear and nonlinear transformations, achieving Dice scores of 92.07% for the kidney, 90.90% for the liver, and 95.23% for the lung, with corresponding Hausdorff distances of 5.47 mm, 8.31 mm, and 6.72 mm, respectively. DINOMotion processes each scan in approximately 30 ms and consistently outperforms state-of-the-art methods, particularly in handling large misalignments. These results highlight its potential as a robust and interpretable solution for real-time motion tracking in 2D-Cine MRI-guided radiotherapy.
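
Two building blocks of this design can be sketched compactly: a LoRA adapter wrapped around a frozen linear layer (the paper attaches such adapters to DINOv2), and dense feature matching to obtain landmark correspondences. Both modules below are simplified assumptions rather than the authors' code.

```python
# Sketch: LoRA-wrapped linear layer + nearest-neighbor feature matching.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank update B @ A, scaled.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def match_landmarks(feat_a, feat_b):
    """feat_*: (N, D) patch features from two frames; returns nearest matches."""
    a = nn.functional.normalize(feat_a, dim=1)
    b = nn.functional.normalize(feat_b, dim=1)
    sim = a @ b.T                               # cosine similarity matrix
    return sim.argmax(dim=1)                    # index in frame B for each patch in A
```

The matched patch pairs give explicit visual correspondences, from which a linear or nonlinear transform can be fitted directly at test time.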

[482] Efficient Image Denoising Using Global and Local Circulant Representation

Zhaoming Kong, Jiahuan Zhang, Xiaowei Yang

Main category: eess.IV

TL;DR: Haar-tSVD is a simple, efficient denoising algorithm leveraging nonlocal self-similarity and PCA-Haar transform connections, offering a balance between speed and performance.

DetailsMotivation: High demand for efficient image denoising due to increasing imaging data and devices.

Method: Uses tensor-SVD projection with Haar transform to capture global/local patch correlations, eliminating the need for learning local bases. Includes adaptive noise estimation via CNN and eigenvalue analysis.

Result: Validated on real-world tasks, Haar-tSVD efficiently removes noise while preserving details.

Conclusion: Haar-tSVD is robust, adaptable, and effective for denoising, with publicly available resources.

Abstract: The advancement of imaging devices and the vast amount of image data generated every day impose an increasingly high demand for efficient and effective image denoising. In this paper, we present a computationally simple denoising algorithm, termed Haar-tSVD, aiming to exploit the nonlocal self-similarity prior and leverage the connection between principal component analysis (PCA) and the Haar transform under circulant representation. We show that global and local patch correlations can be effectively captured through a unified tensor-singular value decomposition (t-SVD) projection with the Haar transform. This results in a one-step, highly parallelizable filtering method that eliminates the need for learning local bases to represent image patches, striking a balance between denoising speed and performance. Furthermore, we introduce an adaptive noise estimation scheme based on a CNN estimator and eigenvalue analysis to enhance the robustness and adaptability of the proposed method. Experiments on different real-world denoising tasks validate the efficiency and effectiveness of Haar-tSVD for noise removal and detail preservation. Datasets, code and results are publicly available at https://github.com/ZhaomingKong/Haar-tSVD.
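
The core transform-domain filtering step can be sketched as hard-thresholding a stack of similar patches in an orthonormal Haar basis, replacing the learned PCA basis of classical nonlocal denoisers with a fixed transform. The grouping and threshold rule below are simplified assumptions relative to the full t-SVD formulation.

```python
# Sketch: Haar-domain hard thresholding of a group of similar patches.
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar matrix of size n x n (n must be a power of two)."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                 # averaging rows
    bot = np.kron(np.eye(n // 2), [1.0, -1.0])   # differencing rows
    return np.vstack([top, bot]) / np.sqrt(2.0)

def denoise_patch_stack(stack, sigma, k=2.7):
    """stack: (n, d) matrix of n similar flattened patches; sigma: noise level."""
    H = haar_matrix(stack.shape[0])
    coeffs = H @ stack                           # transform across the group axis
    coeffs[np.abs(coeffs) < k * sigma] = 0.0     # hard-threshold small coefficients
    return H.T @ coeffs                          # inverse of an orthonormal transform
```

Because the Haar basis is fixed, the filtering is one-step and embarrassingly parallel across patch groups, which is where the speed/performance balance comes from.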

[483] Cross-view Generalized Diffusion Model for Sparse-view CT Reconstruction

Jixiang Chen, Yiqun Lin, Yi Qin, Hualiang Wang, Xiaomeng Li

Main category: eess.IV

TL;DR: CvG-Diff improves sparse-view CT reconstruction by modeling deterministic degradation and introducing innovations like EPCT and SPDPS, achieving high-quality results in minimal steps.

DetailsMotivation: Conventional and deep-learning-based CT reconstruction methods struggle with artifacts and inefficiency, especially in highly sparse regimes.

Method: CvG-Diff reformulates reconstruction as a generalized diffusion process, using EPCT for error suppression and SPDPS for adaptive sampling.

Result: Achieves 38.34 dB PSNR and 0.9518 SSIM for 18-view CT in just 10 steps, outperforming state-of-the-art methods.

Conclusion: CvG-Diff offers a stable, efficient solution for high-quality sparse-view CT reconstruction.

Abstract: Sparse-view computed tomography (CT) reduces radiation exposure by subsampling projection views, but conventional reconstruction methods produce severe streak artifacts with undersampled data. While deep-learning-based methods enable single-step artifact suppression, they often produce over-smoothed results under significant sparsity. Though diffusion models improve reconstruction via iterative refinement and generative priors, they require hundreds of sampling steps and struggle with stability in highly sparse regimes. To tackle these concerns, we present the Cross-view Generalized Diffusion Model (CvG-Diff), which reformulates sparse-view CT reconstruction as a generalized diffusion process. Unlike existing diffusion approaches that rely on stochastic Gaussian degradation, CvG-Diff explicitly models image-domain artifacts caused by angular subsampling as a deterministic degradation operator, leveraging correlations across sparse-view CT at different sampling rates. To address the inherent artifact propagation and inefficiency of sequential sampling in generalized diffusion models, we introduce two innovations: Error-Propagating Composite Training (EPCT), which facilitates identifying error-prone regions and suppresses propagated artifacts, and Semantic-Prioritized Dual-Phase Sampling (SPDPS), an adaptive strategy that prioritizes semantic correctness before detail refinement. Together, these innovations enable CvG-Diff to achieve high-quality reconstructions with minimal iterations, achieving 38.34 dB PSNR and 0.9518 SSIM for 18-view CT using only 10 steps on the AAPM-LDCT dataset. Extensive experiments demonstrate the superiority of CvG-Diff over state-of-the-art sparse-view CT reconstruction methods. The code is available at https://github.com/xmed-lab/CvG-Diff.
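
The generalized ("cold") diffusion sampling that CvG-Diff builds on can be sketched as repeatedly predicting the clean image and re-degrading it to the next, milder level. The `degrade` operator (which would simulate sparse-view artifacts, with level 0 being the identity) and the `restorer` network below are assumptions; EPCT and SPDPS refine this basic loop.

```python
# Sketch: sampling loop for a generalized diffusion model with a
# deterministic degradation operator.
import torch

@torch.no_grad()
def cold_diffusion_sample(restorer, degrade, x_T, T=10):
    x = x_T                                  # most degraded input (e.g. 18-view FBP)
    for t in range(T, 0, -1):
        x0_hat = restorer(x, t)              # predict the artifact-free image
        # Remove the current degradation estimate, re-apply the next milder
        # level; degrade(x, 0) is assumed to be the identity.
        x = x - degrade(x0_hat, t) + degrade(x0_hat, t - 1)
    return x
```

Because the degradation is deterministic rather than Gaussian, each step is a structured artifact-removal update, which is why only about 10 steps suffice here.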

[484] When Experts Disagree: Characterizing Annotator Variability for Vessel Segmentation in DSA Images

M. Geshvadi, G. So, D. D. Chlorogiannis, C. Galvin, E. Torio, A. Azimi, Y. Tachie-Baffour, N. Haouchine, A. Golby, M. Vangel, W. M. Wells, Y. Epelboym, R. Du, F. Durupinar, S. Frisken

Main category: eess.IV

TL;DR: The paper analyzes variability in 2D DSA cranial blood vessel segmentations by multiple annotators to quantify uncertainty and guide future annotations and automatic methods.

DetailsMotivation: To understand and quantify segmentation uncertainty in cranial blood vessel annotations, which can improve annotation processes and automatic segmentation.

Method: Analysis of variability among segmentations performed by multiple annotators on 2D DSA images.

Result: Characterization and quantification of segmentation uncertainty.

Conclusion: The findings can guide additional annotations and develop uncertainty-aware automatic segmentation methods.

Abstract: We analyze the variability among segmentations of cranial blood vessels in 2D DSA performed by multiple annotators in order to characterize and quantify segmentation uncertainty, and we discuss how this analysis can be used to guide additional annotations and to develop uncertainty-aware automatic segmentation methods.
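
One simple way to quantify such inter-annotator variability, sketched below under the assumption of binary masks, is a per-pixel vote fraction and its entropy; the paper's actual analysis may use different statistics.

```python
# Sketch: per-pixel agreement and uncertainty maps from K annotators.
import numpy as np

def annotation_uncertainty(masks):
    """masks: (K, H, W) binary segmentations from K annotators."""
    p = masks.mean(axis=0)                       # per-pixel vote fraction in [0, 1]
    eps = 1e-12                                  # avoid log(0)
    # Binary entropy: 0 where annotators agree, 1 bit where they split 50/50.
    entropy = -(p * np.log2(p + eps) + (1 - p) * np.log2(1 - p + eps))
    return p, entropy                            # agreement map, uncertainty map
```

High-entropy regions (typically thin or low-contrast vessels) are natural candidates for additional annotation effort or for soft labels in uncertainty-aware training.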

[485] INSIGHT: Explainable Weakly-Supervised Medical Image Analysis

Wenbo Zhang, Junyu Chen, Christopher Kanan

Main category: eess.IV

TL;DR: INSIGHT is a weakly-supervised aggregator for volumetric scans and WSIs, integrating heatmap generation to improve localization of small, crucial details while achieving state-of-the-art classification and segmentation results.

DetailsMotivation: Current methods for processing large volumetric scans and WSIs rely on post-hoc visualization and often miss small, clinically important details.

Method: INSIGHT uses pre-trained feature maps, a detection module with small kernels for fine details, and a context module to suppress false positives, generating an internal heatmap.

Result: INSIGHT achieves top classification and segmentation performance on CT and WSI benchmarks.

Conclusion: INSIGHT effectively localizes diagnostically relevant regions and outperforms existing methods, with code and project details available online.

Abstract: Due to their large sizes, volumetric scans and whole-slide pathology images (WSIs) are often processed by extracting embeddings from local regions, after which an aggregator makes predictions from this set. However, current methods require post-hoc visualization techniques (e.g., Grad-CAM) and often fail to localize small yet clinically crucial details. To address these limitations, we introduce INSIGHT, a novel weakly-supervised aggregator that integrates heatmap generation as an inductive bias. Starting from pre-trained feature maps, INSIGHT employs a detection module with small convolutional kernels to capture fine details and a context module with a broader receptive field to suppress local false positives. The resulting internal heatmap highlights diagnostically relevant regions. On CT and WSI benchmarks, INSIGHT achieves state-of-the-art classification results and high weakly-labeled semantic segmentation performance. Project website and code are available at: https://zhangdylan83.github.io/ewsmia/
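
The detection-plus-context design can be sketched as two convolutional branches whose product forms the internal heatmap, which is then pooled into a scan-level prediction. Channel counts, kernel sizes, and the gating scheme below are illustrative assumptions, not the INSIGHT architecture itself.

```python
# Sketch: a heatmap-producing weakly supervised aggregator with a
# fine-detail detection branch gated by a broad-context branch.
import torch
import torch.nn as nn

class HeatmapAggregator(nn.Module):
    def __init__(self, in_ch=768, n_classes=2):
        super().__init__()
        # Small kernel: responds to fine, localized details.
        self.detect = nn.Conv2d(in_ch, n_classes, kernel_size=1)
        # Dilated convs give a broad receptive field to veto local false positives.
        self.context = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=5, padding=6, dilation=3),
            nn.ReLU(),
            nn.Conv2d(in_ch, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feats):                   # feats: (B, C, H, W) patch features
        heat = self.detect(feats) * self.context(feats)  # gated class heatmap
        logits = heat.flatten(2).mean(dim=2)    # mean-pool heatmap -> class logits
        return logits, heat                     # the heatmap doubles as the explanation
```

Because the heatmap sits inside the forward pass rather than being derived post hoc, the same training signal that drives classification also shapes the localization.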

[486] EvRWKV: A Continuous Interactive RWKV Framework for Effective Event-Guided Low-Light Image Enhancement

Wenjie Cai, Qingguo Meng, Zhenyu Wang, Xingbo Dong, Zhe Jin

Main category: eess.IV

TL;DR: EvRWKV is a novel framework for low-light image enhancement using event cameras, combining Cross-RWKV and EISFE modules for noise suppression and detail restoration, achieving state-of-the-art results.

DetailsMotivation: Traditional low-light enhancement methods amplify noise or lose structural detail; event cameras offer high dynamic range, but existing fusion techniques fail to exploit them effectively.

Method: EvRWKV uses dual-domain processing: Cross-RWKV for temporal and cross-modal fusion, and EISFE for adaptive noise suppression and spatial alignment.

Result: The framework outperforms existing methods on real-world datasets, enhancing image quality by reducing noise and restoring details.

Conclusion: EvRWKV effectively addresses low-light imaging challenges by leveraging event cameras and continuous cross-modal interaction, setting a new benchmark.

Abstract: Capturing high-quality visual content under low-light conditions remains a challenging problem due to severe noise and underexposure, which degrade the performance of downstream applications. Traditional frame-based low-light image enhancement methods often amplify noise or fail to preserve structural details. Event cameras, offering high dynamic range and microsecond temporal resolution by asynchronously capturing brightness changes, emerge as a promising complement for low-light imaging. However, existing fusion methods fail to fully exploit this synergy, either by forcing modalities into a shared representation too early or by losing vital low-level correlations through isolated processing. To address these challenges, we propose EvRWKV, a novel framework that enables continuous cross-modal interaction through dual-domain processing. Our approach incorporates a Cross-RWKV module, leveraging the Receptance Weighted Key Value (RWKV) architecture for fine-grained temporal and cross-modal fusion, and an Event Image Spectral Fusion Enhancer (EISFE) module, which jointly performs adaptive frequency-domain noise suppression and spatial-domain deformable convolution alignment. This continuous interaction maintains feature consistency from low-level textures to high-level semantics. Extensive qualitative and quantitative evaluations on real-world low-light datasets (SDE, SDSD, RELED) demonstrate that EvRWKV achieves state-of-the-art performance, effectively enhancing image quality by suppressing noise, restoring structural details, and improving visual clarity in challenging low-light conditions.
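
The frequency-domain half of EISFE can be sketched as a learned gate that mixes the spectra of image and event features; the gating scheme below is an assumption, and the real module additionally performs deformable spatial alignment in the spatial domain.

```python
# Sketch: learned frequency-domain fusion of image and event feature maps.
import torch
import torch.nn as nn

class SpectralFusion(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Per-frequency gate computed from the real/imaginary spectra of
        # both modalities, stacked along the channel axis.
        self.gate = nn.Conv2d(4 * ch, 2 * ch, kernel_size=1)

    def forward(self, img_feat, evt_feat):      # both: (B, C, H, W)
        fi = torch.fft.rfft2(img_feat)          # complex spectra
        fe = torch.fft.rfft2(evt_feat)
        stacked = torch.cat([fi.real, fi.imag, fe.real, fe.imag], dim=1)
        g = torch.sigmoid(self.gate(stacked))   # (B, 2C, H, W//2 + 1)
        gr, gi = g.chunk(2, dim=1)
        # Convex per-frequency blend of the two modalities.
        fused = torch.complex(gr * fi.real + (1 - gr) * fe.real,
                              gi * fi.imag + (1 - gi) * fe.imag)
        return torch.fft.irfft2(fused, s=img_feat.shape[-2:])
```

Gating in the frequency domain lets the module suppress noise-dominated bands of the image features while borrowing high-frequency structure from the event stream.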
