Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 81]
- cs.CV [Total: 115]
- cs.AI [Total: 49]
- cs.SD [Total: 8]
- cs.LG [Total: 144]
- cs.MA [Total: 2]
- cs.MM [Total: 1]
- eess.AS [Total: 4]
- eess.IV [Total: 13]
cs.CL
[1] Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling
Hyunji Lee, Wenhao Yu, Hongming Zhang, Kaixin Ma, Jiyeon Kim, Dong Yu, Minjoon Seo
Main category: cs.CL
TL;DR: Analysis of hybrid SSM-attention architectures reveals sequential hybrids excel at short contexts while parallel hybrids work better for long contexts. A data-centric approach using paraphrase-augmented training enhances recall capabilities.
Details
Motivation: To better understand architectural design choices in hybrid models combining state space models (SSMs) with attention mechanisms, and to improve their effectiveness through systematic analysis.Method: Analyzed sequential vs parallel integration of SSM and attention layers, and introduced a data-centric approach using continual training on paraphrase-augmented datasets.
Result: Sequential hybrids perform better on shorter contexts, parallel hybrids are more effective for longer contexts. Paraphrase-augmented training enhances recall while preserving other capabilities and outperforms architectural modifications.
Conclusion: Provides deeper understanding of hybrid SSM-attention models and practical guidance for designing architectures tailored to different use cases.
Abstract: Hybrid models that combine state space models (SSMs) with attention mechanisms have shown strong performance by leveraging the efficiency of SSMs and the high recall ability of attention. However, the architectural design choices behind these hybrid models remain insufficiently understood. In this work, we analyze hybrid architectures through the lens of memory utilization and overall performance, and propose a complementary method to further enhance their effectiveness. We first examine the distinction between sequential and parallel integration of SSM and attention layers. Our analysis reveals several interesting findings, including that sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. We also introduce a data-centric approach of continually training on datasets augmented with paraphrases, which further enhances recall while preserving other capabilities. It generalizes well across different base models and outperforms architectural modifications aimed at enhancing recall. Our findings provide a deeper understanding of hybrid SSM-attention models and offer practical guidance for designing architectures tailored to various use cases. Our findings provide a deeper understanding of hybrid SSM-attention models and offer practical guidance for designing architectures tailored to various use cases.
[2] Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence
Lívia Dutra, Arthur Lorenzi, Laís Berno, Franciany Campos, Karoline Biscardi, Kenneth Brown, Marcelo Viridiano, Frederico Belcavello, Ely Matos, Olívia Guaranha, Erik Santos, Sofia Reinach, Tiago Timponi Torrent
Main category: cs.CL
TL;DR: A semantic frame-based methodology for identifying gender-based violence reports in e-medical records using fine-grained patterns, achieving 0.726 precision on Brazilian Portuguese data.
Details
Motivation: To address underreporting of gender-based violence in healthcare records by developing an automated detection system that can identify notifiable events from unstructured text data.Method: Uses semantic frames to define 8 fine-grained patterns, searches them in 21 million sentences from e-SUS APS e-medical records in Brazilian Portuguese, with manual evaluation by linguists.
Result: The methodology effectively identifies reports of violence with 0.726 precision, demonstrating robustness in detecting GBV cases from unstructured medical text.
Conclusion: The approach provides a transparent, efficient, low-carbon, language-agnostic pipeline that can be adapted to other health surveillance contexts, enabling ethical and explainable NLP use in public health systems.
Abstract: We introduce a methodology for the identification of notifiable events in the domain of healthcare. The methodology harnesses semantic frames to define fine-grained patterns and search them in unstructured data, namely, open-text fields in e-medical records. We apply the methodology to the problem of underreporting of gender-based violence (GBV) in e-medical records produced during patients’ visits to primary care units. A total of eight patterns are defined and searched on a corpus of 21 million sentences in Brazilian Portuguese extracted from e-SUS APS. The results are manually evaluated by linguists and the precision of each pattern measured. Our findings reveal that the methodology effectively identifies reports of violence with a precision of 0.726, confirming its robustness. Designed as a transparent, efficient, low-carbon, and language-agnostic pipeline, the approach can be easily adapted to other health surveillance contexts, contributing to the broader, ethical, and explainable use of NLP in public health systems.
[3] Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations
Jean-Philippe Corbeil, Asma Ben Abacha, Jerome Tremblay, Phillip Swazinna, Akila Jeeson Daniel, Miguel Del-Agua, Francois Beaulieu
Main category: cs.CL
TL;DR: MEDIQA-OE 2025 is the first shared task focused on extracting medical orders from doctor-patient conversations to reduce clinician documentation burden and improve patient care.
Details
Motivation: Current clinical documentation uses speech recognition and summarization, but converting conversations into actionable medical orders for Electronic Health Records remains unexplored, presenting an opportunity to significantly reduce clinician workload.Method: Six teams participated using various approaches including both closed- and open-weight large language models (LLMs) to extract medical orders from doctor-patient conversations.
Result: The paper presents the MEDIQA-OE task, dataset, final leaderboard ranking, and participants’ solutions from the shared task.
Conclusion: The MEDIQA-OE 2025 shared task successfully established the first benchmark for extracting medical orders from conversations, demonstrating the potential of LLMs to address this important clinical documentation challenge.
Abstract: Clinical documentation increasingly uses automatic speech recognition and summarization, yet converting conversations into actionable medical orders for Electronic Health Records remains unexplored. A solution to this problem can significantly reduce the documentation burden of clinicians and directly impact downstream patient care. We introduce the MEDIQA-OE 2025 shared task, the first challenge on extracting medical orders from doctor-patient conversations. Six teams participated in the shared task and experimented with a broad range of approaches, and both closed- and open-weight large language models (LLMs). In this paper, we describe the MEDIQA-OE task, dataset, final leaderboard ranking, and participants’ solutions.
[4] Semantically-Aware LLM Agent to Enhance Privacy in Conversational AI Services
Jayden Serenari, Stephen Lee
Main category: cs.CL
TL;DR: LOPSIDED framework protects user privacy in LLM conversations by dynamically replacing sensitive PII with pseudonyms while preserving semantic context, then depseudonymizing responses.
Details
Motivation: Growing concern over privacy leaks when users share sensitive personal data with conversational AI systems, as exposed PII can lead to security breaches or identity theft.Method: Dynamic replacement of sensitive PII entities in user prompts with semantically consistent pseudonyms, preserving contextual integrity, followed by automatic depseudonymization of model responses.
Result: LOPSIDED reduces semantic utility errors by a factor of 5 compared to baseline techniques while enhancing privacy protection.
Conclusion: The framework successfully safeguards sensitive PII data in LLM interactions without degrading response quality, maintaining both privacy and semantic integrity.
Abstract: With the increasing use of conversational AI systems, there is growing concern over privacy leaks, especially when users share sensitive personal data in interactions with Large Language Models (LLMs). Conversations shared with these models may contain Personally Identifiable Information (PII), which, if exposed, could lead to security breaches or identity theft. To address this challenge, we present the Local Optimizations for Pseudonymization with Semantic Integrity Directed Entity Detection (LOPSIDED) framework, a semantically-aware privacy agent designed to safeguard sensitive PII data when using remote LLMs. Unlike prior work that often degrade response quality, our approach dynamically replaces sensitive PII entities in user prompts with semantically consistent pseudonyms, preserving the contextual integrity of conversations. Once the model generates its response, the pseudonyms are automatically depseudonymized, ensuring the user receives an accurate, privacy-preserving output. We evaluate our approach using real-world conversations sourced from ShareGPT, which we further augment and annotate to assess whether named entities are contextually relevant to the model’s response. Our results show that LOPSIDED reduces semantic utility errors by a factor of 5 compared to baseline techniques, all while enhancing privacy.
[5] Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral
Ayoub Hammal, Pierre Zweigenbaum, Caio Corro
Main category: cs.CL
TL;DR: Proposes a proxy-based test-time alignment method using guidance from small aligned models to reduce computational costs of aligning large language models.
Details
Motivation: Large language models require expensive alignment procedures after pre-training, and these costs increase prohibitively as models scale up in size.Method: Token-specific cascading approach with deferral rules reduced to 0-1 knapsack problem, using primal and dual approximations for optimal deferral decisions.
Result: Experimental results show benefits in both task performance and speculative decoding speed.
Conclusion: Proxy-based test-time alignment using small aligned models provides an effective way to reduce computational costs while maintaining performance.
Abstract: Several previous works concluded that the largest part of generation capabilities of large language models (LLM) are learned (early) during pre-training. However, LLMs still require further alignment to adhere to downstream task requirements and stylistic preferences, among other desired properties. As LLMs continue to scale in terms of size, the computational cost of alignment procedures increase prohibitively. In this work, we propose a novel approach to circumvent these costs via proxy-based test-time alignment, i.e. using guidance from a small aligned model. Our approach can be described as token-specific cascading method, where the token-specific deferral rule is reduced to 0-1 knapsack problem. In this setting, we derive primal and dual approximations of the optimal deferral decision. We experimentally show the benefits of our method both in task performance and speculative decoding speed.
[6] Elastic Architecture Search for Efficient Language Models
Shang Wang
Main category: cs.CL
TL;DR: ELM is a neural architecture search method that creates compact language models through flexible search spaces and knowledge distillation, outperforming existing methods on language tasks.
Details
Motivation: Address computational and memory concerns of large pre-trained language models by developing more efficient and compact alternatives.Method: Extends NAS with flexible search space using efficient transformer blocks, dynamic modules for dimension/head adjustment, and novel knowledge distillation losses.
Result: ELM-discovered models significantly outperform existing methods on masked language modeling and causal language modeling tasks.
Conclusion: ELM provides an effective approach for developing compact yet high-performing language models through improved architecture search and distillation techniques.
Abstract: As large pre-trained language models become increasingly critical to natural language understanding (NLU) tasks, their substantial computational and memory requirements have raised significant economic and environmental concerns. Addressing these challenges, this paper introduces the Elastic Language Model (ELM), a novel neural architecture search (NAS) method optimized for compact language models. ELM extends existing NAS approaches by introducing a flexible search space with efficient transformer blocks and dynamic modules for dimension and head number adjustment. These innovations enhance the efficiency and flexibility of the search process, which facilitates more thorough and effective exploration of model architectures. We also introduce novel knowledge distillation losses that preserve the unique characteristics of each block, in order to improve the discrimination between architectural choices during the search process. Experiments on masked language modeling and causal language modeling tasks demonstrate that models discovered by ELM significantly outperform existing methods.
[7] Dataset Creation and Baseline Models for Sexism Detection in Hausa
Fatima Adam Muhammad, Shamsuddeen Muhammad Hassan, Isa Inuwa-Dutse
Main category: cs.CL
TL;DR: This study introduces the first Hausa sexism detection dataset using community engagement and data augmentation, and evaluates machine learning models for detecting sexism in this low-resource language.
Details
Motivation: Sexism detection is well-developed in high-resource languages but limited in low-resource languages like Hausa, where cultural differences affect how sexism is expressed and perceived.Method: Created the first Hausa sexism detection dataset through community engagement, qualitative coding, and data augmentation with native speakers (n=66). Evaluated traditional ML classifiers and pre-trained multilingual language models with few-shot learning.
Result: Found challenges in capturing cultural nuance, especially with clarification-seeking and idiomatic expressions, with many false positives in such cases.
Conclusion: The study highlights the difficulties in sexism detection for low-resource languages due to cultural and linguistic nuances, and demonstrates the need for culturally-aware approaches.
Abstract: Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Noting how online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages where limited linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models and evaluating the effectiveness few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.
[8] Quantitative Intertextuality from the Digital Humanities Perspective: A Survey
Siyu Duan
Main category: cs.CL
TL;DR: This paper provides a roadmap for quantitative intertextuality studies, reviewing data, methods, and applications across multiple languages and topics, from statistics to deep learning.
Details
Motivation: Advancements in natural language processing have enabled large-scale quantitative intertextuality research, building on literary theory foundations to study connections between texts.Method: The paper surveys quantitative methods for intertextuality studies, covering approaches from statistics to deep learning, using data from multiple languages and topics.
Result: The survey summarizes applications in humanities and social sciences research and associated platform tools, showing the evolution of intertextuality studies into the quantitative age.
Conclusion: Driven by computer technology advances, more precise, diverse, and large-scale intertext studies are anticipated, with intertextuality holding promise for broader interdisciplinary applications bridging AI and humanities.
Abstract: The connection between texts is referred to as intertextuality in literary theory, which served as an important theoretical basis in many digital humanities studies. Over the past decade, advancements in natural language processing have ushered intertextuality studies into the quantitative age. Large-scale intertextuality research based on cutting-edge methods has continuously emerged. This paper provides a roadmap for quantitative intertextuality studies, summarizing their data, methods, and applications. Drawing on data from multiple languages and topics, this survey reviews methods from statistics to deep learning. It also summarizes their applications in humanities and social sciences research and the associated platform tools. Driven by advances in computer technology, more precise, diverse, and large-scale intertext studies can be anticipated. Intertextuality holds promise for broader application in interdisciplinary research bridging AI and the humanities.
[9] Recursive numeral systems are highly regular and easy to process
Ponrawee Prasertsom, Andrea Silvi, Jennifer Culbertson, Moa Johansson, Devdatt Dubhashi, Kenny Smith
Main category: cs.CL
TL;DR: Recursive numeral systems optimize regularity and processing complexity rather than just lexicon size vs morphosyntactic complexity trade-off, with MDL-based measures better distinguishing natural vs unnatural systems.
Details
Motivation: Previous approaches failed to explain why only natural-language-like numeral systems optimize the trade-off between lexicon size and morphosyntactic complexity, relying on ad-hoc constraints to exclude unnatural systems.Method: Used Minimum Description Length (MDL) approach to measure regularity and processing complexity, analyzing how these factors distinguish natural recursive numeral systems from unnatural but mathematically possible ones.
Result: MDL-based measures of regularity and processing complexity successfully capture key differences between natural and unnatural numeral systems, showing that natural systems optimize these factors and that previous ad-hoc constraints naturally follow from regularity considerations.
Conclusion: Regularity across sets of forms is crucial for understanding language optimality, and MDL-based approaches provide better explanatory power for why natural numeral systems are preferred over mathematically possible alternatives.
Abstract: Previous work has argued that recursive numeral systems optimise the trade-off between lexicon size and average morphosyntatic complexity (Deni'c and Szymanik, 2024). However, showing that only natural-language-like systems optimise this tradeoff has proven elusive, and the existing solution has relied on ad-hoc constraints to rule out unnatural systems (Yang and Regier, 2025). Here, we argue that this issue arises because the proposed trade-off has neglected regularity, a crucial aspect of complexity central to human grammars in general. Drawing on the Minimum Description Length (MDL) approach, we propose that recursive numeral systems are better viewed as efficient with regard to their regularity and processing complexity. We show that our MDL-based measures of regularity and processing complexity better capture the key differences between attested, natural systems and unattested but possible ones, including “optimal” recursive numeral systems from previous work, and that the ad-hoc constraints from previous literature naturally follow from regularity. Our approach highlights the need to incorporate regularity across sets of forms in studies that attempt to measure and explain optimality in language.
[10] VISTA Score: Verification In Sequential Turn-based Assessment
Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White
Main category: cs.CL
TL;DR: VISTA is a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking, improving hallucination detection over existing methods.
Details
Motivation: Hallucination remains a major obstacle for deploying conversational AI in fact-critical settings, and existing metrics are limited for multi-turn dialogue evaluation.Method: Decomposes assistant turns into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements into four types.
Result: VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines across eight LLMs and four dialogue benchmarks, with human evaluation confirming improved annotator agreement.
Conclusion: By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
Abstract: Hallucination–defined here as generating statements unsupported or contradicted by available evidence or conversational context–remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA’s decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
[11] LLM-Centric RAG with Multi-Granular Indexing and Confidence Constraints
Xiaofan Guo, Yaxuan Luan, Yue Kang, Xiangchen Song, Jinxu Guo
Main category: cs.CL
TL;DR: A confidence control method for retrieval-augmented generation that combines multi-granularity memory indexing with uncertainty estimation to improve coverage, stability, and reliability in complex knowledge environments.
Details
Motivation: To address insufficient coverage, unstable results, and limited reliability in retrieval-augmented generation under complex knowledge environments.Method: Builds hierarchical memory structure with multi-granularity knowledge representations, enables dynamic indexing from local to global context, and introduces uncertainty estimation to filter low-confidence paths during generation. Uses optimization objective with generation loss, entropy constraints, and variance regularization.
Result: Achieves superior performance over existing models in QA accuracy, retrieval recall, ranking quality, and factual consistency across different scenarios. Demonstrates effectiveness in maintaining information coverage while suppressing noise and false content.
Conclusion: Provides a new technical pathway for retrieval-augmented generation and offers practical evidence for improving reliability and controllability of large models in complex contexts.
Abstract: This paper addresses the issues of insufficient coverage, unstable results, and limited reliability in retrieval-augmented generation under complex knowledge environments, and proposes a confidence control method that integrates multi-granularity memory indexing with uncertainty estimation. The method builds a hierarchical memory structure that divides knowledge representations into different levels of granularity, enabling dynamic indexing and retrieval from local details to global context, and thus establishing closer semantic connections between retrieval and generation. On this basis, an uncertainty estimation mechanism is introduced to explicitly constrain and filter low-confidence paths during the generation process, allowing the model to maintain information coverage while effectively suppressing noise and false content. The overall optimization objective consists of generation loss, entropy constraints, and variance regularization, forming a unified confidence control framework. In the experiments, comprehensive sensitivity tests and comparative analyses were designed, covering hyperparameters, environmental conditions, and data structures, to verify the stability and robustness of the proposed method across different scenarios. The results show that the method achieves superior performance over existing models in QA accuracy, retrieval recall, ranking quality, and factual consistency, demonstrating the effectiveness of combining multi-granularity indexing with confidence control. This study not only provides a new technical pathway for retrieval-augmented generation but also offers practical evidence for improving the reliability and controllability of large models in complex contexts.
[12] Detecting Data Contamination in LLMs via In-Context Learning
Michał Zawalski, Meriem Boubdir, Klaudia Bałazy, Besmira Nushi, Pablo Ribalta
Main category: cs.CL
TL;DR: CoDeC is a practical method to detect and quantify training data contamination in LLMs by measuring how in-context learning affects model performance, distinguishing memorized vs unseen data.
Details
Motivation: To address the problem of detecting whether models have been trained on specific datasets (contamination) when training data is undisclosed, which can invalidate benchmark evaluations.Method: Uses in-context learning effects - measures how adding in-context examples changes model confidence, where memorized data shows disrupted patterns vs unseen data.
Result: Produces interpretable contamination scores that clearly separate seen/unseen datasets, reveals strong memorization evidence in open-weight models with undisclosed training data.
Conclusion: CoDeC is simple, automated, model- and dataset-agnostic, making it easy to integrate with benchmark evaluations for contamination detection.
Abstract: We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
[13] Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models
Jiasen Zheng, Huajun Zhang, Xu Yan, Ran Hao, Chong Peng
Main category: cs.CL
TL;DR: Proposes a fine-tuning method combining contrastive distillation and noise-robust training to improve safety alignment and robustness in large language models.
Details
Motivation: Address limitations of large-scale language models in safety alignment and robustness, particularly their vulnerability to noisy inputs and distribution shifts.Method: Freezes backbone model and transfers teacher knowledge boundaries through distillation, while introducing noise perturbations and robust optimization constraints during training. Uses distillation loss, robustness loss, and regularization in unified optimization.
Result: Significantly outperforms existing baselines in knowledge transfer, robustness, and overall safety, achieving best performance across multiple key metrics.
Conclusion: Enriches theoretical system of parameter-efficient fine-tuning and provides new solution for building safer, more trustworthy alignment mechanisms.
Abstract: This paper addresses the limitations of large-scale language models in safety alignment and robustness by proposing a fine-tuning method that combines contrastive distillation with noise-robust training. The method freezes the backbone model and transfers the knowledge boundaries of the teacher model to the student model through distillation, thereby improving semantic consistency and alignment accuracy. At the same time, noise perturbations and robust optimization constraints are introduced during training to ensure that the model maintains stable predictive outputs under noisy and uncertain inputs. The overall framework consists of distillation loss, robustness loss, and a regularization term, forming a unified optimization objective that balances alignment ability with resistance to interference. To systematically validate its effectiveness, the study designs experiments from multiple perspectives, including distillation weight sensitivity, stability analysis under computation budgets and mixed-precision environments, and the impact of data noise and distribution shifts on model performance. Results show that the method significantly outperforms existing baselines in knowledge transfer, robustness, and overall safety, achieving the best performance across several key metrics. This work not only enriches the theoretical system of parameter-efficient fine-tuning but also provides a new solution for building safer and more trustworthy alignment mechanisms.
[14] Characterizing Selective Refusal Bias in Large Language Models
Adel Khorramrouz, Sharon Levy
Main category: cs.CL
TL;DR: LLM safety guardrails exhibit selective refusal bias across demographic groups, where models refuse harmful content generation for some groups but not others, creating new biases and safety vulnerabilities.
Details
Motivation: To investigate how safety guardrails in LLMs can introduce new biases by selectively refusing to generate harmful content for certain demographic groups while allowing it for others.Method: Analyzed refusal rates across targeted individual and intersectional demographic groups, examined types of LLM responses, and measured length of generated refusals. Also conducted indirect attacks targeting previously refused groups.
Result: Found evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. Identified safety vulnerabilities through indirect attacks on refused groups.
Conclusion: LLM safety guardrails need more equitable and robust performance across all demographic groups to prevent introducing new biases and maintain consistent safety standards.
Abstract: Safety guardrails in large language models(LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, these measures can inadvertently introduce or reflect new biases, as LLMs may refuse to generate harmful content targeting some demographic groups and not others. We explore this selective refusal bias in LLM guardrails through the lens of refusal rates of targeted individual and intersectional demographic groups, types of LLM responses, and length of generated refusals. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. This leads us to investigate additional safety implications via an indirect attack, where we target previously refused groups. Our findings emphasize the need for more equitable and robust performance in safety guardrails across demographic groups.
[15] Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks
Rajarshi Haldar, Julia Hockenmaier
Main category: cs.CL
TL;DR: LLM-based evaluation of NLG systems shows low intra-rater reliability, making scores inconsistent across runs, but judicious use with proper guidelines may still be beneficial.
Details
Motivation: As NLG adoption grows, proper assessment becomes crucial. LLM-based evaluation has gained popularity due to better alignment with human preferences than traditional metrics, but reliability issues need investigation.Method: Conducted experiments measuring intra-rater reliability of LLM judges across different NLG tasks and benchmarks, analyzing score variance across multiple runs.
Result: LLM judges exhibit low intra-rater reliability, with significant variance in assigned scores across different runs, making ratings inconsistent and sometimes arbitrary.
Conclusion: While LLM judges show reliability issues, careful application following proper guidelines may still make them useful for NLG evaluation despite the inconsistency problems.
Abstract: As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.
[16] Probability Distributions Computed by Hard-Attention Transformers
Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang
Main category: cs.CL
TL;DR: This paper analyzes the expressivity of transformer language models (which generate strings autoregressively) rather than transformer language recognizers (which accept/reject strings), showing that autoregression and probabilistic modeling can increase expressivity and break equivalences.
Details
Motivation: Most existing expressivity results for transformers treat them as language recognizers, but in practice they are used as language models for autoregressive and probabilistic string generation. The paper aims to characterize what probability distributions transformer language models can actually express.Method: The authors analyze the expressivity of transformer language models by examining how making transformer language recognizers autoregressive and probabilistic affects their capabilities, comparing expressivity between different transformer variants.
Result: The study shows that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case.
Conclusion: The paper provides a characterization of the probability distributions that transformer language models can express, teasing apart what functions transformers are capable of in their most common use-case as language models rather than as language recognizers.
Abstract: Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
[17] Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+
Mason Shipton, York Hay Ng, Aditya Khan, Phuong Hanh Hoang, Xiang Lu, A. Seza Doğruöz, En-Shiun Annie Lee
Main category: cs.CL
TL;DR: Extends URIEL+ linguistic knowledge base by adding script vectors, integrating Glottolog for expanded language coverage, and improving lineage imputation to reduce data sparsity for multilingual research.
Details
Motivation: Address data sparsity in URIEL+ (missing features, incomplete entries, limited genealogical coverage) that limits usefulness for cross-lingual transfer, especially for low-resource languages.Method: Three main contributions: 1) Introduce script vectors for 7,488 languages, 2) Integrate Glottolog to add 18,710 languages, 3) Expand lineage imputation for 26,449 languages by propagating typological and script features across genealogies.
Result: Reduced feature sparsity by 14% for script vectors, increased language coverage by up to 19,015 languages (1,007%), improved imputation quality metrics by up to 33%, and showed up to 6% performance gains in cross-lingual transfer tasks.
Conclusion: The extensions make URIEL+ more complete and inclusive for multilingual research, particularly benefiting low-resource languages.
Abstract: The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity remains prevalent, in the form of missing feature types, incomplete language entries, and limited genealogical coverage. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, this paper extends URIEL+ with three contributions: introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These additions reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and improve imputation quality metrics by up to 33%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups. Our advances make URIEL+ more complete and inclusive for multilingual research.
[18] MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models
Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Yayue Deng, Jing Ma
Main category: cs.CL
TL;DR: MemeArena is an agent-based arena-style evaluation framework that provides context-aware and unbiased assessment for multimodal LLMs’ understanding of multimodal harmfulness in memes.
Details
Motivation: Existing evaluation approaches focus only on binary classification accuracy, failing to capture interpretive nuance of harmfulness across diverse contexts in social media memes.Method: Simulates diverse interpretive contexts to formulate evaluation tasks, elicits perspective-specific analyses from mLLMs, and integrates varied viewpoints to reach consensus among evaluators.
Result: Effectively reduces evaluation biases of judge agents, with judgment results closely aligning with human preferences.
Conclusion: Provides valuable insights into reliable and comprehensive mLLM evaluations for multimodal harmfulness understanding.
Abstract: The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs’ abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.
[19] Identifying the Periodicity of Information in Natural Language
Yulin Ou, Yu Wang, Yang Xu, Hendrik Buschmeier
Main category: cs.CL
TL;DR: The paper introduces AutoPeriod of Surprisal (APS) to detect periodicity patterns in natural language information, finding that human language exhibits significant periodicity beyond typical structural units.
Details
Motivation: To investigate the degree of periodicity in natural language's encoded information, building on recent theoretical advances in information density.Method: Developed AutoPeriod of Surprisal (APS) using canonical periodicity detection algorithms to identify significant periods in surprisal sequences of single documents.
Result: Found that a considerable proportion of human language shows strong periodicity patterns, including new periods outside typical structural units like sentence boundaries, confirmed via harmonic regression modeling.
Conclusion: Periodicity in language information results from both structured factors and longer-distance driving factors; APS method shows potential for LLM-generation detection.
Abstract: Recent theoretical advancement of information density in natural language has brought the following question on desk: To what degree does natural language exhibit periodicity pattern in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and is able to identify any significant periods that exist in the surprisal sequence of a single document. By applying the algorithm to a set of corpora, we have obtained the following interesting results: Firstly, a considerable proportion of human language demonstrates a strong pattern of periodicity in information; Secondly, new periods that are outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units, etc.) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome from both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potentials in LLM-generation detection are further discussed.
[20] Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, J Ross Mitchell
Main category: cs.CL
TL;DR: BEAM benchmark addresses limitations in evaluating LLMs’ long-term memory with coherent, diverse conversations and probing questions. LIGHT framework enhances LLM performance using three complementary memory systems inspired by human cognition.
Details
Motivation: Existing benchmarks lack narrative coherence, cover narrow domains, and only test simple recall tasks, making them inadequate for evaluating LLMs' long-term memory abilities in conversational settings.Method: 1) Created BEAM benchmark with 100 conversations (up to 10M tokens) and 2,000 validated questions. 2) Proposed LIGHT framework with three memory systems: long-term episodic memory, short-term working memory, and scratchpad for accumulating facts.
Result: LLMs with 1M token context windows struggle as dialogues lengthen. LIGHT consistently improves performance across models, achieving 3.5%-12.69% average improvement over strongest baselines. Ablation study confirms each memory component’s contribution.
Conclusion: The proposed BEAM benchmark and LIGHT framework effectively address the challenges of evaluating and enhancing LLMs’ long-term memory capabilities in extended conversational contexts.
Abstract: Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT-a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
[21] Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Rajan Agarwal, Aarush Gupta
Main category: cs.CL
TL;DR: LLINK is a compute-efficient method that improves LLM performance on low-resource non-Latin scripts by aligning multilingual sentence embeddings to the decoder’s latent space using a lightweight contrastive projector, without changing tokenizers or retraining the decoder.
Details
Motivation: Instruction-tuned LLMs underperform on low-resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling, creating a need for efficient solutions that don't require full model retraining.Method: Align sentence embeddings from a frozen multilingual encoder to the decoder’s latent embedding space via contrastive projection, expand into K soft slots, and train with minimal adapters so the frozen decoder can consume the signal without tokenizer changes.
Result: LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations, with improvements attributed to reduced tokenization inflation and stronger cross-lingual alignment.
Conclusion: Treating low-resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs, though residual weaknesses in numeric fidelity remain.
Abstract: Instruction-tuned Large Language Models (LLMs) underperform on low resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder’s latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that improvements can be attributed to reduced tokenization inflation and a stronger cross lingual alignment, despite the model having residual weaknesses in numeric fidelity. Treating low resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.
[22] MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models
Kangkun Mao, Jinru Ding, Jiayuan Chen, Mouxiao Bian, Ruiyao Chen, Xinwei Peng, Sijie Ren, Linyang Li, Jie Xu
Main category: cs.CL
TL;DR: MedCalc-Eval is a comprehensive benchmark with 700+ medical calculation tasks to evaluate LLMs’ quantitative reasoning in clinical decision-making, addressing limitations of existing datasets.
Details
Motivation: Existing medical LLM benchmarks focus on question answering and descriptive reasoning but overlook quantitative reasoning critical for clinical decision-making. Current datasets like MedCalc-Bench cover few calculation tasks and don't reflect real-world computational scenarios.Method: Created MedCalc-Eval benchmark with 700+ tasks across equation-based calculations (e.g., Cockcroft-Gault, BMI) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). Developed MedCalc-Env reinforcement learning environment on InternBootcamp framework for multi-step clinical reasoning and planning.
Result: Fine-tuned Qwen2.5-32B model achieved state-of-the-art results on MedCalc-Eval with significant improvements in numerical sensitivity, formula selection, and reasoning robustness.
Conclusion: MedCalc-Eval provides a broader and more challenging evaluation setting for medical LLMs. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding.
Abstract: As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs’ medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
[23] Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Deokhyung Kang, Seonjeong Hwang, Daehui Kim, Hyounghun Kim, Gary Geunbae Lee
Main category: cs.CL
TL;DR: The multilingual reasoning gap in language models stems from understanding failures, which can be detected and selectively mitigated through translation.
Details
Motivation: To understand and address the multilingual reasoning gap where models perform better in high-resource languages than low-resource ones, by identifying its underlying causes.Method: Proposed Selective Translation strategy that translates multilingual input to English only when understanding failures are detected, using supervised detection methods.
Result: Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs.
Conclusion: Understanding failures are the primary cause of the multilingual reasoning gap and can be effectively detected and selectively mitigated, providing a path toward more equitable multilingual reasoning.
Abstract: Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding-the model’s inability to represent the multilingual input meaning into the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.
[24] A Unified Representation Underlying the Judgment of Large Language Models
Yi-Long Lu, Jiajun Song, Wei Wang
Main category: cs.CL
TL;DR: LLMs use a unified Valence-Assent Axis (VAA) that combines evaluative judgments and factual assent, creating a control signal that subordinates reasoning to justify predetermined conclusions, leading to bias and hallucinations.
Details
Motivation: To determine whether AI judgment relies on specialized modules or a unified domain-general resource, specifically investigating if decodable neural representations in LLMs are truly independent systems.Method: Analyzed diverse evaluative judgments across multiple LLMs, identified the dominant Valence-Assent Axis (VAA), and conducted direct interventions to test its function as a control signal.
Result: Found that VAA jointly encodes subjective valence and factual assent, creating a dependency where reasoning is subordinated to justify evaluative states, even at the cost of factual accuracy.
Conclusion: LLMs have a convergent architecture where coherent judgment systematically undermines faithful reasoning, providing a mechanistic explanation for systemic bias and hallucinations.
Abstract: A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence (“what is good”) and the model’s assent to factual claims (“what is true”). Through direct interventions, we show this unified representation creates a critical dependency: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. This mechanism, which we term the subordination of reasoning, shifts the process of reasoning from impartial inference toward goal-directed justification. Our discovery offers a mechanistic account for systemic bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.
[25] TransAlign: Machine Translation Encoders are Strong Word Aligners, Too
Benedikt Ebing, Christian Goldschmied, Goran Glavaš
Main category: cs.CL
TL;DR: TransAlign is a novel word aligner that uses MT model encoders for cross-lingual transfer, outperforming existing methods in token classification tasks.
Details
Motivation: Current multilingual word aligners for cross-lingual transfer rely on encoder models like mBERT or LaBSE, while MT models are underutilized despite their natural alignment capabilities.Method: Proposes TransAlign, a word aligner that leverages the encoder of massively multilingual machine translation models for label projection in translate-test and translate-train approaches.
Result: TransAlign achieves strong word alignment performance and substantially outperforms popular word aligners and state-of-the-art non-alignment-based label projection methods.
Conclusion: MT model encoders can be effectively utilized for word alignment, providing superior performance for cross-lingual transfer in token classification tasks.
Abstract: In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test – evaluating on noisy source language data translated from the target language – and translate-train – training on noisy target language data translated from the source language – have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.
[26] ThoughtProbe: Classifier-Guided LLM Thought Space Exploration via Probing Representations
Zijian Wang, Chang Xu
Main category: cs.CL
TL;DR: ThoughtProbe is an inference-time framework that uses LLMs’ hidden reasoning features to guide tree-structured exploration, employing classifiers to prioritize candidates and branch aggregation to identify optimal answers.
Details
Motivation: To improve LLM reasoning performance by leveraging hidden representations as discriminative signals rather than manipulating them, enabling more efficient exploration of reasoning chains.Method: Uses tree-structured response space exploration with classifiers for scoring/ranking candidates, followed by branch aggregation that marginalizes over supporting branches using CoT scores.
Result: Achieves significant improvements across multiple arithmetic reasoning benchmarks by effectively covering and identifying valid reasoning chains.
Conclusion: ThoughtProbe’s comprehensive exploration framework successfully enhances LLM reasoning performance through efficient resource allocation and optimal answer identification.
Abstract: This paper introduces ThoughtProbe, a novel inference time framework that leverages the hidden reasoning features of Large Language Models (LLMs) to improve their reasoning performance. Unlike previous works that manipulate the hidden representations to steer LLM generation, we harness them as discriminative signals to guide the tree structured response space exploration. In each node expansion, a classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by prioritizing higher score candidates for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We then propose a branch aggregation method that marginalizes over all supporting branches by aggregating their CoT scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework’s comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.
[27] From the Rock Floor to the Cloud: A Systematic Survey of State-of-the-Art NLP in Battery Life Cycle
Tosin Adewumi, Martin Karlsson, Marcus Liwicki, Mikael Sjödahl, Lama Alkhaled, Rihab Gargouri, Nudrat Habib, Franz Hennie
Main category: cs.CL
TL;DR: A systematic survey of NLP applications across the entire battery life cycle, introducing a Technical Language Processing (TLP) framework for digital battery passports and battery predictions.
Details
Motivation: To comprehensively review NLP applications throughout the complete battery life cycle rather than focusing on single stages or methods, and address challenges in the battery domain through a novel TLP framework.Method: Used PRISMA systematic review methodology with three databases (Google Scholar, IEEE Xplore, Scopus), assessed 274 papers and critically reviewed 66 relevant papers. Proposed a TLP framework incorporating agentic AI and optimized prompts.
Result: Found emerging NLP tasks in battery domain that facilitate materials discovery and other life cycle stages, but identified challenges like lack of standard benchmarks. Publicly provided review artifacts for validation.
Conclusion: The proposed TLP framework with agentic AI and optimized prompts is suitable for addressing current challenges in battery NLP applications, particularly for digital battery passports and general battery predictions.
Abstract: We present a comprehensive systematic survey of the application of natural language processing (NLP) along the entire battery life cycle, instead of one stage or method, and introduce a novel technical language processing (TLP) framework for the EU’s proposed digital battery passport (DBP) and other general battery predictions. We follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method and employ three reputable databases or search engines, including Google Scholar, Institute of Electrical and Electronics Engineers Xplore (IEEE Xplore), and Scopus. Consequently, we assessed 274 scientific papers before the critical review of the final 66 relevant papers. We publicly provide artifacts of the review for validation and reproducibility. The findings show that new NLP tasks are emerging in the battery domain, which facilitate materials discovery and other stages of the life cycle. Notwithstanding, challenges remain, such as the lack of standard benchmarks. Our proposed TLP framework, which incorporates agentic AI and optimized prompts, will be apt for tackling some of the challenges.
[28] Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs
Jiahao Liu, Zijian Wang, Kuo Zhao, Dong Hu
Main category: cs.CL
TL;DR: IntAttn-Edit is a knowledge editing method that jointly updates both MLP and attention modules in LLMs using a knowledge balancing strategy, achieving better performance than methods that only edit MLP modules.
Details
Motivation: Most existing knowledge editing methods focus only on MLP modules, ignoring attention modules which also store factual knowledge, leading to residual outdated knowledge and limited editing effectiveness.Method: Extends associative memory paradigm to jointly update both MLP and Attn modules using a knowledge balancing strategy that allocates update magnitudes proportional to each module’s contribution to knowledge storage.
Result: Achieves higher edit success, better generalization, and stronger knowledge preservation than prior methods on standard benchmarks. The balancing strategy maintains optimal performance across diverse settings.
Conclusion: Attention modules play a substantial role in factual knowledge storage, especially in earlier layers, and joint editing of both MLP and Attn modules with balanced updates significantly improves knowledge editing effectiveness.
Abstract: Knowledge editing has emerged as an efficient approach for updating factual knowledge in large language models (LLMs). It typically locates knowledge storage modules and then modifies their parameters. However, most existing methods focus on the weights of multilayer perceptron (MLP) modules, which are often identified as the main repositories of factual information. Other components, such as attention (Attn) modules, are often ignored during editing. This imbalance can leave residual outdated knowledge and limit editing effectiveness. We perform comprehensive knowledge localization experiments on advanced LLMs and find that Attn modules play a substantial role in factual knowledge storage and retrieval, especially in earlier layers. Based on these insights, we propose IntAttn-Edit, a method that extends the associative memory paradigm to jointly update both MLP and Attn modules. Our approach uses a knowledge balancing strategy that allocates update magnitudes in proportion to each module’s measured contribution to knowledge storage. Experiments on standard benchmarks show that IntAttn-Edit achieves higher edit success, better generalization, and stronger knowledge preservation than prior methods. Further analysis shows that the balancing strategy keeps editing performance within an optimal range across diverse settings.
[29] Awal – Community-Powered Language Technology for Tamazight
Alp Öktem, Farida Boudichat
Main category: cs.CL
TL;DR: Awal is a community-powered initiative for developing language technology resources for Tamazight, addressing data scarcity through a collaborative platform. Despite positive reception, actual contributions were modest due to barriers like limited confidence in written Tamazight and standardization challenges.
Details
Motivation: To address the underrepresentation of Tamazight in digital spaces and persistent data scarcity for language technology development.Method: Launched awaldigital.org platform enabling speakers to contribute translation and voice data through community-driven crowdsourcing approach.
Result: After 18 months: 6,421 translation pairs and 3 hours of speech data collected. Contributions concentrated among linguists and activists, revealing significant barriers to broader participation.
Conclusion: Standard crowdsourcing approaches have limitations for languages with complex sociolinguistic contexts like Tamazight. The initiative continues working on improved open-source machine translation models using collected data.
Abstract: This paper presents Awal, a community-powered initiative for developing language technology resources for Tamazight. We provide a comprehensive review of the NLP landscape for Tamazight, examining recent progress in computational resources, and the emergence of community-driven approaches to address persistent data scarcity. Launched in 2024, awaldigital.org platform addresses the underrepresentation of Tamazight in digital spaces through a collaborative platform enabling speakers to contribute translation and voice data. We analyze 18 months of community engagement, revealing significant barriers to participation including limited confidence in written Tamazight and ongoing standardization challenges. Despite widespread positive reception, actual data contribution remained concentrated among linguists and activists. The modest scale of community contributions – 6,421 translation pairs and 3 hours of speech data – highlights the limitations of applying standard crowdsourcing approaches to languages with complex sociolinguistic contexts. We are working on improved open-source MT models using the collected data.
[30] Dynamic Affective Memory Management for Personalized LLM Agents
Junfeng Lu, Yueyan Li
Main category: cs.CL
TL;DR: A new memory management system for AI agents that uses Bayesian-inspired memory updates with entropy minimization to address memory redundancy, staleness, and poor integration issues in personalized AI systems.
Details
Motivation: Current AI agent systems rely on personalized external memory databases but suffer from memory redundancy, staleness, and poor memory-context integration due to ineffective memory updates during interaction.Method: Proposes a Bayesian-inspired memory update algorithm using memory entropy concept, enabling agents to autonomously maintain dynamically updated memory vector databases by minimizing global entropy for better personalization.
Result: Experimental results on DABench benchmark show superior performance in personalization, logical coherence, and accuracy. Ablation studies validate the Bayesian update mechanism effectively alleviates memory bloat.
Conclusion: The work provides new insights into designing long-term memory systems for AI agents, demonstrating that dynamic memory management with entropy minimization improves personalized service delivery.
Abstract: Advances in large language models are making personalized AI agents a new research focus. While current agent systems primarily rely on personalized external memory databases to deliver customized experiences, they face challenges such as memory redundancy, memory staleness, and poor memory-context integration, largely due to the lack of effective memory updates during interaction. To tackle these issues, we propose a new memory management system designed for affective scenarios. Our approach employs a Bayesian-inspired memory update algorithm with the concept of memory entropy, enabling the agent to autonomously maintain a dynamically updated memory vector database by minimizing global entropy to provide more personalized services. To better evaluate the system’s effectiveness in this context, we propose DABench, a benchmark focusing on emotional expression and emotional change toward objects. Experimental results demonstrate that, our system achieves superior performance in personalization, logical coherence, and accuracy. Ablation studies further validate the effectiveness of the Bayesian-inspired update mechanism in alleviating memory bloat. Our work offers new insights into the design of long-term memory systems.
[31] VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision
Xuan Gong, Senmiao Wang, Hanbo Huang, Ruoyu Sun, Shiyu Liang
Main category: cs.CL
TL;DR: VCORE introduces variance-controlled optimization-based reweighting to improve supervised fine-tuning on long chain-of-thought trajectories by adaptively allocating supervision across tokens based on their contributions to reasoning.
Details
Motivation: Standard cross-entropy loss treats all tokens equally in chain-of-thought training, ignoring heterogeneous token contributions and leading to misallocated supervision and weak generalization in complex reasoning tasks.Method: VCORE reformulates chain-of-thought supervision as a constrained optimization problem from an optimization-theoretic perspective, enabling principled and adaptive allocation of supervision across tokens.
Result: VCORE consistently outperforms existing token reweighting methods, achieving substantial performance gains on mathematical and coding benchmarks across in-domain and out-of-domain settings using Qwen3 series and LLaMA-3.1-8B-Instruct models.
Conclusion: VCORE provides more effective initialization for subsequent reinforcement learning and establishes a stronger foundation for advancing LLM reasoning capabilities through principled supervision allocation.
Abstract: Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce \textbf{V}ariance-\textbf{C}ontrolled \textbf{O}ptimization-based \textbf{RE}weighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The Code will be released at https://github.com/coder-gx/VCORE.
[32] Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning
Chenyang Shao, Sijian Ren, Fengli Xu, Yong Li
Main category: cs.CL
TL;DR: Efficient collaborative reasoning framework using diffusion language models (DLMs) to generate candidate thoughts and LLMs to evaluate them, reducing computational burden while maintaining reasoning quality.
Details
Motivation: LLMs' autoregressive generation requires excessive computation for marginal performance gains in reasoning tasks, while DLMs can efficiently produce diverse samples in a single forward pass.Method: Propose a collaborative framework where DLMs generate diverse candidate thoughts through parallel denoising, and LLMs evaluate the quality of these thoughts.
Result: Experiments across diverse benchmarks show strong performance in complex reasoning tasks with reduced computational overhead.
Conclusion: The framework offers a promising direction for efficient reasoning by leveraging the complementary strengths of DLMs and LLMs.
Abstract: In recent years, large language models (LLMs) have witnessed remarkable advancements, with the test-time scaling law consistently enhancing the reasoning capabilities. Through systematic evaluation and exploration of a diverse spectrum of intermediate thoughts, LLMs demonstrate the potential to generate deliberate reasoning steps, thereby substantially enhancing reasoning accuracy. However, LLMs’ autoregressive generation paradigm results in reasoning performance scaling sub-optimally with test-time computation, often requiring excessive computational overhead to propose thoughts while yielding only marginal performance gains. In contrast, diffusion language models (DLMs) can efficiently produce diverse samples through parallel denoising in a single forward pass, inspiring us to leverage them for proposing intermediate thoughts, thereby alleviating the computational burden associated with autoregressive generation while maintaining quality. In this work, we propose an efficient collaborative reasoning framework, leveraging DLMs to generate candidate thoughts and LLMs to evaluate their quality. Experiments across diverse benchmarks demonstrate that our framework achieves strong performance in complex reasoning tasks, offering a promising direction for future research. Our code is open-source at https://anonymous.4open.science/r/Diffuse-Thinking-EC60.
[33] The aftermath of compounds: Investigating Compounds and their Semantic Representations
Swarang Joshi
Main category: cs.CL
TL;DR: This study compares computational embeddings (GloVe vs BERT) against human semantic judgments for English compound words, finding BERT better captures compositional semantics and predictability is a strong predictor of semantic transparency.
Details
Motivation: To investigate how well computational embeddings align with human semantic judgments in processing English compound words, specifically comparing static vs contextualized embeddings.Method: Compared GloVe and BERT embeddings against human ratings of lexeme meaning dominance and semantic transparency using association strength, frequency, and predictability measures, with Spearman correlation and regression analyses.
Result: BERT embeddings better capture compositional semantics than GloVe, and predictability ratings are strong predictors of semantic transparency in both human and model data.
Conclusion: The findings advance computational psycholinguistics by clarifying factors driving compound word processing and offering insights into embedding-based semantic modeling.
Abstract: This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearmans correlation and regression analyses. Our results show that BERT embeddings better capture compositional semantics than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound word processing and offering insights into embedding-based semantic modeling.
[34] Effect of Domain Generalization Techniques in Low Resource Systems
Mahi Aminu, Chisom Chibuike, Fatimo Adebanjo, Omokolade Awosanya, Samuel Oyeneye
Main category: cs.CL
TL;DR: Two causal domain generalization methods (causal data augmentation and invariant causal representation learning) improve robustness to distribution shifts in low-resource NLP tasks, with consistent cross-domain accuracy gains in sentiment classification.
Details
Motivation: Address distribution shift challenges in low-resource settings where data scarcity and limited domain diversity hinder robust generalization, using causal mechanisms to learn domain-invariant features.Method: 1) Causal data augmentation (CDA) generates counterfactual examples for sentiment classification on NaijaSenti Twitter corpus. 2) Invariant causal representation learning (ICRL) using DINER framework adapted for multilingual sentiment analysis.
Result: Both approaches enhance robustness to unseen domains: CDA yields consistent cross-domain accuracy gains in sentiment classification, while ICRL with DINER improves out-of-distribution performance in multilingual sentiment analysis with varying gains across languages.
Conclusion: Causal domain generalization techniques are effective for improving model robustness in low-resource NLP settings, though performance gains may vary across different languages and applications.
Abstract: Machine learning models typically assume that training and test data follow the same distribution, an assumption that often fails in real-world scenarios due to distribution shifts. This issue is especially pronounced in low-resource settings, where data scarcity and limited domain diversity hinder robust generalization. Domain generalization (DG) approaches address this challenge by learning features that remain invariant across domains, often using causal mechanisms to improve model robustness. In this study, we examine two distinct causal DG techniques in low-resource natural language tasks. First, we investigate a causal data augmentation (CDA) approach that automatically generates counterfactual examples to improve robustness to spurious correlations. We apply this method to sentiment classification on the NaijaSenti Twitter corpus, expanding the training data with semantically equivalent paraphrases to simulate controlled distribution shifts. Second, we explore an invariant causal representation learning (ICRL) approach using the DINER framework, originally proposed for debiasing aspect-based sentiment analysis. We adapt DINER to a multilingual setting. Our findings demonstrate that both approaches enhance robustness to unseen domains: counterfactual data augmentation yields consistent cross-domain accuracy gains in sentiment classification, while causal representation learning with DINER improves out-of-distribution performance in multilingual sentiment analysis, albeit with varying gains across languages.
[35] A Transformer-based Neural Architecture Search Method
Shang Wang, Huanrong Tang, Jianquan Ouyang
Main category: cs.CL
TL;DR: Neural architecture search using Transformer with multi-objective genetic algorithm for machine translation, incorporating perplexity as auxiliary metric alongside BLEU scores.
Details
Motivation: To find better neural network structures for translation tasks by searching across different multihead attention computation ways and encoder-decoder combinations.Method: Transformer-based neural architecture search with multi-objective genetic algorithm, using both BLEU scores and perplexity as evaluation metrics to iteratively improve neural network populations.
Result: The searched neural network structures outperform all baseline models, and using perplexity as auxiliary metric finds better models than using BLEU score alone.
Conclusion: Multi-objective genetic algorithm with auxiliary perplexity metric effectively discovers superior Transformer architectures for translation tasks.
Abstract: This paper presents a neural architecture search method based on Transformer architecture, searching cross multihead attention computation ways for different number of encoder and decoder combinations. In order to search for neural network structures with better translation results, we considered perplexity as an auxiliary evaluation metric for the algorithm in addition to BLEU scores and iteratively improved each individual neural network within the population by a multi-objective genetic algorithm. Experimental results show that the neural network structures searched by the algorithm outperform all the baseline models, and that the introduction of the auxiliary evaluation metric can find better models than considering only the BLEU score as an evaluation metric.
[36] BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization
Desta Haileselassie Hagos, Legand L. Burge, Anietie Andy, Anis Yazidi, Vladimir Vlassov
Main category: cs.CL
TL;DR: BiSparse-AAS is a novel transformer framework that combines sparse attention, adaptive spans, and bilinear attention to overcome quadratic complexity limitations in long document summarization, achieving significant performance improvements while maintaining efficiency.
Details
Motivation: Transformer-based architectures face scalability issues due to quadratic complexity when processing long documents, limiting their practical application in real-world text summarization tasks.Method: Combines three key components: sparse attention to reduce computational costs by focusing on relevant input parts, adaptive spans that dynamically adjust attention ranges, and bilinear attention to model complex token interactions within the refined context.
Result: Consistently outperforms state-of-the-art baselines in both extractive and abstractive summarization, achieving average ROUGE improvements of 68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance on OpenWebText and Gigaword datasets.
Conclusion: BiSparse-AAS provides a unified, practical solution that addresses efficiency, scalability, and long-sequence modeling challenges for real-world text summarization applications.
Abstract: Transformer-based architectures have advanced text summarization, yet their quadratic complexity limits scalability on long documents. This paper introduces BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a novel framework that combines sparse attention, adaptive spans, and bilinear attention to address these limitations. Sparse attention reduces computational costs by focusing on the most relevant parts of the input, while adaptive spans dynamically adjust the attention ranges. Bilinear attention complements both by modeling complex token interactions within this refined context. BiSparse-AAS consistently outperforms state-of-the-art baselines in both extractive and abstractive summarization tasks, achieving average ROUGE improvements of about 68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance on OpenWebText and Gigaword datasets. By addressing efficiency, scalability, and long-sequence modeling, BiSparse-AAS provides a unified, practical solution for real-world text summarization applications.
[37] Detecting Prefix Bias in LLM-based Reward Models
Ashwin Kumar, Yuzi He, Aram H. Markosyan, Bobbie Chern, Imanol Arrieta-Ibarra
Main category: cs.CL
TL;DR: This paper investigates prefix bias in LLM-based reward models trained on human preference data, revealing systematic biases across racial and gender dimensions, and proposes a data augmentation strategy for mitigation.
Details
Motivation: While RLHF is widely used for fine-tuning language models with human preference data, the potential biases in resulting reward models remain underexplored, particularly prefix bias triggered by minor query variations.Method: The authors introduce novel methods to detect and evaluate prefix bias, conduct comprehensive evaluation across diverse open-source preference datasets and reward model architectures, and propose a data augmentation strategy to mitigate these biases.
Result: The study reveals significant biases in preference models across racial and gender dimensions, showing susceptibility to prefix bias regardless of underlying model architecture. The proposed data augmentation strategy effectively reduces the impact of prefix bias.
Conclusion: The findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models, contributing to fairness in AI.
Abstract: Reinforcement Learning with Human Feedback (RLHF) has emerged as a key paradigm for task-specific fine-tuning of language models using human preference data. While numerous publicly available preference datasets provide pairwise comparisons of responses, the potential for biases in the resulting reward models remains underexplored. In this work, we introduce novel methods to detect and evaluate prefix bias – a systematic shift in model preferences triggered by minor variations in query prefixes – in LLM-based reward models trained on such datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our comprehensive evaluation spans diverse open-source preference datasets and reward model architectures, demonstrating susceptibility to this kind of bias regardless of the underlying model architecture. Furthermore, we propose a data augmentation strategy to mitigate these biases, showing its effectiveness in reducing the impact of prefix bias. Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models, contributing to the broader discourse on fairness in AI.
[38] SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps
Neha Srikanth, Victor Bursztyn, Puneet Mathur, Ani Nenkova
Main category: cs.CL
TL;DR: SQLSpace is a human-interpretable representation for text-to-SQL examples that enables detailed analysis of benchmarks, model performance evaluation, and targeted query rewriting.
Details
Motivation: To create a compact, generalizable representation for text-to-SQL examples that facilitates better understanding of benchmark composition and model performance beyond simple accuracy metrics.Method: Develop SQLSpace representations derived from text-to-SQL examples with minimal human intervention, then apply them to analyze benchmark composition, evaluate model performance granularly, and improve models through query rewriting based on correctness estimation.
Result: SQLSpace reveals compositional differences between benchmarks, exposes performance patterns hidden by accuracy scores alone, and supports modeling of query success for targeted improvements.
Conclusion: SQLSpace provides valuable analytical capabilities for text-to-SQL research that would be difficult with raw examples alone, enabling deeper insights into benchmark characteristics and model behavior.
Abstract: We introduce SQLSpace, a human-interpretable, generalizable, compact representation for text-to-SQL examples derived with minimal human intervention. We demonstrate the utility of these representations in evaluation with three use cases: (i) closely comparing and contrasting the composition of popular text-to-SQL benchmarks to identify unique dimensions of examples they evaluate, (ii) understanding model performance at a granular level beyond overall accuracy scores, and (iii) improving model performance through targeted query rewriting based on learned correctness estimation. We show that SQLSpace enables analysis that would be difficult with raw examples alone: it reveals compositional differences between benchmarks, exposes performance patterns obscured by accuracy alone, and supports modeling of query success.
[39] Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design
Maria Lizarazo Jimenez, Ana Gabriela Claros, Kieran Green, David Toro-Tobon, Felipe Larios, Sheena Asthana, Camila Wenczenovicz, Kerly Guevara Maldonado, Luis Vilatuna-Andrango, Cristina Proano-Velez, Satya Sai Sri Bandi, Shubhangi Bagewadi, Megan E. Branda, Misk Al Zahidy, Saturnino Luz, Mirella Lapata, Juan P. Brito, Oscar J. Ponce-Ponte
Main category: cs.CL
TL;DR: Proposes Patient-Centered Summaries (PCS) as a new AI standard for clinical summarization that captures patient values, preferences and concerns, not just biological information. Evaluates open-source LLMs against human performance in generating PCS from clinical conversations.
Details
Motivation: Current LLM-generated clinical summaries focus too much on patients' biology while neglecting their preferences, values, wishes, and concerns, which are essential for patient-centered care.Method: Mixed-methods approach: Patient and clinician interviews to define PCS requirements, creation of gold-standard PCS annotations from 88 consultations, evaluation of 5 open-source LLMs using zero-shot and few-shot prompting with ROUGE-L, BERTScore, and qualitative metrics.
Result: Best zero-shot performance: Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); best few-shot: Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Models matched human performance in completeness and fluency, but human PCS were better in correctness and patient-centeredness.
Conclusion: Open-source LLMs can generate clinically useful patient-centered summaries but still lag behind human experts in capturing patient values and ensuring accuracy, highlighting the need for continued development in patient-centered AI clinical summarization.
Abstract: Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients’ biology rather than their preferences, values, wishes, and concerns. To achieve patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objective was to develop a framework to generate PCS that capture patient values and ensure clinical utility and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. Findings informed annotation guidelines used by eight clinicians to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, while correctness and patient-centeredness favored human PCS.
[40] DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models
Malik H. Altakrori, Nizar Habash, Abdelhakim Freihat, Younes Samih, Kirill Chirkunov, Muhammed AbuOdeh, Radu Florian, Teresa Lynn, Preslav Nakov, Alham Fikri Aji
Main category: cs.CL
TL;DR: DialectalArabicMMLU is a new benchmark that extends MMLU-Redux to evaluate LLM performance across 5 major Arabic dialects, revealing significant performance gaps in dialectal understanding.
Details
Motivation: While Arabic and multilingual benchmarks exist for Modern Standard Arabic, dialectal varieties remain underrepresented despite their prevalence in everyday communication, creating a need for more inclusive evaluation.Method: Manual translation and adaptation of 3K multiple-choice question-answer pairs from MMLU-Redux into five major Arabic dialects (Syrian, Egyptian, Emirati, Saudi, Moroccan), creating 15K QA pairs across 32 domains.
Result: Evaluation of 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) revealed substantial performance variation across dialects, showing persistent gaps in dialectal generalization.
Conclusion: DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, promoting more inclusive evaluation and future model development.
Abstract: We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.
[41] Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality
Yinghao Luo, Lang Zhou, Amrish Jhingoer, Klaske Vliegenthart Jongbloed, Carlijn Jordans, Ben Werkhoven, Tom Seinen, Erik van Mulligen, Casper Rokx, Yunlei Li
Main category: cs.CL
TL;DR: Domain adaptation through further pre-training on medical corpora significantly improves multilingual BERT performance on healthcare NLP tasks for low-resource languages (Dutch, Romanian, Spanish), with clinical domain adaptation outperforming general biomedical adaptation and showing cross-lingual transferability.
Details
Motivation: Limited availability of domain-specific NLP tools for low-resource languages in healthcare applications, and underexplored medical NLP tasks despite multilingual BERT's potential to bridge language gaps.Method: Four experiments of further pre-training on domain-specific corpora to create medical domain models, followed by fine-tuning on three downstream tasks: Dutch patient screening, Romanian and Spanish clinical named entity recognition.
Result: Domain adaptation significantly enhanced task performance, with clinical domain-adapted models outperforming general biomedical domain-adapted models. Evidence of cross-lingual transferability was observed.
Conclusion: Domain adaptation and cross-lingual transfer are feasible approaches to mitigate training data scarcity and improve model performance in multilingual medical NLP systems for low-resource languages.
Abstract: In multilingual healthcare applications, the availability of domain-specific natural language processing(NLP) tools is limited, especially for low-resource languages. Although multilingual bidirectional encoder representations from transformers (BERT) offers a promising motivation to mitigate the language gap, the medical NLP tasks in low-resource languages are still underexplored. Therefore, this study investigates how further pre-training on domain-specific corpora affects model performance on medical tasks, focusing on three languages: Dutch, Romanian and Spanish. In terms of further pre-training, we conducted four experiments to create medical domain models. Then, these models were fine-tuned on three downstream tasks: Automated patient screening in Dutch clinical notes, named entity recognition in Romanian and Spanish clinical notes. Results show that domain adaptation significantly enhanced task performance. Furthermore, further differentiation of domains, e.g. clinical and general biomedical domains, resulted in diverse performances. The clinical domain-adapted model outperformed the more general biomedical domain-adapted model. Moreover, we observed evidence of cross-lingual transferability. Moreover, we also conducted further investigations to explore potential reasons contributing to these performance differences. These findings highlight the feasibility of domain adaptation and cross-lingual ability in medical NLP. Within the low-resource language settings, these findings can provide meaningful guidance for developing multilingual medical NLP systems to mitigate the lack of training data and thereby improve the model performance.
[42] Data-Efficient Domain Adaptation for LLM-based MT using Contrastive Preference Optimization
Inacio Vieira, Antonio Castaldo, James O’Doherty, Sheila Castilho
Main category: cs.CL
TL;DR: CPO enables data-efficient domain adaptation by using model’s raw outputs as rejected samples and human-approved translations as chosen ones, achieving similar performance to SFT with 10x less data.
Details
Motivation: LLMs need domain adaptation but SFT is expensive; CPO offers a more data-efficient alternative by leveraging contrastive learning from model's own outputs.Method: Synthesize preference pairs using base model’s raw output as ‘rejected’ and human-approved TM entry as ‘chosen’, applying CPO for domain adaptation.
Result: With only 14.7k preference pairs, model achieves performance close to SFT trained on 160k+ samples in English-Brazilian Portuguese and English-Korean MT tasks.
Conclusion: CPO provides significant data efficiency for domain adaptation and generalizes to other generative tasks where initial drafts can contrast with golden references.
Abstract: LLMs often require adaptation to domain-specific requirements, a process that can be expensive when relying solely on SFT. We present an empirical study on applying CPO to simulate a post-editing workflow for data-efficient domain adaptation. Our approach synthesizes preference pairs by treating the base model’s own raw output as the ‘rejected’ translation and the human-approved TM entry as the ‘chosen’ one. This method provides direct feedback on the model’s current knowledge, guiding it to align with domain-specific standards. Experiments in English-Brazilian Portuguese and English-Korean show that, by using just 14.7k preference pairs, the model achieves performance close to that of a model trained on 160k+ samples with SFT, demonstrating significant data efficiency. Although we showcase its effectiveness in MT, this application of CPO naturally generalizes to other generative tasks where a model’s initial drafts can serve as a contrastive signal against a golden reference.
[43] MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval
Qi Luo, Xiaonan Li, Yuxin Wang, Tingshuo Fan, Yuan Li, Xinchi Chen, Xipeng Qiu
Main category: cs.CL
TL;DR: MARAG-R1 is a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access, achieving state-of-the-art results in corpus-level reasoning tasks.
Details
Motivation: Existing RAG systems rely on a single retriever with fixed top-k selection, which restricts access to a narrow and static subset of the corpus, becoming the primary bottleneck for comprehensive external information acquisition in corpus-level reasoning tasks.Method: MARAG-R1 equips LLMs with four retrieval tools (semantic search, keyword search, filtering, and aggregation) and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning, allowing the model to interleave reasoning and retrieval.
Result: Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.
Conclusion: The proposed multi-tool RAG framework with reinforcement learning enables more comprehensive external information acquisition and better corpus-level reasoning compared to single-retriever approaches.
Abstract: Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge; However, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools – semantic search, keyword search, filtering, and aggregation – and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.
[44] SpecAttn: Speculating Sparse Attention
Harsh Shah
Main category: cs.CL
TL;DR: SpecAttn is a training-free method that integrates with speculative decoding to enable efficient sparse attention in pre-trained transformers, reducing KV cache accesses by over 75% with minimal perplexity increase.
Details
Motivation: LLMs face computational bottlenecks during inference due to quadratic complexity of self-attention, especially with long contexts. Existing methods need efficient attention mechanisms without retraining.Method: Uses draft model’s attention weights from speculative decoding to identify important tokens. Employs KL divergence-based layer alignment, GPU-optimized sorting-free top-p token selection, and dynamic KV cache pruning.
Result: Achieves over 75% reduction in KV cache accesses with only 15.29% perplexity increase on PG-19 dataset, significantly outperforming existing sparse attention methods.
Conclusion: Speculative execution can be enhanced to provide approximate verification without significant performance degradation, demonstrating effective sparse attention through existing speculative decoding pipelines.
Abstract: Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
[45] Culture Cartography: Mapping the Landscape of Cultural Knowledge
Caleb Ziems, William Held, Jane Yu, Amir Goldberg, David Grusky, Diyi Yang
Main category: cs.CL
TL;DR: CultureCartography is a mixed-initiative method where LLMs identify knowledge gaps through low-confidence questions, allowing humans to fill these gaps and steer towards culturally salient topics, producing knowledge that improves model performance on culture benchmarks.
Details
Motivation: LLMs need culture-specific knowledge that may not be learned during pre-training, requiring identification of knowledge that is salient to in-group users but unknown to LLMs.Method: CultureCartography: LLM initializes annotation with low-confidence questions, human respondents fill gaps and steer towards salient topics through direct edits, implemented as CultureExplorer tool.
Result: CultureExplorer produces knowledge that leading models (DeepSeek R1, GPT-4o) miss even with web search. Fine-tuning on this data boosts Llama-3.1-8B accuracy by up to 19.2% on culture benchmarks.
Conclusion: Mixed-initiative collaboration between humans and LLMs effectively identifies and fills cultural knowledge gaps, improving model performance on culture-specific tasks.
Abstract: To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher’s goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.
[46] Continuous Autoregressive Language Models
Chenze Shao, Darren Li, Fandong Meng, Jie Zhou
Main category: cs.CL
TL;DR: CALM introduces continuous next-vector prediction to replace discrete next-token generation, compressing K tokens into single vectors to reduce generative steps and improve computational efficiency.
Details
Motivation: Overcome the fundamental bottleneck of sequential token-by-token generation in LLMs by increasing semantic bandwidth per generative step.Method: Uses high-fidelity autoencoder to compress token chunks into continuous vectors, enabling next-vector prediction with likelihood-free training framework for robust continuous domain operations.
Result: Achieves performance of strong discrete baselines at significantly lower computational cost, with over 99.9% token reconstruction accuracy.
Conclusion: Next-vector prediction establishes a powerful and scalable pathway towards ultra-efficient language models.
Abstract: The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.
[47] MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao
Main category: cs.CL
TL;DR: MindSearch is an LLM-based multi-agent framework that mimics human cognitive processes for web information seeking and integration, addressing challenges in complex query retrieval, distributed information across multiple pages, and context length limitations.
Details
Motivation: Current LLM-search engine combinations have unsatisfying performance due to three main challenges: complex queries requiring multiple retrievals, information spread across multiple web pages with noise, and web page content exceeding LLM context limits.Method: Uses a multi-agent framework with WebPlanner that models information seeking as dynamic graph construction (decomposing queries into sub-questions) and WebSearcher that performs hierarchical information retrieval. Enables parallel processing of large-scale web pages.
Result: Significantly improves response quality in depth and breadth on both close-set and open-set QA problems. Processes 300+ web pages in 3 minutes (equivalent to 3 hours of human effort). Responses preferred by humans over ChatGPT-Web and Perplexity.ai.
Conclusion: MindSearch delivers a competitive solution to proprietary AI search engines, demonstrating that mimicking human cognitive processes through multi-agent design effectively addresses web information seeking challenges.
Abstract: Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests often cannot be accurately and completely retrieved by the search engine once (2) corresponding information to be integrated is spread over multiple web pages along with massive noise, and (3) a large number of web pages with long contents may quickly exceed the maximum context length of LLMs. Inspired by the cognitive process when humans solve these problems, we introduce MindSearch to mimic the human minds in web information seeking and integration, which can be instantiated by a simple yet effective LLM-based multi-agent framework. The WebPlanner models the human mind of multi-step information seeking as a dynamic graph construction process: it decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search result from WebSearcher. Tasked with each sub-question, WebSearcher performs hierarchical information retrieval with search engines and collects valuable information for WebPlanner. The multi-agent design of MindSearch enables the whole framework to seek and integrate information parallelly from larger-scale (e.g., more than 300) web pages in 3 minutes, which is worth 3 hours of human effort. MindSearch demonstrates significant improvement in the response quality in terms of depth and breadth, on both close-set and open-set QA problems. Besides, responses from MindSearch based on InternLM2.5-7B are preferable by humans to ChatGPT-Web and Perplexity.ai applications, which implies that MindSearch can already deliver a competitive solution to the proprietary AI search engine.
[48] SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks
Fenia Christopoulou, Ronald Cardenas, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang
Main category: cs.CL
TL;DR: SparsePO is a new Direct Preference Optimization variant that learns token-level weights for KL divergence and reward terms, focusing alignment on important tokens rather than treating all tokens equally.
Details
Motivation: Human preferences are often determined by specific words or phrases (e.g., toxic terms), not equally by all tokens. Current DPO methods treat all tokens equally, which may not align with how humans actually evaluate responses.Method: Proposes SparsePO with two weight-mask variants: one derived from the reference model and one learned during training. The method induces sparsity in masks to balance reward and KL divergence at token level.
Result: +10% win rate in summarization and +3% in dialogue scenarios. Maintains model reasoning and summary quality (relevancy and faithfulness) while improving alignment.
Conclusion: Token-level weighting in preference optimization is effective. SparsePO successfully aligns models to preferences without compromising other desirable model behaviors.
Abstract: Direct alignment algorithms have proven an effective step for aligning language models to human-desired behaviors. Current variants of the Direct Preference Optimization objective have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected equally by each word in a sequence but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best balance reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments illustrate the effectiveness of our approach at aligning to preference proxies, including sentiment control, helpfulness and harmlessness, and summary quality. Our method obtains +10% and +3% win rate points in summarization and dialogue scenarios, respectively, without compromising model reasoning or the relevancy and faithfulness of the summary response.
[49] LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Nam V. Nguyen, Thong T. Doan, Luong Tran, Van Nguyen, Quang Pham
Main category: cs.CL
TL;DR: LibMoE is a unified framework that enables reproducible, efficient, and extensible research on Mixture of Experts (MoE) architectures, addressing computational barriers and providing analytical tools for studying routing dynamics, initialization effects, and training regimes.
Details
Motivation: Systematic research on MoE architectures is severely constrained by prohibitive computational costs of training and evaluation, making large-scale studies inaccessible to most researchers.Method: Developed LibMoE framework with unified implementations for pretraining and sparse-upcycling regimes, plus transparent analytical tools for probing routing and expert dynamics. Conducted comprehensive analysis across routing dynamics, lightweight initialization effects, and training regime differences.
Result: The framework enables analysis of expert selection patterns, routing stability/optimality, routing entropy revealing task specialization, initialization effects on load balancing, and distinct routing patterns between sparse upcycling vs full pretraining.
Conclusion: LibMoE lowers barriers to MoE research, standardizes evaluation, establishes reliable benchmarks, and broadens access to guide future innovations in mixture of experts architectures.
Abstract: Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, restricting large-scale studies accessible to most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. Project page: https://fsoft-aic.github.io/fsoft-LibMoE.github.io.
[50] Multilingual State Space Models for Structured Question Answering in Indic Languages
Arpita Vats, Rahul Raja, Mrinal Mathur, Vinija Jain, Aman Chadha
Main category: cs.CL
TL;DR: This paper applies State Space Models (SSMs) to build efficient question answering systems for Indic languages, demonstrating significant improvements in handling linguistic complexities and establishing foundational benchmarks.
Details
Motivation: Indic languages present unique NLP challenges due to their diversity, complexity, rich morphology, and complex syntax, requiring specialized approaches for question answering tasks.Method: Evaluated multiple SSM architectures across diverse Indic language datasets, leveraging SSMs’ ability to model long-term and short-term dependencies in sequential data.
Result: SSMs effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation for Indic languages.
Conclusion: This work establishes the first application of SSMs to Indic language QA, proposes enhancements for low-resource settings, and sets foundational benchmarks for future research.
Abstract: The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA).To address these challenges, this paper explores the application of State Space Models (SSMs),to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. We propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.
[51] SMOL: Professionally translated parallel data for 115 under-represented languages
Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, Hadar Shemtov, Partha Talukdar, Dinesh Tewari, Baba Mamadi Diane, Djibrila Diane, Solo Farabado Cissé, Koulako Moussa Doumbouya, Edoardo Ferrante, Alessandro Guasoni, Christopher Homan, Mamadou K. Keita, Sudhamoy DebBarma, Ali Kuzhuget, David Anugraha, Muhammad Ravi Shulthan Habibi, Genta Indra Winata, Anthony Munthali, Sina Ahmadi, Andrei Chemyshev, Mingfei Lau, Jonathan Eng
Main category: cs.CL
TL;DR: SMOL is an open-source training data suite for machine translation in 124 low-resource languages, containing 6.1M translated tokens across sentence-level (SMOLSENT) and document-level (SMOLDOC) datasets, which improves translation quality when used with LLMs.
Details
Motivation: To unlock machine translation for low-resource languages that lack public resources, addressing the translation gap for many under-resourced languages.Method: Created SMOL dataset with two sub-datasets: SMOLSENT for broad token coverage and SMOLDOC for document-level topic coverage, then used these to prompt or fine-tune Large Language Models.
Result: Demonstrated robust chrF improvements in translation quality, and provided the first factuality datasets with ratings and rationales for most of these languages.
Conclusion: SMOL successfully enables machine translation for low-resource languages and provides valuable factuality resources, representing a significant contribution to multilingual NLP.
Abstract: We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock machine translation for low-resource languages. SMOL has been translated into 124 (and growing) under-resourced languages (125 language pairs), including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level resource focusing on a broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust chrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages.
[52] Scalable Best-of-N Selection for Large Language Models via Self-Certainty
Zhewei Kang, Xuandong Zhao, Dawn Song
Main category: cs.CL
TL;DR: Self-certainty is a novel metric that uses LLMs’ inherent probability distributions to estimate response quality without external reward models, enabling efficient best-of-N selection for reasoning tasks.
Details
Motivation: Current best-of-N selection methods rely on computationally intensive reward models or have limitations with open-ended generation tasks. There's a need for more efficient and scalable alternatives.Method: Proposes self-certainty metric that leverages LLMs’ probability distributions across multiple samples to estimate response quality, hypothesizing that higher distributional self-certainty correlates with better accuracy.
Result: Self-certainty scales effectively with sample size N, complements chain-of-thought reasoning, improves performance beyond greedy decoding, and generalizes to open-ended tasks where traditional methods fail.
Conclusion: Self-certainty provides a practical and efficient way to improve LLM reasoning capabilities without computational overhead of reward models, establishing it as a viable alternative for best-of-N selection.
Abstract: Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
[53] More of the Same: Persistent Representational Harms Under Increased Representation
Jennifer Mickel, Maria De-Arteaga, Leqi Liu, Kevin Tian
Main category: cs.CL
TL;DR: GAS(P) is a novel evaluation methodology for detecting distribution-level group representational biases in generative AI systems, particularly for unprompted groups. Applied to gendered representations in occupations, it reveals persistent biases in how different genders are described despite increased representation.
Details
Motivation: To address the critical gap in measuring how people are represented (not just who is represented) in generative AI systems, and to recognize and mitigate representational harms that persist even when representation appears improved.Method: Developed GAS(P) methodology to surface distribution-level group representational biases in generated text, focusing on unprompted groups. Applied to gendered representations across occupations in state-of-the-art large language models by analyzing word choice differences in biographies and personas.
Result: Found statistically significant distribution-level differences in word choice used to describe different genders across occupations, with many differences associated with representational harms and stereotypes. Showed that even when gender distribution appears balanced, representational biases persist in how genders are represented.
Conclusion: Naively increasing unprompted representation may inadvertently proliferate representational biases. The proposed GAS(P) methodology enables systematic and rigorous measurement of these distribution-level representational bias problems in generative AI systems.
Abstract: To recognize and mitigate the harms of generative AI systems, it is crucial to consider whether and how different societal groups are represented by these systems. A critical gap emerges when naively measuring or improving who is represented, as this does not consider how people are represented. In this work, we develop GAS(P), an evaluation methodology for surfacing distribution-level group representational biases in generated text, tackling the setting where groups are unprompted (i.e., groups are not specified in the input to generative systems). We apply this novel methodology to investigate gendered representations in occupations across state-of-the-art large language models. We show that, even though the gender distribution when models are prompted to generate biographies leads to a large representation of women, even representational biases persist in how different genders are represented. Our evaluation methodology reveals that there are statistically significant distribution-level differences in the word choice used to describe biographies and personas of different genders across occupations, and we show that many of these differences are associated with representational harms and stereotypes. Our empirical findings caution that naively increasing (unprompted) representation may inadvertently proliferate representational biases, and our proposed evaluation methodology enables systematic and rigorous measurement of the problem.
[54] (How) Do Language Models Track State?
Belinda Z. Li, Zifan Carl Guo, Jacob Andreas
Main category: cs.CL
TL;DR: Transformer LMs learn efficient state tracking mechanisms for permutation composition tasks, with two main approaches: associative scan and parity-based pruning with refinement.
Details
Motivation: To understand how transformer LMs track unobserved state in evolving scenarios, using permutation composition as a model problem that can represent various state tracking tasks.Method: Study LMs trained on permutation composition tasks, analyze learned mechanisms, and use intermediate training to steer toward specific algorithms.
Result: LMs learn two main mechanisms: associative scan (similar to theoretical work) and parity-based pruning with associative scan refinement. The associative scan approach generalizes better and converges faster.
Conclusion: Transformer LMs can learn interpretable state-tracking mechanisms, and their emergence can be predicted and controlled through training interventions.
Abstract: Transformer language models (LMs) exhibit behaviors – from storytelling to code generation – that seem to require tracking the unobserved state of an evolving world. How do they do this? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the “associative scan” construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, and then refines this with an associative scan. LMs that learn the former algorithm tend to generalize better and converge faster, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pre-trained or fine-tuned, can learn to implement efficient and interpretable state-tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
[55] A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models
Palakorn Achananuparp, Ee-Peng Lim, Yao Lu
Main category: cs.CL
TL;DR: A multi-stage framework (inference, retrieval, reranking) using taxonomy-guided reasoning examples improves occupation classification by aligning LLM outputs with taxonomic knowledge, offering a cost-effective alternative to large models like GPT-4o.
Details
Motivation: Occupation classification faces challenges from data scarcity and manual annotation difficulties. LLMs show promise but their knowledge of occupational taxonomies is unclear, especially for smaller models.Method: Proposed a multi-stage framework with inference, retrieval, and reranking stages that integrates taxonomy-guided reasoning examples to align outputs with taxonomic knowledge.
Result: Evaluations on large-scale datasets show enhanced performance in occupation and skill classification tasks, providing a cost-effective alternative to frontier models while maintaining strong performance.
Conclusion: The framework offers a practical and scalable solution for occupation classification and related tasks across LLMs, significantly reducing computational costs while maintaining performance.
Abstract: Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from taxonomy, highlighting their limitations, especially for smaller models. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show that our framework not only enhances occupation and skill classification tasks, but also provides a cost-effective alternative to frontier models like GPT-4o, significantly reducing computational costs while maintaining strong performance. This makes it a practical and scalable solution for occupation classification and related tasks across LLMs.
[56] FUSE : A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages
Rahul Raja, Arpita Vats
Main category: cs.CL
TL;DR: FUSE system won AmericasNLP 2025 Shared Task 3 on MT evaluation for Indigenous languages, achieving highest correlation with human judgments by combining lexical, phonetic, semantic, and fuzzy token similarity features with Ridge regression and Gradient Boosting.
Details
Motivation: Conventional MT metrics like BLEU, TER, and ChrF fail to adequately evaluate translations into Indigenous languages due to polysynthesis, complex morphology, and non-standardized orthography, particularly in capturing semantic adequacy and fluency.Method: FUSE integrates Ridge regression and Gradient Boosting with multiple linguistic similarity features (lexical, phonetic, semantic, fuzzy token), multilingual sentence embeddings, and phonological encodings. Trained on human-annotated development sets.
Result: FUSE ranked first overall based on average Pearson correlation with human annotations, consistently achieving higher Pearson and Spearman correlations with human judgments compared to conventional metrics.
Conclusion: FUSE demonstrates that combining diverse linguistic features with learning-based modeling provides a robust and linguistically informed solution for MT evaluation in low-resource settings, particularly for morphologically rich Indigenous languages.
Abstract: This paper presents the winning submission of the RaaVa team to the AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation (MT) into Indigenous Languages of America, where our system ranked first overall based on average Pearson correlation with the human annotations. We introduce Feature-Union Scorer (FUSE) for Evaluation, FUSE integrates Ridge regression and Gradient Boosting to model translation quality. In addition to FUSE, we explore five alternative approaches leveraging different combinations of linguistic similarity features and learning paradigms. FUSE Score highlights the effectiveness of combining lexical, phonetic, semantic, and fuzzy token similarity with learning-based modeling to improve MT evaluation for morphologically rich and low-resource languages. MT into Indigenous languages poses unique challenges due to polysynthesis, complex morphology, and non-standardized orthography. Conventional automatic metrics such as BLEU, TER, and ChrF often fail to capture deeper aspects like semantic adequacy and fluency. Our proposed framework, formerly referred to as FUSE, incorporates multilingual sentence embeddings and phonological encodings to better align with human evaluation. We train supervised models on human-annotated development sets and evaluate held-out test data. Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments, offering a robust and linguistically informed solution for MT evaluation in low-resource settings.
[57] HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das
Main category: cs.CL
TL;DR: HELIOS is a framework that improves throughput and batch sizes in Early-Exit LLMs by using multiple models and dynamic switching to maximize early exits, along with greedy early exiting and selective layer loading.
Details
Motivation: Existing EE-LLM frameworks are bottlenecked by tokens that don't exit early, requiring all model layers to be loaded even when unused, which limits memory savings and batch size scaling.Method: HELIOS employs multiple models with dynamic switching to maximize early exits, uses greedy early exiting for tokens that remain unchanged, and selectively loads only the most likely used layers to save memory.
Result: HELIOS achieves 1.48× higher throughput and 15.14× larger batch size compared to existing EE-LLM frameworks.
Conclusion: HELIOS effectively addresses the limitations of single-model EE-LLM frameworks through multi-model collaboration and adaptive resource management, significantly improving inference efficiency.
Abstract: Early-Exit Large Language Models (EE-LLMs) enable high throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput is limited by the computational and memory savings. Existing EE-LLM frameworks rely on a single model and therefore, their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. Therefore, these frameworks load the weights of all model layers even though large portions remain unused when tokens exit early. The lack of memory savings limit us from scaling the batch sizes. We propose $\textit{HELIOS}$, a framework that improves both token generation latency and batch sizes to enable high-throughput in EE-LLMs. HELIOS exploits two insights. $\textit{First}$, early exits are often complimentary across models, tokens that do not exit early on one model often take an early-exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early, and minimize token generation latencies. $\textit{Second}$, even when a predicted token does not exit early due to poor confidence, it often remains unchanged even after additional layer traversal. HELIOS greedily allows such tokens to exit early and only loads the weights of the most likely to be used layers, yielding memory savings which is then re-purposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify the early-exit distributions, and adaptively switches between models by tracking tokens in real-time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves $1.48\times$ higher throughput and $15.14\times$ larger batch size compared to existing EE-LLM frameworks.
[58] Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions
Wang Bill Zhu, Tianqi Chen, Xinyan Velocity Yu, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia
Main category: cs.CL
TL;DR: LLMs fail to correct false presuppositions in cancer patient questions, posing medical safety risks. A new benchmark Cancer-Myth reveals models correct only 43% of false presuppositions, and current mitigation strategies cause performance trade-offs.
Details
Motivation: Cancer patients increasingly use LLMs for medical information, but current benchmarks don't evaluate how models handle real patient questions with false assumptions, creating safety concerns for medical decision-making.Method: Created Cancer-Myth dataset with 585 expert-verified cancer questions containing false presuppositions, and Cancer-Myth-NFP set with 150 questions without false presuppositions. Tested frontier LLMs and mitigation strategies including GEPA-optimized precautionary prompts.
Result: No LLM corrected false presuppositions more than 43% on Cancer-Myth. Mitigation strategies improved accuracy to 80% but caused 41% false positives on Cancer-Myth-NFP and 10% performance drop on other medical benchmarks.
Conclusion: LLMs have critical reliability gaps in handling false presuppositions, prompting alone is insufficient for remediation, and more robust safeguards are needed for medical AI systems.
Abstract: Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM – including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet – corrects these false presuppositions more than $43%$ of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to $80%$, but at the cost of misidentifying presuppositions in $41%$ of Cancer-Myth-NFP questions and causing a $10%$ relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
[59] Minitron-SSM: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
Main category: cs.CL
TL;DR: A novel group-aware pruning strategy for compressing Hybrid LLM architectures (combining Attention and State Space Models) that preserves SSM structural integrity, enabling 50% parameter reduction (8B to 4B) with 40x fewer training tokens, achieving higher accuracy and 2x faster inference.
Details
Motivation: Hybrid LLM architectures achieve state-of-the-art performance but are large and expensive to train. While compression works well for Attention-only models, it's unclear if similar techniques apply to Hybrid models, particularly for preserving SSM capabilities.Method: Group-aware pruning strategy that specifically preserves SSM block structure, combined with SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation retraining similar to MINITRON technique.
Result: Successfully compressed Nemotron-H 8B Hybrid model to 4B parameters using 40x fewer training tokens. The compressed model achieves higher accuracy than similarly-sized models and 2x faster inference speed.
Conclusion: The proposed SSM-aware compression approach effectively compresses Hybrid LLM architectures, significantly advancing the Pareto frontier by enabling smaller, more accurate, and faster models with dramatically reduced training costs.
Abstract: Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.
[60] Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
Main category: cs.CL
TL;DR: A zero-shot video-to-text summarization approach that generates screenplay representations from TV episodes, integrating visual, dialogue, and character information while introducing a new multimodal evaluation metric.
Details
Motivation: Vision-Language Models struggle to balance visual and textual information when summarizing complex multimodal inputs like TV show episodes, and existing summarization metrics fail to properly assess multimodal content.Method: Propose a zero-shot approach that builds screenplay representations from episodes using only audio, video, and transcripts as input, simultaneously generating screenplays and naming characters without prior training.
Result: On SummScreen3D dataset, the approach generates summaries with 20% more relevant visual information than state-of-the-art VLMs like Gemini 1.5, while requiring 75% less video input.
Conclusion: The proposed zero-shot screenplay summarization method effectively integrates multimodal information and the new MFactSum metric provides better evaluation of multimodal summaries compared to existing metrics.
Abstract: Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.
[61] Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, Guo-Jun Qi
Main category: cs.CL
TL;DR: DCoLT is a reasoning framework for diffusion language models that treats intermediate diffusion steps as latent thinking actions and optimizes the entire reasoning trajectory using outcome-based RL, enabling bidirectional non-linear reasoning.
Details
Motivation: Traditional Chain-of-Thought methods follow linear causal thinking, but DCoLT aims to enable more flexible bidirectional reasoning without strict grammatical constraints in intermediate steps.Method: Uses outcome-based RL to optimize the entire diffusion trajectory, implemented on SEDD (continuous-time discrete diffusion) and LLaDA (discrete-time masked diffusion) with an Unmasking Policy Module for token prediction order.
Result: DCoLT-reinforced DLMs outperform other training methods on math and code tasks, with LLaDA showing accuracy improvements of +9.8% (GSM8K), +5.7% (MATH), +11.4% (MBPP), and +19.5% (HumanEval).
Conclusion: DCoLT effectively enhances reasoning capabilities of diffusion language models through trajectory optimization, achieving significant performance gains on complex reasoning tasks.
Abstract: We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent “thinking” action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, where its concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model – LLaDA, and find that the order to predict and unmask tokens plays an essential role to optimize its RL action resulting from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, +19.5% on GSM8K, MATH, MBPP, and HumanEval.
[62] VeriFastScore: Speeding up long-form factuality evaluation
Rishanth Rajendhran, Amir Zadeh, Matthew Sarte, Chuan Li, Mohit Iyyer
Main category: cs.CL
TL;DR: VeriFastScore is a fine-tuned Llama3.1 8B model that simultaneously extracts and verifies claims from text using Google Search evidence, achieving 6.6x speedup over VeriScore while maintaining strong correlation.
Details
Motivation: Existing factuality evaluation methods like FactScore and VeriScore are slow (100+ seconds per response) due to multiple LLM calls for claim decomposition and verification, limiting practical use in large-scale scenarios.Method: Fine-tune Llama3.1 8B on synthetic data to perform concurrent claim decomposition, verifiability judgment, and verification against noisy evidence (~4K tokens on average) from Google Search.
Result: Achieves strong correlation with VeriScore (r=0.80 example level, r=0.94 system level) with 6.6x overall speedup (9.9x excluding evidence retrieval).
Conclusion: VeriFastScore provides a fast, accurate alternative to existing factuality evaluation methods, enabling practical large-scale use, with model and datasets publicly released.
Abstract: Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
[63] W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models
Shang Wang
Main category: cs.CL
TL;DR: W-PCA is a novel zero-shot neural architecture search method for lightweight language models that uses parameter count and PCA-based metrics to evaluate models without training, achieving faster evaluation times and competitive performance.
Details
Motivation: To address challenges in zero-shot NAS methods including biased evaluation metrics and computational inefficiencies for lightweight language models.Method: Weight-weighted PCA (W-PCA) using parameter count and principal components in FFN layers as evaluation proxies, eliminating gradient computations for efficiency.
Result: Significantly reduces training time vs one-shot NAS, achieves higher test scores than training-based methods, and shows superior ranking correlation with reduced solving time.
Conclusion: W-PCA provides an efficient and effective zero-shot NAS approach for lightweight language models with improved computational efficiency and performance.
Abstract: The demand for efficient natural language processing (NLP) systems has led to the development of lightweight language models. Previous work in this area has primarily focused on manual design or training-based neural architecture search (NAS) methods. Recently, zero-shot NAS methods have been proposed for evaluating language models without the need for training. However, prevailing approaches to zero-shot NAS often face challenges such as biased evaluation metrics and computational inefficiencies. In this paper, we introduce weight-weighted PCA (W-PCA), a novel zero-shot NAS method specifically tailored for lightweight language models. Our approach utilizes two evaluation proxies: the parameter count and the number of principal components with cumulative contribution exceeding $\eta$ in the feed-forward neural (FFN) layer. Additionally, by eliminating the need for gradient computations, we optimize the evaluation time, thus enhancing the efficiency of designing and evaluating lightweight language models. We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space. Our approach exhibits superior ranking correlation and further reduces solving time compared to other zero-shot NAS methods that require gradient computation.
[64] Token Distillation: Attention-aware Input Embeddings For New Tokens
Konstantin Dobler, Desmond Elliott, Gerard de Melo
Main category: cs.CL
TL;DR: Token Distillation method quickly learns high-quality input embeddings for new tokens by distilling representations from original tokenization, outperforming strong baselines without expensive training.
Details
Motivation: Current language models use static vocabularies that perform poorly on underrepresented domains, and existing embedding initialization methods require expensive further training.Method: Propose Token Distillation - distilling representations obtained using the original tokenization to learn high-quality input embeddings for new tokens.
Result: Experimental results with various open-weight models show Token Distillation outperforms even strong baselines.
Conclusion: Token Distillation provides an effective solution for adding new tokens to language models without expensive training requirements.
Abstract: Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods require expensive further training or pretraining of additional modules. In this paper, we propose Token Distillation and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that Token Distillation outperforms even strong baselines.
[65] AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy
Sebastian Antony Joseph, Syed Murtaza Husain, Stella S. R. Offner, Stéphanie Juneau, Paul Torrey, Adam S. Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, Junyi Jessy Li
Main category: cs.CL
TL;DR: AstroVisBench is the first benchmark for evaluating LLMs’ capabilities in scientific computing and visualization in astronomy, showing current models have significant limitations in assisting astronomy research.
Details
Motivation: To evaluate whether LLM-mediated scientific workflows can produce correct scientific insights through data processing and visualization, which hasn't been addressed in previous work.Method: Created AstroVisBench benchmark that evaluates LLMs’ ability to create astronomy-specific workflows and generate complex plots, using a novel LLM-as-a-judge workflow validated by professional astronomers.
Result: Evaluation of state-of-the-art language models reveals a significant gap in their ability to engage in astronomy research as useful assistants.
Conclusion: AstroVisBench provides a strong end-to-end evaluation framework for AI scientists and offers a path forward for developing visualization-based workflows across scientific domains.
Abstract: Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging to evaluate and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model’s ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.
[66] Accelerating Diffusion LLMs via Adaptive Parallel Decoding
Daniel Israel, Guy Van den Broeck, Aditya Grover
Main category: cs.CL
TL;DR: APD is a novel parallel decoding method that dynamically adjusts parallel token sampling in diffusion LLMs using a multiplicative mixture with a small autoregressive model, achieving higher throughput with minimal quality loss.
Details
Motivation: Autoregressive decoding bottlenecks LLM generation speed, while current diffusion LLMs struggle to match autoregressive speed without quality sacrifice. APD aims to enable efficient parallel token generation.Method: Adaptive Parallel Decoding (APD) uses a multiplicative mixture between dLLM marginal probabilities and joint probabilities from a small auxiliary autoregressive model, with KV caching and masked input optimization.
Result: APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks compared to standard approaches.
Conclusion: APD offers a flexible approach to tradeoff throughput and quality in diffusion LLMs through three tunable parameters, enabling efficient parallel decoding.
Abstract: The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly tradeoff throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
[67] Mathematics Isn’t Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations
Aditya Tomar, Nihar Ranjan Sahoo, Ashish Mittal, Rudra Murthy, Pushpak Bhattacharyya
Main category: cs.CL
TL;DR: LLMs perform worse on culturally adapted math problems compared to US-centric ones, but reasoning capabilities help mitigate this performance gap.
Details
Motivation: Existing math benchmarks like GSM8K are culturally biased toward Western norms, ignoring diverse cultural contexts in problem presentation.Method: Created culturally adapted GSM8K variants for Africa, India, China, Korea, and Japan using prompt-based transformations with manual verification, then evaluated 6 LLMs across 5 prompting strategies.
Result: Models consistently performed best on original US-centric dataset and worse on culturally adapted versions, though reasoning-capable models showed more resilience.
Conclusion: Cultural context in math problem presentation affects LLM performance, but strong reasoning abilities can help bridge cultural gaps in mathematical tasks.
Abstract: Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions Africa, India, China, Korea, and Japan using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks
[68] FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing
Shoutao Guo, Shaolei Zhang, Qingkai Fang, Zhengrui Ma, Min Zhang, Yang Feng
Main category: cs.CL
TL;DR: FastLongSpeech is a framework that enables Large Speech-Language Models (LSLMs) to efficiently process long-form speech without requiring dedicated long-speech training data, using iterative fusion and dynamic compression training.
Details
Motivation: Existing LSLMs focus on short-speech tasks or speech generation, but efficient processing of long-form speech remains challenging due to scarce training data and high computational costs of long sequences.Method: Uses iterative fusion strategy to compress long-speech sequences into manageable lengths, and dynamic compression training that exposes models to short-speech sequences at varying compression ratios to transfer capabilities to long-speech tasks.
Result: The method shows strong performance in both long-speech and short-speech tasks while significantly improving inference efficiency. A new benchmark called LongSpeech-Eval was developed to assess long-speech capabilities.
Conclusion: FastLongSpeech successfully extends LSLM capabilities for efficient long-speech processing without needing dedicated long-speech training data, addressing a critical gap in speech processing research.
Abstract: The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
[69] Multilingual Political Views of Large Language Models: Identification and Steering
Daniil Gurgurov, Katharina Trinley, Ivan Vykopal, Josef van Genabith, Simon Ostermann, Roberto Zamparelli
Main category: cs.CL
TL;DR: This paper analyzes political biases in large language models across multiple architectures, languages, and model sizes, finding consistent libertarian-left leanings and demonstrating that these biases can be actively manipulated.
Details
Motivation: There are concerns about LLMs' political influence, but existing studies have limited scope in terms of models, languages, and lack investigation into whether political biases can be controlled.Method: Evaluated 7 open-source LLMs across 14 languages using Political Compass Test with 11 paraphrases per statement, and tested manipulability using center-of-mass activation intervention.
Result: Larger models consistently shift toward libertarian-left positions with significant cross-language variations. The activation intervention successfully steers model responses toward alternative ideological positions across multiple languages.
Conclusion: Political biases in LLMs are systematic, vary by model size and language, and can be actively manipulated, highlighting the need for careful consideration of political alignment in LLM deployment.
Abstract: Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases–frequently skewing toward liberal or progressive positions–key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at https://github.com/d-gurgurov/Political-Ideologies-LLMs.
[70] Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges
Weiyuan Li, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, Deqing Yang
Main category: cs.CL
TL;DR: This paper introduces ComplexEval, a benchmark to systematically study Auxiliary Information Induced Biases in LLM-as-judge evaluations, revealing significant bias susceptibility across models that increases with task complexity.
Details
Motivation: As LLMs handle more complex tasks, reliable evaluation becomes challenging. The LLM-as-judge paradigm is scalable but its reliability in complex tasks with nuanced criteria remains understudied.Method: Constructed ComplexEval benchmark to systematically expose and quantify 6 previously unexplored biases across 12 basic and 3 advanced scenarios, investigating bias susceptibility in various models.
Result: All evaluated models showed significant susceptibility to biases, with bias magnitude scaling with task complexity. Large Reasoning Models (LRMs) paradoxically showed high vulnerability despite their reasoning capabilities.
Conclusion: The analysis provides crucial insights for improving evaluation accuracy and verifiability, paving the way for more general and robust evaluation models.
Abstract: As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks–where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical–remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.
[71] Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
Md. Faiyaz Abdullah Sayeedi, Md. Mahbub Alam, Subhey Sadi Rahman, Md. Adnanul Islam, Jannatul Ferdous Deepti, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda
Main category: cs.CL
TL;DR: Translation Tangles is a framework and dataset for evaluating translation quality and fairness in LLMs across 24 language pairs, addressing performance gaps and bias amplification issues.
Details
Motivation: LLMs show uneven performance across languages and domains, and can amplify biases from training data, especially in low-resource languages, raising fairness concerns.Method: Benchmarking 24 bidirectional language pairs across domains using multiple metrics, plus a hybrid bias detection pipeline combining rule-based heuristics, semantic similarity filtering, and LLM-based validation.
Result: Created a high-quality, bias-annotated dataset with 1,439 human-evaluated translation-reference pairs, with code and dataset publicly available on GitHub.
Conclusion: Translation Tangles provides a unified framework to systematically evaluate and address translation quality and fairness issues in LLMs, particularly for low-resource languages.
Abstract: The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles
[72] Prompt-MII: Meta-Learning Instruction Induction for LLMs
Emily Xiao, Yixiao Zeng, Ada Chen, Chin-Jou Li, Amanda Bertsch, Graham Neubig
Main category: cs.CL
TL;DR: PROMPT-MII is a reinforcement learning framework that generates compact instructions for new datasets, achieving comparable performance to in-context learning while using 3-13x fewer tokens.
Details
Motivation: In-context learning (ICL) is effective for adapting LLMs to new tasks but incurs high inference costs as context length grows.Method: PROMPT-MII uses reinforcement learning to meta-learn an instruction induction model that generates compact, descriptive prompts from training examples.
Result: PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.
Conclusion: The method successfully reduces inference costs while maintaining performance, making it a practical alternative to traditional in-context learning.
Abstract: A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.
[73] Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
Tiancheng Hu, Benjamin Minixhofer, Nigel Collier
Main category: cs.CL
TL;DR: Simple weight interpolation between pre- and post-alignment models can create Pareto-optimal models that improve accuracy while recovering calibration lost during alignment.
Details
Motivation: The alignment tax involves not just accuracy drops but also severe loss of calibration, making models overconfident and less reliable.Method: Post-hoc weight interpolation between model weights before and after alignment.
Result: Consistently reveals Pareto-optimal interpolations that improve accuracy beyond both parent models while substantially recovering calibration.
Conclusion: Simple model merging provides computationally efficient method for mitigating the full scope of alignment tax, yielding more capable and reliable models.
Abstract: The “alignment tax” of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model’s weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
[74] KAT-Coder Technical Report
Zizheng Zhan, Ken Deng, Jinghui Wang, Xiaojiang Zhang, Huaixi Tang, Minglei Zhang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, Wenhao Zhuang, Shaojie Wang, Shangpeng Yan, Kepeng Lei, Zongxian Feng, Huiming Wang, Zheng Lin, Mengtong Li, Mengfei Xie, Yinghan Cui, Xuxing Chen, Chao Wang, Weihao Li, Wenqiang Zhu, Jiarong Zhang, Jingxuan Xu, Songwei Yu, Yifan Yao, Xinping Lei, C. Zhang, Han Li, Junqi Xiong, Zuchen Gao, Dailin Li, Haimo Li, Jiaheng Liu, Yuqun Zhang, Junyi Peng, Haotian Zhang, Bin Chen
Main category: cs.CL
TL;DR: KAT-Coder is a large-scale agentic code model trained through a multi-stage curriculum to bridge the gap between static text-based training and dynamic real-world agentic execution in software development.
Details
Motivation: To address the challenge of bridging static text-based training with dynamic real-world agentic execution in coding workflows, enabling autonomous reasoning, planning, and action in interactive software development.Method: Multi-stage training curriculum: Mid-Term Training (enhances reasoning/planning with real software data), Supervised Fine-Tuning (million-sample dataset across 20 languages), Reinforcement Fine-Tuning (multi-ground-truth reward formulation), and Reinforcement-to-Deployment Adaptation (Error-Masked SFT and Tree-Structured Trajectory Training).
Result: KAT-Coder achieves robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. The 32B model KAT-Dev has been open-sourced.
Conclusion: The multi-stage curriculum enables effective agentic coding capabilities, successfully bridging the training-execution gap and providing a foundation for real-world deployment of intelligent coding agents.
Abstract: Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.
[75] DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi
Main category: cs.CL
TL;DR: DiffAdapt is a lightweight framework that improves LLM reasoning efficiency by classifying problem difficulty and selecting optimal inference strategies, reducing token usage by up to 22.4% while maintaining accuracy.
Details
Motivation: LLMs often generate unnecessarily long reasoning traces (overthinking) on easy problems, wasting computational resources despite high accuracy.Method: Analyzed token probability entropy patterns, then developed DiffAdapt - a small probe that classifies LLM hidden states into Easy/Normal/Hard categories and selects appropriate inference strategies (prompt, temperature, max tokens) per question.
Result: Achieved comparable or improved accuracy while reducing token usage by up to 22.4% across five models and eight benchmarks.
Conclusion: DiffAdapt provides a practical path toward compute-efficient reasoning without fine-tuning base LLMs, enabling efficient problem-solving through difficulty-adaptive inference strategies.
Abstract: Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22–25% entropy reduction from easy to medium difficulty regions, suggesting an {overthinking} phenomenon on easy instances. Building on these insights, we introduce \textbf{DiffAdapt}, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune base LLM but a small probe that classifies LLM’s final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
[76] E2Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
Qi Liu, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Jiaxin Mao
Main category: cs.CL
TL;DR: E2Rank is a unified framework that extends text embedding models to perform both retrieval and listwise reranking through continued training, achieving state-of-the-art performance with high efficiency.
Details
Motivation: Text embedding models are efficient for retrieval but have limited ranking fidelity compared to dedicated rerankers, especially LLM-based listwise rerankers that capture fine-grained interactions.Method: E2Rank uses cosine similarity between query and document embeddings as a unified ranking function, with listwise ranking prompts constructed from queries and candidate documents that act like pseudo-relevance feedback.
Result: Achieves SOTA results on BEIR reranking benchmark, competitive performance on BRIGHT benchmark with low latency, and improves embedding performance on MTEB benchmark.
Conclusion: A single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
Abstract: Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework E2Rank, means Efficient Embedding-based Ranking (also means Embedding-to-Rank), which extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, E2Rank achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
[77] From Memorization to Reasoning in the Spectrum of Loss Curvature
Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
Main category: cs.CL
TL;DR: The paper shows that memorization in transformers can be identified and removed using loss landscape curvature analysis, revealing that memorized data creates sharper curvature patterns in model weights.
Details
Motivation: To understand how memorization manifests in transformer models and develop methods to remove it while preserving general model capabilities.Method: Using loss landscape curvature decomposition to identify memorization patterns in weights, followed by weight editing procedures that suppress high-curvature components associated with memorized data.
Result: The editing procedure effectively reduces recitation of memorized data more than existing unlearning methods while maintaining lower perplexity. However, it negatively impacts specific tasks like fact retrieval and arithmetic that rely on specialized weight directions.
Conclusion: Memorization can be disentangled and removed from transformers using curvature analysis, revealing that certain tasks depend on specialized weight structures rather than general mechanisms, even when those tasks don’t involve memorized data.
Abstract: We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than non memorized, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses far more recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning is conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data’s activation strength with low curvature components that we edit out, and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
[78] SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff
Main category: cs.CL
TL;DR: SynthWorlds is a framework that disentangles reasoning complexity from factual knowledge by creating parallel corpora of real and synthetic worlds, enabling precise evaluation of language models’ reasoning abilities separate from memorization.
Details
Motivation: Current evaluation methods cannot cleanly separate language models' reasoning ability from their parametric world knowledge, as benchmark performance often reflects factual recall rather than genuine reasoning.Method: Construct parallel corpora with identical interconnected structure: real-mapped world (where models can use parametric knowledge) and synthetic-mapped world (where such knowledge is meaningless). Design mirrored tasks like multi-hop QA and page navigation that maintain equal reasoning difficulty across both worlds.
Result: Experiments reveal a persistent knowledge advantage gap - the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap.
Conclusion: SynthWorlds provides a controlled environment for precise evaluation of reasoning vs memorization in language models, enabling testable comparisons that were previously challenging.
Abstract: Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
[79] DiagramEval: Evaluating LLM-Generated Diagrams via Graphs
Chumeng Liang, Jiaxuan You
Main category: cs.CL
TL;DR: DiagramEval is a novel evaluation metric that assesses LLM-generated demonstration diagrams by treating them as graphs with text elements as nodes and connections as edges, using node alignment and path alignment metrics.
Details
Motivation: Standard image generative models struggle to produce clear diagrams with well-defined structure, and there's a lack of discriminative and explainable metrics for evaluating LLM-generated diagrams.Method: DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using node alignment and path alignment metrics.
Result: The method effectively evaluates diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of the metrics and providing enhanced explainability.
Conclusion: DiagramEval offers a novel approach to evaluating LLM-generated diagrams with improved discriminative power and explainability, providing valuable insights into the characteristics of such diagrams.
Abstract: Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.
[80] NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium
Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li
Main category: cs.CL
TL;DR: The paper presents optimized matrix multiplication techniques for LLM inference on AWS Trainium AI accelerators, achieving significant performance improvements over AWS’s implementation.
Details
Motivation: Trainium AI accelerators offer cost-effective solutions for LLM workloads but present challenges due to their systolic array architecture and data layout requirements, making high-performance optimization difficult.Method: Designed high-performance matrix multiplication using kernel fusion and novel caching strategies to reduce data movement, maximize SRAM bandwidth, and avoid expensive matrix transpose operations.
Result: Achieved average 1.35x speedup (up to 2.22x) at kernel level and average 1.66x speedup (up to 2.49x) for end-to-end LLM inference across nine datasets and four recent LLMs.
Conclusion: The proposed techniques successfully overcome Trainium’s architectural challenges and significantly improve LLM inference performance compared to state-of-the-art implementations.
Abstract: AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
[81] The End of Manual Decoding: Towards Truly End-to-End Language Models
Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang
Main category: cs.CL
TL;DR: AutoDeco enables truly end-to-end LLM generation by learning to control its own decoding strategy through lightweight heads that dynamically predict temperature and top-p values alongside token generation.
Details
Motivation: Current LLMs rely on non-differentiable decoding processes requiring manual hyperparameter tuning, which breaks the 'end-to-end' promise and limits adaptability.Method: Augment standard transformer with lightweight heads that dynamically predict context-specific temperature and top-p values at each step, transforming decoding into a parametric token-level process.
Result: AutoDeco significantly outperforms default decoding strategies and achieves performance comparable to oracle-tuned baselines across eight benchmarks, while enabling instruction-based decoding control.
Conclusion: AutoDeco opens a new paradigm for steerable and interactive LLM decoding by enabling models to self-regulate sampling strategies and respond to natural language commands about randomness levels.
Abstract: The “end-to-end” label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious, hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly “end-to-end” generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from “hacking the test set”-a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., “generate with low randomness”) and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
cs.CV
[82] Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
Fenfen Lin, Yesheng Liu, Haiyu Xu, Chen Yue, Zheqi He, Mingxuan Zhao, Miguel Hu Chen, Jiakang Liu, JG Yao, Xi Yang
Main category: cs.CV
TL;DR: MeasureBench is a benchmark for evaluating vision-language models’ ability to read measurement instruments, showing current VLMs struggle with fine-grained spatial grounding despite recognizing numbers.
Details
Motivation: Current vision-language models find it surprisingly challenging to read measurement instruments, which is effortless for humans and requires little domain expertise.Method: Created MeasureBench with real-world and synthesized measurement images, plus an extensible pipeline for procedural generation of gauges with controllable visual appearance.
Result: Even the strongest frontier VLMs struggle with measurement reading, particularly failing at indicator localization - they can read digits but misidentify pointer positions, leading to large numeric errors.
Conclusion: Current VLMs have fundamental limitations in fine-grained spatial grounding. The benchmark aims to help advance visually grounded numeracy and precise spatial perception in VLMs.
Abstract: Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs) as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to big numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on in-domain synthetic subset but less promising for real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
[83] PF-DAformer: Proximal Femur Segmentation via Domain Adaptive Transformer for Dual-Center QCT
Rochak Dhakal, Chen Zhao, Zixin Shi, Joyce H. Keyak, Tadashi S. Kaneko, Kuan-Jui Su, Hui Shen, Hong-Wen Deng, Weihua Zhou
Main category: cs.CV
TL;DR: A domain-adaptive transformer framework for multi-institutional QCT segmentation that addresses domain shift using adversarial and statistical alignment methods.
Details
Motivation: Overcome domain shift in automated QCT segmentation models to enable reliable multi-center osteoporosis research and reproducible radiomics analysis across different institutions with varying scanners and settings.Method: 3D TransUNet backbone with dual alignment strategies: adversarial alignment via Gradient Reversal Layer (GRL) to discourage site-specific encoding, and statistical alignment via Maximum Mean Discrepancy (MMD) to reduce distributional mismatches between institutions.
Result: Trained and validated on one of the largest hip fracture cohorts (1,024 QCT scans from Tulane University and 384 from Rochester, Minnesota) for proximal femur segmentation.
Conclusion: The proposed dual-alignment framework enables scanner-agnostic feature learning while preserving anatomical detail, addressing the critical challenge of domain shift in multi-institutional QCT analysis.
Abstract: Quantitative computed tomography (QCT) plays a crucial role in assessing bone strength and fracture risk by enabling volumetric analysis of bone density distribution in the proximal femur. However, deploying automated segmentation models in practice remains difficult because deep networks trained on one dataset often fail when applied to another. This failure stems from domain shift, where scanners, reconstruction settings, and patient demographics vary across institutions, leading to unstable predictions and unreliable quantitative metrics. Overcoming this barrier is essential for multi-center osteoporosis research and for ensuring that radiomics and structural finite element analysis results remain reproducible across sites. In this work, we developed a domain-adaptive transformer segmentation framework tailored for multi-institutional QCT. Our model is trained and validated on one of the largest hip fracture related research cohorts to date, comprising 1,024 QCT images scans from Tulane University and 384 scans from Rochester, Minnesota for proximal femur segmentation. To address domain shift, we integrate two complementary strategies within a 3D TransUNet backbone: adversarial alignment via Gradient Reversal Layer (GRL), which discourages the network from encoding site-specific cues, and statistical alignment via Maximum Mean Discrepancy (MMD), which explicitly reduces distributional mismatches between institutions. This dual mechanism balances invariance and fine-grained alignment, enabling scanner-agnostic feature learning while preserving anatomical detail.
[84] DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting
Moonsoo Jeong, Dongbeen Kim, Minseong Kim, Sungkil Lee
Main category: cs.CV
TL;DR: DC4GS introduces directional consistency into adaptive density control for 3D Gaussian splatting, reducing primitive count by up to 30% while improving reconstruction quality.
Details
Motivation: Conventional adaptive density control relies only on positional gradient magnitudes, which may lead to redundant primitive splitting and suboptimal alignment with local structures.Method: Incorporates directional consistency through angular coherence of gradients to better capture local structural complexities, and uses DC to define optimal split positions for sub-primitives.
Result: Reduces number of primitives by up to 30% compared to existing ADC methods while greatly enhancing reconstruction fidelity.
Conclusion: Directional consistency-driven adaptive density control effectively optimizes primitive placement and reduces redundancy in 3D Gaussian splatting.
Abstract: We present a Directional Consistency (DC)-driven Adaptive Density Control (ADC) for 3D Gaussian Splatting (DC4GS). Whereas the conventional ADC bases its primitive splitting on the magnitudes of positional gradients, we further incorporate the DC of the gradients into ADC, and realize it through the angular coherence of the gradients. Our DC better captures local structural complexities in ADC, avoiding redundant splitting. When splitting is required, we again utilize the DC to define optimal split positions so that sub-primitives best align with the local structures than the conventional random placement. As a consequence, our DC4GS greatly reduces the number of primitives (up to 30% in our experiments) than the existing ADC, and also enhances reconstruction fidelity greatly.
[85] Scale-Aware Curriculum Learning for Ddata-Efficient Lung Nodule Detection with YOLOv11
Yi Luo, Yike Guo, Hamed Hooshangnejad, Kai Ding
Main category: cs.CV
TL;DR: SACL is a dynamic curriculum learning method for lung nodule detection that adapts to data scale, showing significant improvements in data-limited scenarios without changing model architecture.
Details
Motivation: Existing deep learning approaches struggle in clinical settings with limited annotated data, and traditional static curriculum learning fails in data-scarce scenarios.Method: Proposes Scale Adaptive Curriculum Learning (SACL) with three mechanisms: adaptive epoch scheduling, hard sample injection, and scale-aware optimization, evaluated on LUNA25 dataset using YOLOv11.
Result: SACL achieves comparable performance to static curriculum learning on full dataset, but shows significant improvements of 4.6%, 3.5%, and 2.0% over baseline at 10%, 20%, and 50% of training data respectively.
Conclusion: SACL enables robust training across varying data scales without architectural modifications, providing a practical solution for healthcare institutions with limited annotation resources.
Abstract: Lung nodule detection in chest CT is crucial for early lung cancer diagnosis, yet existing deep learning approaches face challenges when deployed in clinical settings with limited annotated data. While curriculum learning has shown promise in improving model training, traditional static curriculum strategies fail in data-scarce scenarios. We propose Scale Adaptive Curriculum Learning (SACL), a novel training strategy that dynamically adjusts curriculum design based on available data scale. SACL introduces three key mechanisms:(1) adaptive epoch scheduling, (2) hard sample injection, and (3) scale-aware optimization. We evaluate SACL on the LUNA25 dataset using YOLOv11 as the base detector. Experimental results demonstrate that while SACL achieves comparable performance to static curriculum learning on the full dataset in mAP50, it shows significant advantages under data-limited conditions with 4.6%, 3.5%, and 2.0% improvements over baseline at 10%, 20%, and 50% of training data respectively. By enabling robust training across varying data scales without architectural modifications, SACL provides a practical solution for healthcare institutions to develop effective lung nodule detection systems despite limited annotation resources.
[86] HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition
Jiacheng Hong, Kunzhen Wu, Mingrui Yu, Yichao Gu, Shengze Xue, Shuangjiu Xiao, Deli Dong
Main category: cs.CV
TL;DR: HiGS is a hierarchical generative framework for 3D scene generation that enables multi-step associative semantic spatial composition, allowing users to iteratively expand scenes with fine-grained control while maintaining spatial and semantic consistency.
Details
Motivation: Existing single-step generation methods struggle to balance scene complexity with minimal user input. The approach is inspired by human cognitive processes in scene modeling that progress from global to local, focus on key elements, and complete scenes through semantic association.Method: Proposes HiGS framework with Progressive Hierarchical Spatial-Semantic Graph (PHiSSG) that dynamically organizes spatial relationships and semantic dependencies. It maintains one-to-one mapping between graph nodes and objects, supports recursive layout optimization, and enables iterative scene expansion through user-selected key semantic objects.
Result: HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference. It offers controllable and extensible paradigm for efficient 3D scene construction.
Conclusion: The hierarchical multi-step approach provides better balance between scene complexity and user control compared to single-step generation methods, making it suitable for applications in gaming, film, and virtual reality.
Abstract: Three-dimensional scene generation holds significant potential in gaming, film, and virtual reality. However, most existing methods adopt a single-step generation process, making it difficult to balance scene complexity with minimal user input. Inspired by the human cognitive process in scene modeling, which progresses from global to local, focuses on key elements, and completes the scene through semantic association, we propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest while the model completes peripheral areas automatically. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph (PHiSSG), which dynamically organizes spatial relationships and semantic dependencies across the evolving scene structure. PHiSSG ensures spatial and geometric consistency throughout the generation process by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization. Experiments demonstrate that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, offering a controllable and extensible paradigm for efficient 3D scene construction.
[87] SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions
Md. Mehedi Hassan, Shafqat Alam, Shahriar Ahmed Seam, Maruf Ahmed
Main category: cs.CV
TL;DR: Unified Multi-Stream SYNAPSE-Net is a novel adaptive framework for automated brain lesion segmentation that achieves state-of-the-art performance across multiple brain pathologies by integrating multi-stream CNN encoders, Swin Transformer bottleneck, dynamic cross-modal attention fusion, and hierarchical gated decoder.
Details
Motivation: Current deep learning models for brain lesion segmentation are specialized point solutions that lack generalization and have high performance variance, limiting their clinical reliability.Method: Hybrid architecture with multi-stream CNN encoders, Swin Transformer bottleneck, dynamic cross-modal attention fusion mechanism, and hierarchical gated decoder, trained with variance reduction strategy combining pathology-specific data augmentation and difficulty-aware sampling.
Result: Achieved state-of-the-art DSC of 0.831 (HD95: 3.03) on WMH dataset, best boundary accuracy on ISLES 2022 (HD95: 9.69), and highest DSC of 0.8651 for tumor core on BraTS 2020.
Conclusion: The unified adaptive framework provides a robust and clinically feasible solution for automated brain lesion segmentation across multiple pathologies.
Abstract: Automated segmentation of heterogeneous brain lesions from multi-modal MRI remains a critical challenge in clinical neuroimaging. Current deep learning models are typically specialized `point solutions’ that lack generalization and high performance variance, limiting their clinical reliability. To address these gaps, we propose the Unified Multi-Stream SYNAPSE-Net, an adaptive framework designed for both generalization and robustness. The framework is built on a novel hybrid architecture integrating multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. The architecture is trained with a variance reduction strategy that combines pathology specific data augmentation and difficulty-aware sampling method. The model was evaluated on three different challenging public datasets: the MICCAI 2017 WMH Challenge, the ISLES 2022 Challenge, and the BraTS 2020 Challenge. Our framework attained a state-of-the-art DSC value of 0.831 with the HD95 value of 3.03 in the WMH dataset. For ISLES 2022, it achieved the best boundary accuracy with a statistically significant difference (HD95 value of 9.69). For BraTS 2020, it reached the highest DSC value for the tumor core region (0.8651). These experimental findings suggest that our unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies, providing a robust and clinically feasible solution for automated segmentation. The source code and the pre-trained models are available at https://github.com/mubid-01/SYNAPSE-Net-pre.
[88] Referee: Reference-aware Audiovisual Deepfake Detection
Hyemin Boo, Eunsang Lee, Jiyoung Lee
Main category: cs.CV
TL;DR: Referee is a reference-aware audiovisual deepfake detection method that uses speaker-specific cues from one-shot examples to detect manipulations by matching identity-related queries across modalities.
Details
Motivation: Deepfakes from advanced generative models pose serious threats, and existing detection methods struggle to generalize to unseen forgeries, requiring more robust approaches.Method: Leverages speaker-specific cues from one-shot examples, matches and aligns identity-related queries from reference and target content into cross-modal features, and jointly reasons about audiovisual synchrony and identity consistency.
Result: Achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols on FakeAVCeleb, FaceForensics++, and KoDF datasets.
Conclusion: Cross-modal identity verification is crucial for future deepfake detection, and Referee demonstrates effective generalization capabilities.
Abstract: Since deepfakes generated by advanced generative models have rapidly posed serious threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.
[89] Semantic Frame Aggregation-based Transformer for Live Video Comment Generation
Anam Fatima, Yi Yu, Janak Kapuriya, Julien Lalanne, Jainendra Shukla
Main category: cs.CV
TL;DR: The paper introduces SFAT, a Semantic Frame Aggregation-based Transformer model for generating contextually appropriate live video comments by prioritizing relevant video frames and leveraging multimodal knowledge from both video content and viewer chats.
Details
Motivation: Existing approaches for live video comment generation overlook prioritizing video frames relevant to ongoing viewer interactions, which is crucial for producing contextually appropriate comments. Current datasets also have limitations in language diversity and video categories.Method: Proposed SFAT model uses CLIP’s visual-text multimodal knowledge, assigns weights to video frames based on semantic relevance to viewer conversations, employs weighted sum of frames to emphasize informative content, and uses a cross-attention comment decoder that attends to both chat and video modalities.
Result: Created a large-scale English video comments dataset from Twitch covering 11 categories (438 hours, 3.2M comments). SFAT model demonstrated effectiveness compared to existing methods for generating comments from live video and ongoing dialogue contexts.
Conclusion: The SFAT model successfully addresses the gap in prioritizing semantically relevant video frames for live comment generation and provides a comprehensive English dataset to advance research in this area.
Abstract: Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP’s visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
[90] MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation
Arghavan Rezvani, Xiangyi Yan, Anthony T. Wu, Kun Han, Pooya Khosravi, Xiaohui Xie
Main category: cs.CV
TL;DR: MoME adapts Mixture of Experts (MoE) from LLMs to medical vision-language tasks, using multi-scale visual features and textual embeddings for dynamic expert selection in medical image segmentation.
Details
Motivation: To leverage the successful MoE paradigm from LLMs for medical imaging tasks, integrating vision-language models to handle the intricacies of medical imagery with textual information.Method: Proposes MoME architecture that dynamically selects experts using multi-scale visual features enriched with textual embeddings, trained on 10 datasets with 3,410 CT scans.
Result: Demonstrates strong performance on comprehensive medical imaging segmentation benchmark with competitive precision across multiple datasets.
Conclusion: MoME presents a novel architecture that effectively integrates foundation models for medical imaging, achieving robust results in medical image analysis through the MoE paradigm.
Abstract: In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.
[91] Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning
Yana Wei, Zeen Chi, Chongyu Wang, Yu Wu, Shipeng Yan, Yongfei Liu, Xuming He
Main category: cs.CV
TL;DR: Proposes an exemplar-free incremental relation distillation (IRD) framework for incremental human-object interaction detection to handle dynamic open-world environments, addressing catastrophic forgetting, interaction drift, and zero-shot HOI detection.
Details
Motivation: Human-object interactions evolve continuously in open-world environments, challenging conventional closed-world HOI detection models. Inspired by humans' progressive knowledge acquisition, the authors explore incremental HOI detection to develop agents capable of discerning human-object relations in dynamic environments.Method: Proposes IRD framework that decouples learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. The method is exemplar-free.
Result: Extensive experiments on HICO-DET and V-COCO datasets demonstrate superiority over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalization on zero-shot HOIs.
Conclusion: The IRD framework effectively addresses the challenges of incremental HOI detection in dynamic environments, showing strong performance in handling catastrophic forgetting, interaction drift, and zero-shot HOI detection without requiring exemplars.
Abstract: In open-world environments, human-object interactions (HOIs) evolve continuously, challenging conventional closed-world HOI detection models. Inspired by humans’ ability to progressively acquire knowledge, we explore incremental HOI detection (IHOID) to develop agents capable of discerning human-object relations in such dynamic environments. This setup confronts not only the common issue of catastrophic forgetting in incremental learning but also distinct challenges posed by interaction drift and detecting zero-shot HOI combinations with sequentially arriving data. Therefore, we propose a novel exemplar-free incremental relation distillation (IRD) framework. IRD decouples the learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. Extensive experiments on HICO-DET and V-COCO datasets demonstrate the superiority of our method over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalization on zero-shot HOIs. Code is available at \href{https://github.com/weiyana/ContinualHOI}{this HTTP URL}
[92] VitalLens 2.0: High-Fidelity rPPG for Heart Rate Variability Estimation from Face Video
Philipp V. Rouast
Main category: cs.CV
TL;DR: VitalLens 2.0 is a new deep learning model that significantly improves remote photoplethysmography (rPPG) accuracy for estimating heart rate, respiratory rate, and heart rate variability metrics from face video.
Details
Motivation: To advance remote physiological monitoring by developing a more accurate and robust model for estimating multiple physiological signals from facial video analysis.Method: Combined a new model architecture with substantially larger and more diverse training data (1,413 unique individuals) and evaluated on a combined test set of 422 individuals from four datasets.
Result: Achieved state-of-the-art performance with MAE of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD, significantly outperforming previous methods.
Conclusion: VitalLens 2.0 represents a significant advancement in rPPG technology, enabling robust estimation of multiple physiological signals and is now available via API for developers.
Abstract: This report introduces VitalLens 2.0, a new deep learning model for estimating physiological signals from face video. This new model demonstrates a significant leap in accuracy for remote photoplethysmography (rPPG), enabling the robust estimation of not only heart rate (HR) and respiratory rate (RR) but also Heart Rate Variability (HRV) metrics. This advance is achieved through a combination of a new model architecture and a substantial increase in the size and diversity of our training data, now totaling 1,413 unique individuals. We evaluate VitalLens 2.0 on a new, combined test set of 422 unique individuals from four public and private datasets. When averaging results by individual, VitalLens 2.0 achieves a Mean Absolute Error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD. These results represent a new state-of-the-art, significantly outperforming previous methods. This model is now available to developers via the VitalLens API at https://rouast.com/api.
[93] AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception
Mario Camarena, Het Patel, Fatemeh Nazari, Evangelos Papalexakis, Mohamadhossein Noruzoliaee, Jia Chen
Main category: cs.CV
TL;DR: AD-SAM is a fine-tuned version of SAM specifically for autonomous driving semantic segmentation, featuring dual-encoder architecture and deformable decoder that outperforms existing models on road scene benchmarks.
Details
Motivation: To adapt the Segment Anything Model (SAM) for the spatial and geometric complexity of autonomous driving scenes, addressing limitations in handling road environments and improving segmentation accuracy.Method: Extends SAM with dual-encoder combining ViT-H global context and ResNet-50 local details, deformable fusion module for feature alignment, and progressive multi-stage refinement with deformable attention. Uses hybrid loss combining Focal, Dice, Lovasz-Softmax, and Surface losses.
Result: Achieves 68.1 mIoU on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G-SAM, and DeepLabV3 by up to +22.9 and +19.2 mIoU. Shows strong cross-domain generalization (0.87 retention score), faster convergence (30-40 epochs), and data efficiency (0.607 mIoU with only 1000 samples).
Conclusion: Targeted architectural and optimization enhancements to foundation models enable reliable and scalable autonomous driving perception with improved accuracy, generalization, and efficiency.
Abstract: This paper presents the Autonomous Driving Segment Anything Model (AD-SAM), a fine-tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD-SAM extends the Segment Anything Model (SAM) with a dual-encoder and deformable decoder tailored to spatial and geometric complexity of road scenes. The dual-encoder produces multi-scale fused representations by combining global semantic context from SAM’s pretrained Vision Transformer (ViT-H) with local spatial detail from a trainable convolutional deep learning backbone (i.e., ResNet-50). A deformable fusion module aligns heterogeneous features across scales and object geometries. The decoder performs progressive multi-stage refinement using deformable attention. Training is guided by a hybrid loss that integrates Focal, Dice, Lovasz-Softmax, and Surface losses, improving semantic class balance, boundary precision, and optimization stability. Experiments on the Cityscapes and Berkeley DeepDrive 100K (BDD100K) benchmarks show that AD-SAM surpasses SAM, Generalized SAM (G-SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy. It achieves 68.1 mean Intersection over Union (mIoU) on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G-SAM, and DeepLabV3 by margins of up to +22.9 and +19.2 mIoU in structured and diverse road scenes, respectively. AD-SAM demonstrates strong cross-domain generalization with a 0.87 retention score (vs. 0.76 for SAM), and faster, more stable learning dynamics, converging within 30-40 epochs, enjoying double the learning speed of benchmark models. It maintains 0.607 mIoU with only 1000 samples, suggesting data efficiency critical for reducing annotation costs. These results confirm that targeted architectural and optimization enhancements to foundation models enable reliable and scalable AD perception.
[94] Hierarchical Transformers for Unsupervised 3D Shape Abstraction
Aditya Vora, Lily Goli, Andrea Tagliasacchi, Hao Zhang
Main category: cs.CV
TL;DR: HiT is a hierarchical neural field representation that learns general 3D shape hierarchies in an unsupervised manner using a hierarchical transformer with compressed codebook.
Details
Motivation: To learn general hierarchical structures across different 3D shape categories without fixed hierarchical constraints, enabling automatic discovery of common substructures.Method: Uses hierarchical transformer (HiT) with compressed codebook to learn parent-child relationships in tree hierarchy, trained with reconstruction loss without fixed hierarchical structure constraints.
Result: Successfully captures meaningful containment relationships and demonstrates effectiveness through unsupervised shape segmentation across 55 ShapeNet categories with multiple granularity levels.
Conclusion: HiT provides a flexible approach for learning general hierarchical structures from 3D shape data, outperforming previous methods with fixed hierarchical constraints.
Abstract: We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is a hierarchical transformer (HiT), where each level learns parent-child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary), we impose no such restriction, except for limiting the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and representing more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness through an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.
[95] ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding
Haonan Wang, Jingyu Lu, Hongrui Li, Xiaomeng Li
Main category: cs.CV
TL;DR: ZEBRA is a zero-shot fMRI-to-image reconstruction framework that disentangles subject-related and semantic-related components in fMRI data, enabling generalization to unseen subjects without subject-specific training.
Details
Motivation: Current fMRI-to-image reconstruction methods require subject-specific models or fine-tuning, limiting scalability and real-world applicability. There's a need for methods that can generalize across subjects without additional data or retraining.Method: ZEBRA uses adversarial training to decompose fMRI representations into subject-related and semantic-related components, explicitly disentangling them to isolate subject-invariant, semantic-specific representations.
Result: ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully fine-tuned models on several metrics, demonstrating effective generalization to unseen subjects.
Conclusion: ZEBRA represents a scalable and practical step toward universal neural decoding by eliminating the need for subject-specific adaptation while maintaining reconstruction quality.
Abstract: Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: https://github.com/xmed-lab/ZEBRA.
[96] WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond
Zhicong Sun, Jacqueline Lo, Jinxing Hu
Main category: cs.CV
TL;DR: The paper introduces WildfireX-SLAM, a large-scale synthetic dataset for 3D Gaussian splatting-based SLAM in forest and wildfire environments, addressing the lack of real-world datasets.
Details
Motivation: Current 3DGS-based SLAM methods focus on small indoor scenes, but developing them for large-scale forest scenes has practical applications in wildfire response and forest management. However, collecting real-world datasets is costly and infeasible.Method: Built a pipeline using Unreal Engine 5 to collect synthetic aerial and ground views with ground-truth camera poses and multiple data modalities from UAVs. The pipeline allows flexible control over environmental factors like lighting, weather, and wildfire conditions.
Result: Created WildfireX-SLAM dataset with 5.5k low-altitude RGB-D aerial images covering 16 km² of forest. Conducted benchmark analysis revealing unique challenges for 3DGS-based SLAM in forest environments.
Conclusion: The dataset enables research on 3DGS-based SLAM for forest and wildfire scenarios, highlighting challenges and potential improvements for future work. The dataset and code will be publicly available.
Abstract: 3D Gaussian splatting (3DGS) and its subsequent variants have led to remarkable progress in simultaneous localization and mapping (SLAM). While most recent 3DGS-based SLAM works focus on small-scale indoor scenes, developing 3DGS-based SLAM methods for large-scale forest scenes holds great potential for many real-world applications, especially for wildfire emergency response and forest management. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, and collecting such a dataset over real-world scenes is costly and technically infeasible. To this end, we have built a large-scale, comprehensive, and high-quality synthetic dataset for SLAM in wildfire and forest environments. Leveraging the Unreal Engine 5 Electric Dreams Environment Sample Project, we developed a pipeline to easily collect aerial and ground views, including ground-truth camera poses and a range of additional data modalities from unmanned aerial vehicle. Our pipeline also provides flexible controls on environmental factors such as light, weather, and types and conditions of wildfire, supporting the need for various tasks covering forest mapping, wildfire emergency response, and beyond. The resulting pilot dataset, WildfireX-SLAM, contains 5.5k low-altitude RGB-D aerial images from a large-scale forest map with a total size of 16 km2. On top of WildfireX-SLAM, a thorough benchmark is also conducted, which not only reveals the unique challenges of 3DGS-based SLAM in the forest but also highlights potential improvements for future works. The dataset and code will be publicly available. Project page: https://zhicongsun.github.io/wildfirexslam.
[97] E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources
Tong Shen, Jingai Yu, Dong Zhou, Dong Li, Emad Barsoum
Main category: cs.CV
TL;DR: E-MMDiT is an efficient multimodal diffusion transformer with only 304M parameters that achieves competitive image generation performance while requiring minimal training resources - 25M public data in 1.5 days on 8 GPUs.
Details
Motivation: Address the limitations of existing diffusion models that require large-scale training data, significant computational resources, or suffer from high latency and heavy structure.Method: Proposes token reduction strategy using compressive visual tokenizer, multi-path compression module, Position Reinforcement for spatial coherence, Alternating Subregion Attention (ASA) for computational efficiency, and AdaLN-affine for efficient modulation.
Result: Achieves 0.66 on GenEval for 512px generation, reaching 0.72 with post-training techniques like GRPO, while being trained with minimal resources.
Conclusion: E-MMDiT serves as a strong and practical baseline for future research and contributes to democratizing generative AI models by making them more accessible and resource-efficient.
Abstract: Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy structure with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches to 0.72 with some post-training techniques such as GRPO. Our design philosophy centers on token reduction as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to democratization of generative AI models.
[98] Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features
Xingtao Ling Yingying Zhu
Main category: cs.CV
TL;DR: The paper proposes a Cross-view and Cross-attention Module (CVCAM) and Multi-head Spatial Attention Module (MHSAM) to improve cross-view object geo-localization by enabling better information transfer between views and refining spatial relationship features, while also introducing a new G2D dataset for Ground-to-Drone localization.
Details
Motivation: Existing methods for cross-view object geo-localization fail to effectively transfer information between views and don't refine spatial relationship feature maps, causing the model to focus on irrelevant edge noise and affecting localization performance.Method: Proposes CVCAM for multiple iterations of interaction between views to exchange contextual information and suppress edge noise, and MHSAM using convolutional kernels of various sizes to extract multi-scale spatial features. Also creates a new G2D dataset for Ground-to-Drone localization.
Result: Extensive experiments on CVOGL and G2D datasets show the proposed method achieves high localization accuracy and surpasses current state-of-the-art methods.
Conclusion: The proposed CVCAM and MHSAM modules effectively address limitations in cross-view information transfer and spatial feature refinement, significantly improving cross-view object geo-localization performance, while the new G2D dataset fills an important gap in Ground-to-Drone localization tasks.
Abstract: Cross-view object geo-localization has recently gained attention due to potential applications. Existing methods aim to capture spatial dependencies of query objects between different views through attention mechanisms to obtain spatial relationship feature maps, which are then used to predict object locations. Although promising, these approaches fail to effectively transfer information between views and do not further refine the spatial relationship feature maps. This results in the model erroneously focusing on irrelevant edge noise, thereby affecting localization performance. To address these limitations, we introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper understanding of cross-view relationships while suppressing the edge noise unrelated to the query object. Furthermore, we integrate a Multi-head Spatial Attention Module (MHSAM), which employs convolutional kernels of various sizes to extract multi-scale spatial features from the feature maps containing implicit correspondences, further enhancing the feature representation of the query object. Additionally, given the scarcity of datasets for cross-view object geo-localization, we created a new dataset called G2D for the “Ground-to-Drone” localization task, enriching existing datasets and filling the gap in “Ground-to-Drone” localization task. Extensive experiments on the CVOGL and G2D datasets demonstrate that our proposed method achieves high localization accuracy, surpassing the current state-of-the-art.
[99] AFM-Net: Advanced Fusing Hierarchical CNN Visual Priors with Global Sequence Modeling for Remote Sensing Image Scene Classification
Yuanhao Tang, Xuechao Zou, Zhengpei Hu, Junliang Xing, Chengkun Zhang, Jianqiang Huang
Main category: cs.CV
TL;DR: AFM-Net is a novel framework that combines CNN and Mamba branches for remote sensing image classification, achieving state-of-the-art performance through hierarchical fusion and Mixture-of-Experts classification.
Details
Motivation: Remote sensing image classification is challenging due to complex spatial structures and multi-scale characteristics. Existing methods struggle to efficiently integrate CNNs (good for local textures) and Transformers (good for global context) due to high computational costs.Method: Proposes AFM-Net with two pathways: CNN branch for hierarchical visual priors and Mamba branch for efficient global sequence modeling. Uses Hierarchical Fusion Mechanism for progressive multi-scale feature aggregation and Mixture-of-Experts classifier for adaptive routing.
Result: Achieves 93.72% on AID, 95.54% on NWPU-RESISC45, and 96.92% on UC Merced datasets, surpassing state-of-the-art methods with balanced performance and efficiency.
Conclusion: AFM-Net effectively addresses the computational bottleneck of Transformer-based methods while maintaining strong performance through efficient CNN-Mamba integration and hierarchical feature fusion.
Abstract: Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Existing approaches see CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, efficiently integrating them remains a bottleneck due to the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72, 95.54, and 96.92 percent accuracy, surpassing state-of-the-art methods with balanced performance and efficiency. Code is available at https://github.com/tangyuanhao-qhu/AFM-Net.
[100] Panoramic Out-of-Distribution Segmentation for Autonomous Driving
Mengfei Duan, Yuheng Zhang, Yihong Cao, Fei Teng, Kai Luo, Jiaming Zhang, Kailun Yang, Zhiyong Li
Main category: cs.CV
TL;DR: The paper introduces Panoramic Out-of-distribution Segmentation (PanOoS) to address the limitations of current panoramic semantic segmentation methods in identifying outliers, and proposes POS, the first solution using text-guided prompt distribution learning.
Details
Motivation: Current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation models perform poorly in panoramic domain due to background clutter and pixel distortions, limiting comprehensive and safe scene understanding.Method: Proposes POS with text-guided prompt distribution learning, including Prompt-based Restoration Attention for semantic decoding optimization and Bilevel Prompt Distribution Learning for refining mask embeddings via semantic prototype supervision. Also introduces disentanglement strategy to leverage CLIP’s cross-domain generalization.
Result: POS achieves superior performance with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS benchmark, outperforming state-of-the-art pinhole-OoS methods. Also demonstrates leading closed-set segmentation capabilities.
Conclusion: POS effectively addresses panoramic out-of-distribution segmentation challenges and advances panoramic understanding development. The method establishes new benchmarks (DenseOoS and QuadOoS) to compensate for dataset scarcity in this domain.
Abstract: Panoramic imaging enables capturing 360{\deg} images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception, which is critical to applications, such as autonomous driving and augmented reality, etc. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene understanding. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities and advances the development of panoramic understanding. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
[101] How Close Are We? Limitations and Progress of AI Models in Banff Lesion Scoring
Yanfan Zhu, Juming Xiong, Ruining Deng, Yu Wang, Yaohong Wang, Shilin Zhao, Mengmeng Yin, Yuqing Liu, Haichun Yang, Yuankai Huo
Main category: cs.CV
TL;DR: This study explores using existing deep learning models to approximate Banff lesion scores for renal transplant biopsies through a modular, rule-based framework, but reveals significant limitations in replicating expert-level grading.
Details
Motivation: The Banff Classification for renal transplant biopsies has challenges including semi-quantitative nature, complex criteria, and inter-observer variability, making computational replication difficult.Method: Decompose Banff indicators into structural and inflammatory components, use existing segmentation and detection tools, map outputs to Banff scores using heuristic rules aligned with expert guidelines, and evaluate against expert-annotated ground truths.
Result: Partial successes but critical failure modes including structural omission, hallucination, and detection ambiguity. Even when final scores match expert annotations, inconsistencies in intermediate representations undermine interpretability.
Conclusion: Current AI pipelines have limitations in replicating computational expert-level grading, emphasizing the need for modular evaluation and computational Banff grading standards to guide future model development.
Abstract: The Banff Classification provides the global standard for evaluating renal transplant biopsies, yet its semi-quantitative nature, complex criteria, and inter-observer variability present significant challenges for computational replication. In this study, we explore the feasibility of approximating Banff lesion scores using existing deep learning models through a modular, rule-based framework. We decompose each Banff indicator - such as glomerulitis (g), peritubular capillaritis (ptc), and intimal arteritis (v) - into its constituent structural and inflammatory components, and assess whether current segmentation and detection tools can support their computation. Model outputs are mapped to Banff scores using heuristic rules aligned with expert guidelines, and evaluated against expert-annotated ground truths. Our findings highlight both partial successes and critical failure modes, including structural omission, hallucination, and detection ambiguity. Even when final scores match expert annotations, inconsistencies in intermediate representations often undermine interpretability. These results reveal the limitations of current AI pipelines in replicating computational expert-level grading, and emphasize the importance of modular evaluation and computational Banff grading standard in guiding future model development for transplant pathology.
[102] Generating Accurate and Detailed Captions for High-Resolution Images
Hankyeol Lee, Gawon Seo, Kyounggyu Lee, Dogun Kim, Kyungwoo Song, Jiyoung Jung
Main category: cs.CV
TL;DR: A pipeline combining VLMs, LLMs, and object detection to enhance high-resolution image captioning by identifying missing objects through co-occurrence prediction and region-specific captioning.
Details
Motivation: VLMs struggle with high-resolution images due to pre-training on low-resolution inputs, leading to loss of visual details and omission of important objects in generated captions.Method: Multi-stage pipeline: 1) Generate initial caption with VLM, 2) LLM identifies key objects and predicts co-occurring objects, 3) Object detection verifies predictions, 4) Region-specific captioning for newly detected objects, 5) Remove references to undetected objects to reduce hallucinations.
Result: Experiments on high-resolution images show the pipeline produces more detailed and reliable captions while effectively minimizing hallucinations, as validated by pairwise comparison and quantitative scoring.
Conclusion: The proposed pipeline successfully addresses VLM limitations for high-resolution images by integrating multiple AI components to enrich caption detail and reduce hallucinations through systematic object verification and region-specific captioning.
Abstract: Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
[103] M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar
Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang
Main category: cs.CV
TL;DR: M^3Detection is a multi-frame 3D object detection framework that fuses camera and 4D imaging radar data through multi-level feature aggregation and spatiotemporal reasoning to overcome limitations of single-frame fusion.
Details
Motivation: Single-frame camera-radar fusion methods capture only partial scene information, which is further limited by image degradation and radar sparsity. Multi-frame fusion offers richer spatiotemporal information but faces challenges in robust cross-modal feature fusion and computational efficiency.Method: The framework leverages intermediate features from baseline detectors and tracker-generated reference trajectories. It includes global-level inter-object feature aggregation guided by radar, local-level inter-grid feature aggregation along trajectories, and trajectory-level multi-frame spatiotemporal reasoning for cross-frame interactions.
Result: Extensive experiments on VoD and TJ4DRadSet datasets demonstrate state-of-the-art 3D detection performance, validating the effectiveness of multi-frame detection with camera-4D imaging radar fusion.
Conclusion: M^3Detection successfully addresses the challenges of multi-frame multi-modal fusion, achieving superior 3D object detection performance by effectively combining complementary camera and 4D radar information through hierarchical feature aggregation and temporal reasoning.
Abstract: Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing the these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs the tracker to produce reference trajectories, improving computational efficiency and providing richer information for second-stage. In the second stage, we design a global-level inter-object feature aggregation module guided by radar information to align global features across candidate proposals and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.
[104] DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model
Yucheng Xing, Jinxing Yin, Xiaodong Liu
Main category: cs.CV
TL;DR: DANCER is a novel framework for realistic single-person dance synthesis using stable video diffusion models, featuring Appearance Enhancement and Pose Rendering modules to improve reference image details and motion guidance.
Details
Motivation: Video generation of human dancing is challenging due to high degrees of freedom in human motions and requirements for both quality and continuity. Existing methods struggle with realistic human-involved content generation.Method: Proposes DANCER framework with two key modules: Appearance Enhancement Module (AEM) to focus on reference image details, and Pose Rendering Module (PRM) to extend motion guidance from extra domains. Uses stable video diffusion model as base and collects TikTok-3K dataset for enhanced training.
Result: The model shows superior performance compared to state-of-the-art methods in extensive experiments on real-world datasets, demonstrating effectiveness in realistic dance video generation.
Conclusion: DANCER successfully addresses the challenges of human dance video generation through specialized modules for appearance enhancement and pose rendering, achieving state-of-the-art results with the help of a novel training dataset.
Abstract: Recently, diffusion models have shown their impressive ability in visual generation tasks. Besides static images, more and more research attentions have been drawn to the generation of realistic videos. The video generation not only has a higher requirement for the quality, but also brings a challenge in ensuring the video continuity. Among all the video generation tasks, human-involved contents, such as human dancing, are even more difficult to generate due to the high degrees of freedom associated with human motions. In this paper, we propose a novel framework, named as DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), for realistic single-person dance synthesis based on the most recent stable video diffusion model. As the video generation is generally guided by a reference image and a video sequence, we introduce two important modules into our framework to fully benefit from the two inputs. More specifically, we design an Appearance Enhancement Module (AEM) to focus more on the details of the reference image during the generation, and extend the motion guidance through a Pose Rendering Module (PRM) to capture pose conditions from extra domains. To further improve the generation capability of our model, we also collect a large amount of video data from Internet, and generate a novel datasetTikTok-3K to enhance the model training. The effectiveness of the proposed model has been evaluated through extensive experiments on real-world datasets, where the performance of our model is superior to that of the state-of-the-art methods. All the data and codes will be released upon acceptance.
[105] H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models
Mingyu Sung, Il-Min Kim, Sangseok Yun, Jae-Mo Kang
Main category: cs.CV
TL;DR: H2-Cache is a hierarchical caching mechanism that accelerates diffusion model inference by separating denoising into structure-defining and detail-refining stages, achieving up to 5.08x speedup while maintaining image quality.
Details
Motivation: Diffusion models face practical deployment challenges due to high computational costs of iterative denoising. Existing caching techniques create speed-quality trade-offs with quality degradation and computational overhead.Method: Uses hierarchical dual-stage caching with independent thresholds for structure and detail stages. Introduces pooled feature summarization (PFS) for efficient similarity estimation. Based on key insight that denoising can be functionally separated.
Result: Achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to baseline. Outperforms existing caching methods both quantitatively and qualitatively on Flux architecture.
Conclusion: H2-Cache effectively resolves the speed-quality dilemma in diffusion models, significantly lowering barriers for real-world application of high-fidelity diffusion models.
Abstract: Diffusion models have emerged as state-of-the-art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at https://github.com/Bluear7878/H2-cache-A-Hierarchical-Dual-Stage-Cache.
[106] SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles
Guanchong Huang, Song Fang
Main category: cs.CV
TL;DR: SilhouetteTell is a novel video identification attack that uses subtitle silhouettes’ spatiotemporal features to identify videos from up to 40 meters away, posing significant privacy threats.
Details
Motivation: Video identification attacks can reveal sensitive personal information about viewers' hobbies, beliefs, and preferences, potentially leading to profiling, discrimination, or blackmail.Method: Analyzes subtitle silhouettes displayed on screen, combining spatial and temporal information to create spatiotemporal features that correlate with subtitle files, working for both online and offline videos.
Result: Comprehensive experiments on smartphones confirm high efficacy in inferring video titles and clips under various settings, including from distances up to 40 meters.
Conclusion: SilhouetteTell demonstrates a powerful new attack vector that can identify videos through subtitle analysis, highlighting serious privacy vulnerabilities in current video streaming systems.
Abstract: Video identification attacks pose a significant privacy threat that can reveal videos that victims watch, which may disclose their hobbies, religious beliefs, political leanings, sexual orientation, and health status. Also, video watching history can be used for user profiling or advertising and may result in cyberbullying, discrimination, or blackmail. Existing extensive video inference techniques usually depend on analyzing network traffic generated by streaming online videos. In this work, we observe that the content of a subtitle determines its silhouette displayed on the screen, and identifying each subtitle silhouette also derives the temporal difference between two consecutive subtitles. We then propose SilhouetteTell, a novel video identification attack that combines the spatial and time domain information into a spatiotemporal feature of subtitle silhouettes. SilhouetteTell explores the spatiotemporal correlation between recorded subtitle silhouettes of a video and its subtitle file. It can infer both online and offline videos. Comprehensive experiments on off-the-shelf smartphones confirm the high efficacy of SilhouetteTell for inferring video titles and clips under various settings, including from a distance of up to 40 meters.
[107] Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Guozheng Zheng, Jian Guan, Mingjie Xie, Xuanjia Zhao, Congyi Fan, Shiheng Zhang, Pengming Feng
Main category: cs.CV
TL;DR: Proposes DPHR, a dual-level progressive hardness-aware reweighting strategy for cross-view geo-localization between drone and satellite images to address viewpoint gaps and hard negatives.
Details
Motivation: Existing mining or reweighting strategies use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence.Method: DPHR includes: 1) Sample-level Ratio-based Difficulty-Aware (RDA) module that evaluates relative difficulty and assigns fine-grained weights to negatives; 2) Batch-level Progressive Adaptive Loss Weighting (PALW) mechanism that uses training-progress signal to attenuate noisy gradients early and progressively enhance hard-negative mining.
Result: Experiments on University-1652 and SUES-200 benchmarks demonstrate effectiveness and robustness, achieving consistent improvements over state-of-the-art methods.
Conclusion: DPHR strategy effectively addresses challenges in cross-view geo-localization by providing adaptive, progressive handling of hard negatives throughout training.
Abstract: Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
[108] Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications
Zixuan Hu, Yongxian Wei, Li Shen, Zhenyi Wang, Lei Li, Chun Yuan, Dacheng Tao
Main category: cs.CV
TL;DR: Proposes a sparse model inversion method that selectively inverts semantic foregrounds to accelerate existing dense inversion methods for Vision Transformers, achieving up to 3.79× speedup while maintaining performance.
Details
Motivation: Existing dense inversion methods are inefficient for high-resolution images from Vision Transformers due to redundant inversion of noisy backgrounds and unintended inversion of spurious correlations (hallucination).Method: A plug-and-play sparse model inversion strategy that selectively inverts semantic foregrounds while stopping inversion of noisy backgrounds and spurious correlations, without modifying original loss functions.
Result: Achieves significant inversion acceleration (up to 3.79× faster) while maintaining comparable or enhanced downstream performance in data-free model quantization and knowledge transfer.
Conclusion: The proposed sparse model inversion effectively addresses inefficiency in existing methods by focusing on semantic foregrounds, providing a practical solution for accelerating model inversion tasks.
Abstract: Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations–a phenomenon we term “hallucination” in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79 faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.
[109] Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Yoichi Sato
Main category: cs.CV
TL;DR: This paper introduces MIVA, a new multimodal deception detection task using Werewolf game data, and finds that even state-of-the-art MLLMs like GPT-4o struggle with reliable truth detection.
Details
Motivation: As AI systems become more integrated into human lives, robust social intelligence including deception detection is crucial. Current MLLMs have impressive multimodal capabilities but their performance in detecting deception remains unquantified.Method: Created a new multimodal dataset from Werewolf game with synchronized video, text, and ground-truth labels. Established benchmark evaluating state-of-the-art MLLMs on the MIVA task.
Result: Significant performance gap found - even powerful models like GPT-4o struggle to reliably distinguish truth from falsehood. Models fail to ground language in visual social cues effectively and may be overly conservative.
Conclusion: There is an urgent need for novel approaches to build more perceptive and trustworthy AI systems that can effectively detect deception in multimodal interactions.
Abstract: As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
[110] Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks
Jiaxin Zhang, Zehong Zhu, Junye Deng, Yunqin Li, and Bowen Wang
Main category: cs.CV
TL;DR: Proposes a Hierarchical Graph Neural Network (HGNN) model using multi-source data to analyze village spatial morphology, achieving improved performance in multimodal fusion and classification tasks.
Details
Motivation: Village areas are important for human-land relationship studies, but urbanization is causing spatial characteristic disappearance and landscape homogenization. Existing research has limitations with single-disciplinary perspectives, qualitative methods, lack of digital infrastructure, and insufficient data.Method: Developed a HGNN model with input nodes and communication nodes, static input edges and dynamic communication edges. Combines Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) with two-stage feature update mechanism. Introduces relational pooling and joint training strategy across 17 village spatial morphology subtypes.
Result: Achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. Joint optimization lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, with 6% gain for parcel tasks.
Conclusion: The method provides scientific evidence for exploring village spatial patterns and generative logic, addressing current research limitations in village spatial morphology analysis.
Abstract: Villages areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze villages spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address the current research limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of villages spatial morphology. The framework includes two types of nodes-input nodes and communication nodes-and two types of edges-static input edges and dynamic communication edges. By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying villages spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. Additionally, the proposed joint optimization of all sub-types lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. Our method provides scientific evidence for exploring villages spatial patterns and generative logic.
[111] Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness
Ren Tasai, Guang Li, Ren Togo, Takahiro Ogawa, Kenji Hirata, Minghui Tang, Takaaki Yoshimura, Hiroyuki Sugimori, Noriko Nishioka, Yukie Shimizu, Kohsuke Kudo, Miki Haseyama
Main category: cs.CV
TL;DR: A novel continual self-supervised learning framework for chest CT images that addresses domain shifts from different window settings while ensuring data privacy through latent replay and feature distillation techniques.
Details
Motivation: To overcome challenges in medical image diagnosis including scarcity of large annotated datasets, domain shifts from different CT window settings, and privacy constraints that prevent data reuse.Method: Continual pretraining on unlabeled images with latent replay mechanism to prevent catastrophic forgetting, plus feature distillation combining Wasserstein distance-based knowledge distillation and batch-knowledge ensemble.
Result: Demonstrated superior performance on chest CT images across two different window settings compared to other approaches.
Conclusion: The proposed framework effectively learns diverse features from multi-window chest CT images while maintaining data privacy and robustness to domain shifts.
Abstract: We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues, such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.
[112] SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping
Renjie Ji, Xue Wang, Chao Niu, Wen Zhang, Yong Mei, Kun Tan
Main category: cs.CV
TL;DR: SpecAware is a hyperspectral foundation model that uses sensor meta-attributes and spectral content to enable unified multi-sensor learning for HSI mapping, achieving superior performance across various downstream tasks.
Details
Motivation: Current HSI foundation models overlook sensor meta-attributes and struggle with multi-sensor training, limiting their transferability across different hyperspectral sensors and applications.Method: Two-step hypernetwork-driven encoding: 1) Meta-content aware module fuses sensor meta-attributes with image content, 2) HyperEmbedding module uses sample-conditioned hypernetwork to generate matrix factors for adaptive spatial-spectral feature processing.
Result: Extensive experiments on six datasets show SpecAware learns superior feature representations and excels in land-cover semantic segmentation, change detection, and scene classification tasks.
Conclusion: SpecAware successfully establishes a unified framework for joint pre-training across diverse HSI sensors by adaptively processing variable spectral channels and interpreting spatial-spectral features.
Abstract: Hyperspectral imaging (HSI) is a vital tool for fine-grained land-use and land-cover (LULC) mapping. However, the inherent heterogeneity of HSI data has long posed a major barrier to developing generalized models via joint training. Although HSI foundation models have shown promise for different downstream tasks, the existing approaches typically overlook the critical guiding role of sensor meta-attributes, and struggle with multi-sensor training, limiting their transferability. To address these challenges, we propose SpecAware, which is a novel hyperspectral spectral-content aware foundation model for unifying multi-sensor learning for HSI mapping. We also constructed the Hyper-400K dataset to facilitate this research, which is a new large-scale, high-quality benchmark dataset with over 400k image patches from diverse airborne AVIRIS sensors. The core of SpecAware is a two-step hypernetwork-driven encoding process for HSI data. Firstly, we designed a meta-content aware module to generate a unique conditional input for each HSI patch, tailored to each spectral band of every sample by fusing the sensor meta-attributes and its own image content. Secondly, we designed the HyperEmbedding module, where a sample-conditioned hypernetwork dynamically generates a pair of matrix factors for channel-wise encoding, consisting of adaptive spatial pattern extraction and latent semantic feature re-projection. Thus, SpecAware gains the ability to perceive and interpret spatial-spectral features across diverse scenes and sensors. This, in turn, allows SpecAware to adaptively process a variable number of spectral channels, establishing a unified framework for joint pre-training. Extensive experiments on six datasets demonstrate that SpecAware can learn superior feature representations, excelling in land-cover semantic segmentation classification, change detection, and scene classification.
[113] Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery
Mahmoud El Hussieni, Bahadır K. Güntürk, Hasan F. Ateş, Oğuz Hanoğlu
Main category: cs.CV
TL;DR: YOLOv11 achieves strong performance in joint building instance segmentation and height classification from satellite imagery, outperforming previous models in both accuracy and speed.
Details
Motivation: Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring.Method: Uses YOLOv11 deep learning model with improved architecture for better feature combination and object localization. Evaluated on DFC2023 Track 2 dataset with 125,000 annotated buildings across 12 cities using precision, recall, F1 score, and mAP metrics.
Result: YOLOv11 achieves 60.4% mAP@50 and 38.3% mAP@50-95 for instance segmentation, with robust classification across five height tiers. Excels in handling occlusions, complex shapes, and class imbalance, particularly for rare high-rise structures.
Conclusion: YOLOv11 outperforms earlier multitask frameworks in detection accuracy and inference speed, making it suitable for real-time large-scale urban mapping and advancing semantic urban reconstruction.
Abstract: Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring. This paper presents a detailed analysis of YOLOv11, the recent advancement in the YOLO series of deep learning models, focusing on its application to joint building extraction and discrete height classification from satellite imagery. YOLOv11 builds on the strengths of earlier YOLO models by introducing a more efficient architecture that better combines features at different scales, improves object localization accuracy, and enhances performance in complex urban scenes. Using the DFC2023 Track 2 dataset – which includes over 125,000 annotated buildings across 12 cities – we evaluate YOLOv11’s performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLOv11 achieves strong instance segmentation performance with 60.4% mAP@50 and 38.3% mAP@50–95 while maintaining robust classification accuracy across five predefined height tiers. The model excels in handling occlusions, complex building shapes, and class imbalance, particularly for rare high-rise structures. Comparative analysis confirms that YOLOv11 outperforms earlier multitask frameworks in both detection accuracy and inference speed, making it well-suited for real-time, large-scale urban mapping. This research highlights YOLOv11’s potential to advance semantic urban reconstruction through streamlined categorical height modeling, offering actionable insights for future developments in remote sensing and geospatial intelligence.
[114] MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, Yichao Yan
Main category: cs.CV
TL;DR: MoRE is a dense 3D visual foundation model using Mixture-of-Experts architecture that achieves state-of-the-art performance across multiple geometric tasks through dynamic feature routing, confidence-based depth refinement, and semantic feature integration.
Details
Motivation: Scaling 3D models is challenging due to geometric supervision complexity and 3D data diversity. The paper aims to overcome these limitations for better 3D visual geometry reconstruction.Method: Proposes MoRE with Mixture-of-Experts architecture that dynamically routes features to task-specific experts. Incorporates confidence-based depth refinement, integrates dense semantic features with 3D backbone representations, and uses tailored loss functions for robust learning.
Result: Extensive experiments show MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.
Conclusion: MoRE successfully addresses 3D model scaling challenges through its expert-based architecture and specialized components, demonstrating superior performance in geometric reconstruction tasks.
Abstract: Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.
[115] Object-IR: Leveraging Object Consistency and Mesh Deformation for Self-Supervised Image Retargeting
Tianli Liao, Ran Wang, Siqing Zhang, Lei Li, Guangen Liu, Chenyang Zhao, Heling Cao, Peng Li
Main category: cs.CV
TL;DR: Object-IR is a self-supervised image retargeting method that uses mesh warping optimization guided by object appearance consistency and geometric constraints to eliminate distortion in semantically important regions.
Details
Motivation: Eliminating geometric distortion in semantically important regions remains a major challenge in image retargeting, and existing methods struggle to preserve object appearance while maintaining geometric properties.Method: Reformulates image retargeting as learning-based mesh warping optimization using a CNN to predict mesh grid motion. Uses object-consistent loss, geometric-preserving loss, and boundary loss in a comprehensive objective function without requiring manual annotations.
Result: Achieves state-of-the-art performance on RetargetMe benchmark, outperforming existing methods in both quantitative metrics and visual quality assessments. Processes 1024x683 images in 0.009s average inference time with real-time performance on consumer GPUs.
Conclusion: Object-IR provides an effective self-supervised solution for high-quality image retargeting that preserves semantic object appearance while eliminating geometric distortion, with efficient real-time performance.
Abstract: Eliminating geometric distortion in semantically important regions remains an intractable challenge in image retargeting. This paper presents Object-IR, a self-supervised architecture that reformulates image retargeting as a learning-based mesh warping optimization problem, where the mesh deformation is guided by object appearance consistency and geometric-preserving constraints. Given an input image and a target aspect ratio, we initialize a uniform rigid mesh at the output resolution and use a convolutional neural network to predict the motion of each mesh grid and obtain the deformed mesh. The retargeted result is generated by warping the input image according to the rigid mesh in the input image and the deformed mesh in the output resolution. To mitigate geometric distortion, we design a comprehensive objective function incorporating a) object-consistent loss to ensure that the important semantic objects retain their appearance, b) geometric-preserving loss to constrain simple scale transform of the important meshes, and c) boundary loss to enforce a clean rectangular output. Notably, our self-supervised paradigm eliminates the need for manually annotated retargeting datasets by deriving supervision directly from the input’s geometric and semantic properties. Extensive evaluations on the RetargetMe benchmark demonstrate that our Object-IR achieves state-of-the-art performance, outperforming existing methods in quantitative metrics and subjective visual quality assessments. The framework efficiently processes arbitrary input resolutions (average inference time: 0.009s for 1024x683 resolution) while maintaining real-time performance on consumer-grade GPUs. The source code will soon be available at https://github.com/tlliao/Object-IR.
[116] Fusion of Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis
Zhidong Yang, Xiuhui Shi, Wei Ba, Zhigang Song, Haijing Luan, Taiyuan Hu, Senlin Lin, Jiguang Wang, Shaohua Kevin Zhou, Rui Yan
Main category: cs.CV
TL;DR: FuseCPath is a novel framework that fuses heterogeneous pathological foundation models (FMs) for whole slide image analysis, achieving superior ensemble performance through multi-view clustering-based patch selection, cluster-level re-embedding, and collaborative distillation strategies.
Details
Motivation: Current pathological foundation models exhibit substantial heterogeneity due to diverse private training datasets and different network architectures, which introduces performance variability when using extracted features from different FMs in downstream tasks.Method: Proposes FuseCPath with three key components: (1) multi-view clustering-based method to filter discriminative patches using multiple FMs’ embeddings, (2) cluster-level re-embedding strategy to capture patch-level local features, and (3) collaborative distillation strategy to explore connections between slide-level FMs.
Result: Extensive experiments on lung cancer, bladder cancer, and colorectal cancer datasets from TCGA demonstrate that FuseCPath achieves state-of-the-art performance across multiple tasks on these public datasets.
Conclusion: FuseCPath effectively fuses heterogeneous pathological foundation models, providing a superior ensemble approach for whole slide image analysis in computational pathology.
Abstract: Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathological foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level feature representations from WSIs. However, current pathological FMs have exhibited substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when we utilize the extracted features from different FMs in the downstream tasks. To fully explore the advantage of multiple FMs effectively, in this work, we propose a novel framework for the fusion of heterogeneous pathological FMs, called FuseCPath, yielding a model with a superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs’ embeddings. (ii) To effectively fuse the heterogeneous patch-level FMs, we devise a cluster-level re-embedding strategy to online capture patch-level local features. (iii) To effectively fuse the heterogeneous slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments conducted on lung cancer, bladder cancer, and colorectal cancer datasets from The Cancer Genome Atlas (TCGA) have demonstrated that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on these public datasets.
[117] Trans-defense: Transformer-based Denoiser for Adversarial Defense with Spatial-Frequency Domain Representation
Alik Pramanick, Mayank Bansal, Utkarsh Srivastava, Suklav Ghosh, Arijit Sur
Main category: cs.CV
TL;DR: Two-phase training method combining spatial and frequency domain denoising with DWT and transformer layers to defend against adversarial attacks on images, significantly improving classification accuracy.
Details
Motivation: Deep neural networks are vulnerable to adversarial attacks, limiting their use in security-critical systems. High-frequency components of attacked images are more severely corrupted than lower-frequency ones.Method: Two-phase approach: 1) Train denoising network using spatial features and Discrete Wavelet Transform (DWT) with transformer layer integration; 2) Retrain classifier using denoised images to enhance robustness.
Result: Experimental results on MNIST, CIFAR-10, and Fashion-MNIST show remarkable elevation in classification accuracy, substantially exceeding performance of denoising networks and adversarial training approaches.
Conclusion: The proposed method effectively defends against adversarial attacks by combining spatial and frequency domain analysis through DWT and transformer integration, significantly improving model robustness.
Abstract: In recent times, deep neural networks (DNNs) have been successfully adopted for various applications. Despite their notable achievements, it has become evident that DNNs are vulnerable to sophisticated adversarial attacks, restricting their applications in security-critical systems. In this paper, we present two-phase training methods to tackle the attack: first, training the denoising network, and second, the deep classifier model. We propose a novel denoising strategy that integrates both spatial and frequency domain approaches to defend against adversarial attacks on images. Our analysis reveals that high-frequency components of attacked images are more severely corrupted compared to their lower-frequency counterparts. To address this, we leverage Discrete Wavelet Transform (DWT) for frequency analysis and develop a denoising network that combines spatial image features with wavelets through a transformer layer. Next, we retrain the classifier using the denoised images, which enhances the classifier’s robustness against adversarial attacks. Experimental results across the MNIST, CIFAR-10, and Fashion-MNIST datasets reveal that the proposed method remarkably elevates classification accuracy, substantially exceeding the performance by utilizing a denoising network and adversarial training approaches. The code is available at https://github.com/Mayank94/Trans-Defense.
[118] C-LEAD: Contrastive Learning for Enhanced Adversarial Defense
Suklav Ghosh, Sonal Kumar, Arijit Sur
Main category: cs.CV
TL;DR: A novel adversarial defense method using contrastive learning to train models with both clean and adversarially perturbed images, improving robustness against various attacks.
Details
Motivation: Deep neural networks are vulnerable to adversarial attacks, which is a critical issue for deploying robust deep-learning systems in real-world applications.Method: Utilizes contrastive learning with contrastive loss function to train classification models using both clean and adversarially perturbed images, optimizing model parameters alongside perturbations.
Result: Experimental results show significant improvements in model robustness against various types of adversarial perturbations.
Conclusion: Contrastive loss helps extract more informative and resilient features, contributing to adversarial robustness in deep learning.
Abstract: Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model’s parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model’s robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning.
[119] Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Yehna Kim andYoung-Eun Kim, Seong-Whan Lee
Main category: cs.CV
TL;DR: The paper proposes using web-crawled descriptions with LLM-extracted keywords to address ambiguity in zero-shot action recognition, achieving state-of-the-art results on multiple datasets.
Details
Motivation: Address ambiguity in zero-shot action recognition caused by multi-semantic words when relying solely on action classes, and reduce dependency on human annotation.Method: Use web-crawled descriptions with LLM-extracted keywords, and introduce a spatio-temporal interaction module to align description attributes with video content.
Result: Achieved 81.0% on UCF-101, 53.1% on HMDB-51, and 68.9% on Kinetics-600 in zero-shot experiments.
Conclusion: The approach effectively handles semantic ambiguity in action recognition and demonstrates strong adaptability across various downstream tasks.
Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model’s adaptability and effectiveness across various downstream tasks.
[120] RegionRAG: Region-level Retrieval-Augumented Generation for Visually-Rich Documents
Yinglu Li, Zhiying Lu, Zhihang Liu, Chuanbin Liu, Hongtao Xie
Main category: cs.CV
TL;DR: RegionRAG shifts multi-modal RAG from document-level to region-level retrieval, using hybrid supervision to identify relevant patches and dynamic grouping for semantic regions, improving accuracy while reducing visual tokens.
Details
Motivation: Current multi-modal RAG methods use entire documents as retrieval units, introducing irrelevant visual content that dilutes focus on salient information and degrades performance.Method: Proposes region-level retrieval with hybrid supervision from labeled/unlabeled data to pinpoint relevant patches, and dynamic pipeline for grouping patches into semantic regions during inference.
Result: Achieves state-of-the-art on 6 benchmarks: 10.02% improvement in R@1 retrieval accuracy, 3.56% increase in QA accuracy, while using only 71.42% of visual tokens compared to prior methods.
Conclusion: RegionRAG enables generators to focus on concise relevant visual content, improving both efficiency and accuracy in multi-modal RAG systems.
Abstract: Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model’s attention and further degrade the performance. To address this challenge, we propose \modelname, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, \modelname enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. Improves retrieval accuracy by 10.02% in R@1 on average and increases question answering accuracy by 3.56% while using only 71.42% visual tokens compared to prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
[121] T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis
Raza Imam, Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub
Main category: cs.CV
TL;DR: T^3 is a test-time adaptive merging framework that dynamically combines vision-language models for medical imaging using Jensen-Shannon divergence, achieving SOTA performance across diverse modalities while maintaining efficiency.
Details
Motivation: Existing model-merging techniques designed for natural images fail in medical imaging due to modality shifts and static interpolation limitations, creating a need for adaptive approaches that balance specialist precision with generalist robustness.Method: T^3 computes per-sample interpolation coefficients via Jensen-Shannon divergence between model outputs, dynamically adjusting model fusion. T^3_B extends this to batch-wise computation to reduce inference costs.
Result: T^3 achieves state-of-the-art Top-1 accuracy and error reduction across four medical modalities, outperforming strong baselines in cross-evaluation spanning in-domain, base-to-novel, and corruption scenarios.
Conclusion: T^3 provides an efficient, backpropagation-free framework for adaptive medical vision-language model deployment, enabling reliable performance across diverse clinical tasks and modality shifts.
Abstract: In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models’ output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
[122] HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration
Shaojie Zhang, Pei Fu, Ruoceng Zhang, Jiahui Yang, Anan Du, Xiuwen Xi, Shaokang Wang, Ying Huang, Bin Qin, Zhenbo Luo, Jian Luan
Main category: cs.CV
TL;DR: HyperClick is a framework that improves GUI agent reliability by calibrating uncertainty and confidence, reducing overconfidence through dual reward mechanisms and spatial confidence modeling.
Details
Motivation: Current GUI agents lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions in dynamic GUI automation where single errors can cause task failure.Method: Proposes HyperClick with dual reward mechanism combining binary reward for correct actions and truncated Gaussian-based spatial confidence modeling, calibrated using Brier score to jointly optimize grounding accuracy and confidence reliability.
Result: Extensive experiments on seven challenge benchmarks show HyperClick achieves state-of-the-art performance while providing well-calibrated confidence, reducing overconfidence and supporting more reliable GUI automation.
Conclusion: HyperClick enables explicit confidence calibration and introspective self-criticism, making GUI agents more reliable by aligning confidence with actual accuracy.
Abstract: Autonomous Graphical User Interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions. We first systematically evaluate probabilistic and verbalized confidence in general and GUI-specific models, revealing a misalignment between confidence and actual accuracy, which is particularly critical in dynamic GUI automation tasks, where single errors can cause task failure. To address this, we propose HyperClick, a novel framework that enhances reliable GUI grounding through uncertainty calibration. HyperClick introduces a dual reward mechanism, combining a binary reward for correct actions with a truncated Gaussian-based spatial confidence modeling, calibrated using the Brier score. This approach jointly optimizes grounding accuracy and confidence reliability, fostering introspective self-criticism. Extensive experiments on seven challenge benchmarks show that HyperClick achieves state-of-the-art performance while providing well-calibrated confidence. By enabling explicit confidence calibration and introspective self-criticism, HyperClick reduces overconfidence and supports more reliable GUI automation.
[123] FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, Yang You
Main category: cs.CV
TL;DR: FOCUS is a training-free keyframe selection method for long videos that treats frame selection as a combinatorial exploration problem using multi-armed bandits, achieving substantial accuracy improvements while processing only 2% of frames.
Details
Motivation: Existing keyframe selection methods for multimodal LLMs either uniformly subsample or use retrieval-style scoring, which can miss informative moments and rely on pre-filtering, making them inefficient for hour-long videos.Method: Frames keyframe selection as a combinatorial pure-exploration problem using multi-armed bandits, treating temporal clips as arms and using empirical means with Bernstein confidence radius to identify informative regions through a two-stage exploration-exploitation procedure.
Result: Achieves 11.9% accuracy gain on LongVideoBench for videos longer than 20 minutes while processing less than 2% of video frames, demonstrating substantial improvements on long-video question-answering benchmarks.
Conclusion: FOCUS provides an effective, model-agnostic solution for scalable long-video understanding with MLLMs, offering theoretical guarantees and significant performance improvements with minimal frame processing.
Abstract: Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
[124] Rethinking Robust Adversarial Concept Erasure in Diffusion Models
Qinghong Yin, Yu Tian, Yue Zhang
Main category: cs.CV
TL;DR: S-GRACE introduces semantics-guided adversarial concept erasure for diffusion models, improving erasure performance by 26% while reducing training time by 90% compared to existing methods.
Details
Motivation: Existing concept erasure methods in diffusion models use adversarial training but neglect conceptual semantics, leading to incomplete concept coverage or disruption of other concepts.Method: S-GRACE leverages semantic guidance within concept space to generate adversarial samples and perform erasure training, addressing the limitations of existing approaches.
Result: Experiments show S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90% compared to seven state-of-the-art methods.
Conclusion: Semantics-guided adversarial concept erasure effectively addresses the limitations of existing methods by properly fitting concept spaces through semantic guidance.
Abstract: Concept erasure aims to selectively unlearning undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which grace leveraging semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.
[125] Versatile and Efficient Medical Image Super-Resolution Via Frequency-Gated Mamba
Wenfeng Huang, Xiangyun Liao, Wei Cao, Wenjing Jia, Weixin Si
Main category: cs.CV
TL;DR: FGMamba is a lightweight frequency-aware state-space model for medical image super-resolution that combines global dependency modeling with fine-detail enhancement using gated attention and pyramid frequency fusion.
Details
Motivation: Medical image super-resolution is crucial for diagnostic accuracy and cost reduction, but existing methods struggle to model both long-range anatomical structures and fine-grained frequency details efficiently.Method: Proposes FGMamba with two key components: Gated Attention-enhanced State-Space Module (GASM) for efficient state-space modeling with dual-branch attention, and Pyramid Frequency Fusion Module (PFFM) for multi-resolution high-frequency detail capture using FFT-guided fusion.
Result: Achieves superior PSNR/SSIM across five medical imaging modalities (Ultrasound, OCT, MRI, CT, Endoscopic) while maintaining compact parameter footprint (<0.75M), outperforming CNN-based and Transformer-based state-of-the-art methods.
Conclusion: Frequency-aware state-space modeling is effective for scalable and accurate medical image enhancement, demonstrating the viability of lightweight architectures for medical super-resolution tasks.
Abstract: Medical image super-resolution (SR) is essential for enhancing diagnostic accuracy while reducing acquisition cost and scanning time. However, modeling both long-range anatomical structures and fine-grained frequency details with low computational overhead remains challenging. We propose FGMamba, a novel frequency-aware gated state-space model that unifies global dependency modeling and fine-detail enhancement into a lightweight architecture. Our method introduces two key innovations: a Gated Attention-enhanced State-Space Module (GASM) that integrates efficient state-space modeling with dual-branch spatial and channel attention, and a Pyramid Frequency Fusion Module (PFFM) that captures high-frequency details across multiple resolutions via FFT-guided fusion. Extensive evaluations across five medical imaging modalities (Ultrasound, OCT, MRI, CT, and Endoscopic) demonstrate that FGMamba achieves superior PSNR/SSIM while maintaining a compact parameter footprint ($<$0.75M), outperforming CNN-based and Transformer-based SOTAs. Our results validate the effectiveness of frequency-aware state-space modeling for scalable and accurate medical image enhancement.
[126] CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram
Alvee Hassan, Rusab Sarmun, Muhammad E. H. Chowdhury, M. Murugappan, Md. Sakib Abrar Hossain, Sakib Mahmud, Abdulrahman Alqahtani, Sohaib Bassam Zoghoul, Amith Khandakar, Susu M. Zughaier, Somaya Al-Maadeed, Anwarul Hasan
Main category: cs.CV
TL;DR: CASR-Net is a three-stage pipeline for coronary artery segmentation that uses multichannel preprocessing, a UNet with DenseNet121 encoder and Self-ONN decoder, and contour refinement to achieve state-of-the-art performance on stenotic vessel segmentation.
Details
Motivation: Early detection of coronary artery disease is critical but poor X-ray image quality can impede clinical diagnosis. Automated segmentation of coronary arteries from angiographic images can support clinicians in diagnosis and treatment planning.Method: Three-stage pipeline: 1) Multichannel preprocessing combining CLAHE and improved Ben Graham method, 2) Segmentation network with UNet architecture using DenseNet121 encoder and Self-ONN decoder to preserve vessel continuity, 3) Contour refinement module to suppress false positives.
Result: Achieved IoU of 61.43%, DSC of 76.10%, and clDice of 79.36% on combined public datasets with both healthy and stenotic arteries, outperforming several state-of-the-art models. Multichannel preprocessing provided 0.31-0.89% DSC and 0.40-1.16% IoU improvements.
Conclusion: CASR-Net provides a robust approach for automated coronary artery segmentation that can serve as a valuable tool to support clinicians in CAD diagnosis and treatment planning, particularly for detecting narrow and stenotic vessel branches.
Abstract: Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving patient treatment planning. While angiographic image analysis from X-rays is a common and cost-effective method for identifying cardiac abnormalities, including stenotic coronary arteries, poor image quality can significantly impede clinical diagnosis. We present the Coronary Artery Segmentation and Refinement Network (CASR-Net), a three-stage pipeline comprising image preprocessing, segmentation, and refinement. A novel multichannel preprocessing strategy combining CLAHE and an improved Ben Graham method provides incremental gains, increasing Dice Score Coefficient (DSC) by 0.31-0.89% and Intersection over Union (IoU) by 0.40-1.16% compared with using the techniques individually. The core innovation is a segmentation network built on a UNet with a DenseNet121 encoder and a Self-organized Operational Neural Network (Self-ONN) based decoder, which preserves the continuity of narrow and stenotic vessel branches. A final contour refinement module further suppresses false positives. Evaluated with 5-fold cross-validation on a combination of two public datasets that contain both healthy and stenotic arteries, CASR-Net outperformed several state-of-the-art models, achieving an IoU of 61.43%, a DSC of 76.10%, and clDice of 79.36%. These results highlight a robust approach to automated coronary artery segmentation, offering a valuable tool to support clinicians in diagnosis and treatment planning.
[127] Overcoming Prompts Pool Confusion via Parameterized Prompt for Incremental Object Detection
Zijia An, Boyu Diao, Ruiqi Liu, Libo Huang, Chuanguang Yang, Fei Wang, Zhulin An, Yongjun Xu
Main category: cs.CV
TL;DR: P²IOD introduces parameterized prompts for incremental object detection that adaptively consolidate knowledge across tasks while constraining updates to prevent catastrophic forgetting, achieving state-of-the-art performance on VOC2007 and COCO datasets.
Details
Motivation: Existing prompt-based approaches assume disjoint class sets across incremental tasks, which is unsuitable for object detection due to co-occurrence phenomena where unlabeled objects from previous tasks appear in current images, causing confusion in prompt pools.Method: P²IOD uses neural networks as parameterized prompts to adaptively consolidate knowledge across tasks and employs a parameterized prompts fusion strategy to constrain prompt structure updates, preventing catastrophic forgetting.
Result: Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate P²IOD’s effectiveness in incremental object detection and achieve state-of-the-art performance among existing baselines.
Conclusion: Parameterized prompts with adaptive consolidation properties and constrained updates effectively address the challenges of incremental object detection, particularly in co-occurring scenarios where traditional disjoint class assumptions fail.
Abstract: Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompts pool based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging neural networks global evolution properties, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P$^2$IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate that P$^2$IOD’s effectiveness in IOD and achieves the state-of-the-art performance among existing baselines.
[128] SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction
Wenfeng Huang, Xiangyun Liao, Yinling Qian, Hao Liu, Yongming Yang, Wenjing Jia, Qiong Wang
Main category: cs.CV
TL;DR: SAGS is a self-adaptive alias-free Gaussian splatting framework that improves deformable tissue reconstruction in endoscopic surgery by addressing aliasing and movement artifacts, achieving superior performance over state-of-the-art methods.
Details
Motivation: Current Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) methods for endoscopic tissue reconstruction suffer from aliasing and artifacts caused by tissue movement, which degrade visualization quality. Existing 3DGS methods prioritize rendering speed but neglect these critical issues.Method: Proposed SAGS framework with an attention-driven, dynamically weighted 4D deformation decoder that leverages 3D smoothing filters and 2D Mip filters to mitigate artifacts and better capture fine details of tissue movement.
Result: Experimental results on EndoNeRF and SCARED benchmarks show superior performance in all metrics (PSNR, SSIM, LPIPS) compared to state-of-the-art methods, with better visualization quality.
Conclusion: SAGS effectively addresses aliasing and artifact problems in deformable endoscopic tissue reconstruction while maintaining high rendering efficiency, demonstrating significant improvements over existing approaches.
Abstract: Surgical reconstruction of dynamic tissues from endoscopic videos is a crucial technology in robot-assisted surgery. The development of Neural Radiance Fields (NeRFs) has greatly advanced deformable tissue reconstruction, achieving high-quality results from video and image sequences. However, reconstructing deformable endoscopic scenes remains challenging due to aliasing and artifacts caused by tissue movement, which can significantly degrade visualization quality. The introduction of 3D Gaussian Splatting (3DGS) has improved reconstruction efficiency by enabling a faster rendering pipeline. Nevertheless, existing 3DGS methods often prioritize rendering speed while neglecting these critical issues. To address these challenges, we propose SAGS, a self-adaptive alias-free Gaussian splatting framework. We introduce an attention-driven, dynamically weighted 4D deformation decoder, leveraging 3D smoothing filters and 2D Mip filters to mitigate artifacts in deformable tissue reconstruction and better capture the fine details of tissue movement. Experimental results on two public benchmarks, EndoNeRF and SCARED, demonstrate that our method achieves superior performance in all metrics of PSNR, SSIM, and LPIPS compared to the state of the art while also delivering better visualization quality.
[129] Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis
Weiming Chen, Yijia Wang, Zhihan Zhu, Zhihai He
Main category: cs.CV
TL;DR: Proposes a method for ultra-low bit rate visual communication by integrating image generation with deep image compression using joint text and coding latent to guide rectified flow models for precise scene reconstruction.
Details
Motivation: Addresses the need for ultra-low bit rate visual communication in challenging scenarios like deep space exploration and battlefield intelligence, where existing text-to-image models only provide semantic-level approximations that are insufficient for accurate vision analysis and human interactions.Method: Integrates image generation with deep image compression using joint text and coding latent to guide rectified flow models. Both semantic text description and coding latent are encoded and transmitted at very small bit rates.
Result: Experimental results show the method achieves same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth.
Conclusion: The proposed approach enables accurate visual scene reconstruction at ultra-low bit rates, making it suitable for bandwidth-constrained applications without sacrificing analysis accuracy or interaction performance.
Abstract: We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far insufficient for the purpose of visual communication and remote vision analysis and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.
[130] MeisenMeister: A Simple Two Stage Pipeline for Breast Cancer Classification on MRI
Benjamin Hamm, Yannick Kirchhoff, Maximilian Rokuss, Klaus Maier-Hein
Main category: cs.CV
TL;DR: The ODELIA Breast MRI Challenge 2025 aims to improve early breast cancer detection through better interpretation of MRI scans, focusing on classification-based approaches due to limited segmentation labels.
Details
Motivation: Breast cancer detection remains challenging despite existing methods, primarily due to limited availability of high-quality segmentation labels, making robust classification approaches crucial for large-scale screening applications.Method: The approach involves an iterative development process with key stages of experimentation, evaluation, and refinement, guided by foundational assumptions and concepts.
Result: The team developed a solution focused on performance, robustness, and clinical relevance, with full implementation publicly released.
Conclusion: Classification-based approaches are essential for advancing early breast cancer detection in MRI screening, and the developed solution addresses key challenges in this domain.
Abstract: The ODELIA Breast MRI Challenge 2025 addresses a critical issue in breast cancer screening: improving early detection through more efficient and accurate interpretation of breast MRI scans. Even though methods for general-purpose whole-body lesion segmentation as well as multi-time-point analysis exist, breast cancer detection remains highly challenging, largely due to the limited availability of high-quality segmentation labels. Therefore, developing robust classification-based approaches is crucial for the future of early breast cancer detection, particularly in applications such as large-scale screening. In this write-up, we provide a comprehensive overview of our approach to the challenge. We begin by detailing the underlying concept and foundational assumptions that guided our work. We then describe the iterative development process, highlighting the key stages of experimentation, evaluation, and refinement that shaped the evolution of our solution. Finally, we present the reasoning and evidence that informed the design choices behind our final submission, with a focus on performance, robustness, and clinical relevance. We release our full implementation publicly at https://github.com/MIC-DKFZ/MeisenMeister
[131] Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
Yijia Wang, Yiqing Shen, Weiming Chen, Zhihai He
Main category: cs.CV
TL;DR: CIELR is a method that converts complex image editing instructions into simple actions using LLM reasoning, avoiding joint fine-tuning of LLMs and diffusion models.
Details
Motivation: Existing methods struggle with complex editing instructions and require computationally expensive joint fine-tuning of LLMs and diffusion models.Method: Constructs structured semantic representation of input images using foundation models, then uses iterative update mechanism to refine representation for fine-grained visual representation.
Result: Surpasses previous state-of-the-art by 9.955 dB in PSNR on SmartEdit dataset and outperforms previous methods on CIEBench benchmark.
Conclusion: CIELR enables complex image editing without joint fine-tuning, achieving superior performance in preserving consistent regions.
Abstract: Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called \textbf{C}omplex \textbf{I}mage \textbf{E}diting via \textbf{L}LM \textbf{R}easoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples of public datasets of complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at \href{https://github.com/Jia-shao/Reasoning-Editing}{https://github.com/Jia-shao/Reasoning-Editing}.
[132] RzenEmbed: Towards Comprehensive Multimodal Retrieval
Weijian Jian, Yajun Zhang, Dawei Liang, Chunyu Xie, Yixiao He, Dawei Leng, Yuhui Yin
Main category: cs.CV
TL;DR: RzenEmbed is a unified multimodal embedding framework that supports text, images, videos, and visual documents, achieving state-of-the-art performance on the MMEB benchmark through a novel two-stage training strategy with improved InfoNCE loss.
Details
Motivation: Existing CLIP-based MLLMs primarily focus on natural images and offer limited support for other visual modalities like videos and visual documents, creating a gap in universal multimodal embedding capabilities.Method: Two-stage training strategy: first stage for foundational text and multimodal retrieval, second stage with improved InfoNCE loss featuring hardness-weighted mechanism and false negative mitigation. Also uses learnable temperature parameter and model souping.
Result: Sets new state-of-the-art on MMEB benchmark, achieving best overall score and outperforming all prior work on challenging video and visual document retrieval tasks.
Conclusion: RzenEmbed successfully bridges the modality gap by providing a unified framework for diverse visual modalities while enhancing discriminative power and instruction-following capabilities.
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model’s discriminative power but also improves its instruction-following capabilities. We further boost performance with learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available in https://huggingface.co/qihoo360/RzenEmbed.
[133] FPS: Feedforward-based Parameter Selection For Efficient Fine-Tuning
Kenneth Yang, Wen-Li Wei, Jen-Chun Lin
Main category: cs.CV
TL;DR: FPS is a gradient-free parameter selection method that identifies optimal parameter subsets in a single forward pass, achieving comparable performance to SOTA methods while significantly reducing memory usage and accelerating selection.
Details
Motivation: To address limitations of existing PEFT methods - addition-based methods cause inference latency and engineering complexity, while selection-based methods require full backward passes with high memory usage like full fine-tuning.Method: FPS ranks parameters by the product of their magnitudes and corresponding input activations in a single forward pass, leveraging both pre-trained knowledge and downstream data without gradients.
Result: On 24 visual tasks from FGVC and VTAB-1k, FPS achieves comparable performance to SOTA methods while reducing peak memory usage by nearly 9× and accelerating parameter selection by about 2×.
Conclusion: FPS offers a genuinely memory-efficient and practical solution for fine-tuning large-scale pre-trained models through gradient-free parameter selection.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations. Addition-based methods, such as Adapters [1], introduce inference latency and engineering complexity, while selection-based methods like Gradient-based Parameter Selection (GPS) [2] require a full backward pass, which results in the same peak memory usage as full fine-tuning. To address this dilemma, we propose Feedforward-based Parameter Selection (FPS), a gradient-free method that identifies an optimal parameter subset in a single forward pass. FPS ranks parameters by the product of their magnitudes and corresponding input activations, leveraging both pre-trained knowledge and downstream data. Evaluated on $24$ visual tasks from FGVC and VTAB-1k, FPS achieves performance comparable to state-of-the-art methods while reducing peak memory usage by nearly $9 \times$ and accelerating parameter selection by about $2 \times$, offering a genuinely memory-efficient and practical solution for fine-tuning large-scale pre-trained models.
[134] Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu
Main category: cs.CV
TL;DR: The paper introduces a co-designed framework for universal video retrieval, including a comprehensive benchmark (UVRB), scalable data synthesis, and a novel training curriculum (Modality Pyramid), achieving state-of-the-art zero-shot generalization.
Details
Motivation: Current video retrieval systems are structurally misaligned due to narrow benchmarks that incentivize limited data and single-task training, suppressing universal capability. There's an absence of diagnostic evaluation that defines and demands multi-dimensional generalization.Method: 1) Create Universal Video Retrieval Benchmark (UVRB) with 16 datasets to measure performance and diagnose capability gaps. 2) Develop scalable synthesis workflow generating 1.55M high-quality pairs to populate semantic space. 3) Design Modality Pyramid curriculum to train General Video Embedder (GVE) by leveraging interconnections in diverse data.
Result: GVE achieves state-of-the-art zero-shot generalization on UVRB. Analysis reveals popular benchmarks are poor predictors of general ability, and partially relevant retrieval is a dominant but overlooked scenario.
Conclusion: The co-designed framework provides a practical path to escape limited scope and advance toward truly universal video retrieval, addressing critical gaps in current evaluation and training paradigms.
Abstract: The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB’s diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
[135] Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V
Meftun Akarsu, Kerem Catay, Sedat Bin Vedat, Enes Kutay Yarkan, Ilke Senturk, Arda Sar, Dafne Eksioglu
Main category: cs.CV
TL;DR: A two-stage pipeline for fine-tuning video diffusion transformers to generate cinematic scenes from small datasets, decoupling visual style learning from motion generation using LoRA adapters and temporal expansion.
Details
Motivation: To enable efficient synthesis of cinematic scenes for television and film production using small datasets, addressing the need for domain-specific visual style adaptation in video generation.Method: Two-stage process: 1) LoRA modules in cross-attention layers for visual style learning from short clips, 2) Keyframe generation followed by temporal expansion to 720p sequences with parallelization and sequence partitioning for faster inference.
Result: Quantitative improvements in FVD, CLIP-SIM, and LPIPS metrics, with qualitative expert user study showing enhanced cinematic fidelity and temporal stability over the base model.
Conclusion: The pipeline enables efficient domain transfer and high-quality cinematic scene generation, with released code supporting reproducibility and adaptation across different cinematic domains.
Abstract: We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim’s historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model’s video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
[136] Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
Wu Wei, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi
Main category: cs.CV
TL;DR: Proposes Alignment across Trees method for symmetric hierarchical feature alignment between vision and language modalities using hyperbolic manifolds with distinct curvatures and an intermediate manifold for alignment.
Details
Motivation: Existing VLMs extract hierarchical text features but use single features for images, creating asymmetric and suboptimal modality alignment.Method: Constructs tree-like hierarchical features for both images and text, extracts semantic-aware visual features using cross-attention guided by text, embeds features in hyperbolic manifolds with different curvatures, and aligns them using an intermediate manifold with KL distance minimization.
Result: Consistently outperforms strong baselines on taxonomic open-set classification tasks across multiple image datasets under few-shot and cross-domain settings.
Conclusion: The proposed symmetric hierarchical alignment method effectively addresses modality asymmetry in VLMs and improves performance on challenging classification tasks.
Abstract: Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
[137] A Hybrid Deep Learning and Forensic Approach for Robust Deepfake Detection
Sales Aribe Jr
Main category: cs.CV
TL;DR: Hybrid deepfake detection framework combining forensic features with deep learning outperforms single-method approaches and shows robustness against compression, adversarial attacks, and unseen manipulations.
Details
Motivation: Address limitations of existing deepfake detection methods: deep learning lacks generalization and forensic analysis struggles with new manipulation techniques, aiming to create more resilient detection systems.Method: Fuses forensic features (noise residuals, JPEG compression traces, frequency-domain descriptors) with deep learning representations from CNNs and vision transformers (ViTs).
Result: Achieved F1-scores of 0.96, 0.82, and 0.77 on FaceForensics++, Celeb-DF v2, and DFDC datasets; maintained performance under compression (F1=0.87), adversarial attacks (AUC=0.84), and unseen manipulations (F1=0.79); explainability analysis showed 82% overlap with ground-truth manipulated regions.
Conclusion: Hybrid approaches provide balanced solution combining deep learning adaptability with forensic interpretability for resilient and trustworthy deepfake detection systems.
Abstract: The rapid evolution of generative adversarial networks (GANs) and diffusion models has made synthetic media increasingly realistic, raising societal concerns around misinformation, identity fraud, and digital trust. Existing deepfake detection methods either rely on deep learning, which suffers from poor generalization and vulnerability to distortions, or forensic analysis, which is interpretable but limited against new manipulation techniques. This study proposes a hybrid framework that fuses forensic features, including noise residuals, JPEG compression traces, and frequency-domain descriptors, with deep learning representations from convolutional neural networks (CNNs) and vision transformers (ViTs). Evaluated on benchmark datasets (FaceForensics++, Celeb-DF v2, DFDC), the proposed model consistently outperformed single-method baselines and demonstrated superior performance compared to existing state-of-the-art hybrid approaches, achieving F1-scores of 0.96, 0.82, and 0.77, respectively. Robustness tests demonstrated stable performance under compression (F1 = 0.87 at QF = 50), adversarial perturbations (AUC = 0.84), and unseen manipulations (F1 = 0.79). Importantly, explainability analysis showed that Grad-CAM and forensic heatmaps overlapped with ground-truth manipulated regions in 82 percent of cases, enhancing transparency and user trust. These findings confirm that hybrid approaches provide a balanced solution, combining the adaptability of deep models with the interpretability of forensic cues, to develop resilient and trustworthy deepfake detection systems.
[138] Who Does Your Algorithm Fail? Investigating Age and Ethnic Bias in the MAMA-MIA Dataset
Aditya Parikh, Sneha Das, Aasa Feragen
Main category: cs.CV
TL;DR: The paper audits fairness in breast cancer tumor segmentation, revealing age-related bias against younger patients and showing how data aggregation affects ethnic biases.
Details
Motivation: Fairness evaluation in medical image segmentation is underexplored, and unaddressed bias can lead to healthcare disparities and be amplified through iterative model development.Method: Audited the fairness of automated segmentation labels in the MAMA-MIA breast cancer dataset by evaluating segmentation quality across age, ethnicity, and data source.
Result: Revealed intrinsic age-related bias against younger patients that persists after controlling for confounding factors, and showed how data aggregation influences site-specific ethnic biases.
Conclusion: Investigating data at a granular level is necessary to address segmentation biases, as physiological factors may contribute to age-related bias and data aggregation affects ethnic bias patterns.
Abstract: Deep learning models aim to improve diagnostic workflows, but fairness evaluation remains underexplored beyond classification, e.g., in image segmentation. Unaddressed segmentation bias can lead to disparities in the quality of care for certain populations, potentially compounded across clinical decision points and amplified through iterative model development. Here, we audit the fairness of the automated segmentation labels provided in the breast cancer tumor segmentation dataset MAMA-MIA. We evaluate automated segmentation quality across age, ethnicity, and data source. Our analysis reveals an intrinsic age-related bias against younger patients that continues to persist even after controlling for confounding factors, such as data source. We hypothesize that this bias may be linked to physiological factors, a known challenge for both radiologists and automated systems. Finally, we show how aggregating data from multiple data sources influences site-specific ethnic biases, underscoring the necessity of investigating data at a granular level.
[139] Mitigating Semantic Collapse in Partially Relevant Video Retrieval
WonJun Moon, MinSeok Jung, Gilhan Park, Tae-Young Kim, Cheol-Ho Cho, Woojin Jun, Jae-Pil Heo
Main category: cs.CV
TL;DR: The paper addresses semantic collapse in Partially Relevant Video Retrieval (PRVR) by proposing methods to preserve semantic relationships in text queries and disentangle hierarchical video representations across temporal scales.
Details
Motivation: Existing PRVR methods treat all annotated text-video pairs as positives and others as negatives, ignoring semantic variations within videos and across different videos. This causes embeddings of queries and video segments for distinct events in the same video to collapse together while driving apart semantically similar content from different videos.Method: Proposes Text Correlation Preservation Learning to preserve semantic relationships in text queries, and Cross-Branch Video Alignment (CBVA) with contrastive alignment to disentangle hierarchical video representations across temporal scales. Also introduces order-preserving token merging and adaptive CBVA to enhance alignment.
Result: Extensive experiments on PRVR benchmarks demonstrate that the framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
Conclusion: The proposed framework successfully addresses semantic collapse in both text and video embedding spaces, leading to significant improvements in partially relevant video retrieval performance.
Abstract: Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
[140] DeblurSDI: Blind Image Deblurring Using Self-diffusion
Yanlong Yang, Guanxiong Luo
Main category: cs.CV
TL;DR: DeblurSDI is a zero-shot, self-supervised framework for blind image deconvolution that uses self-diffusion without pre-training, achieving superior performance in recovering sharp images and blur kernels.
Details
Motivation: Traditional blind deconvolution methods rely on handcrafted priors, while deep learning approaches need large external datasets, limiting adaptability to real-world scenarios.Method: Formulates blind deconvolution as iterative reverse self-diffusion from noise, optimizing two neural networks for image and kernel refinement with data consistency and L1-norm sparsity, plus noise scheduling for stability.
Result: Extensive experiments show DeblurSDI consistently achieves superior performance, recovering sharp images and accurate kernels even in highly degraded scenarios.
Conclusion: DeblurSDI provides a robust zero-shot solution that dynamically learns instance-specific priors, demonstrating remarkable robustness to blur kernel variations.
Abstract: Blind image deconvolution is a challenging ill-posed inverse problem, where both the latent sharp image and the blur kernel are unknown. Traditional methods often rely on handcrafted priors, while modern deep learning approaches typically require extensive pre-training on large external datasets, limiting their adaptability to real-world scenarios. In this work, we propose DeblurSDI, a zero-shot, self-supervised framework based on self-diffusion (SDI) that requires no prior training. DeblurSDI formulates blind deconvolution as an iterative reverse self-diffusion process that starts from pure noise and progressively refines the solution. At each step, two randomly-initialized neural networks are optimized continuously to refine the sharp image and the blur kernel. The optimization is guided by an objective function combining data consistency with a sparsity-promoting L1-norm for the kernel. A key innovation is our noise scheduling mechanism, which stabilizes the optimization and provides remarkable robustness to variations in blur kernel size. These allow DeblurSDI to dynamically learn an instance-specific prior tailored to the input image. Extensive experiments demonstrate that DeblurSDI consistently achieves superior performance, recovering sharp images and accurate kernels even in highly degraded scenarios.
[141] CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging
Aon Safdar, Mohamed Saadeldin
Main category: cs.CV
TL;DR: CoMViT is a compact Vision Transformer optimized for medical imaging that achieves robust performance across 12 MedMNIST datasets with only ~4.5M parameters, offering 5-20x parameter reduction while maintaining accuracy.
Details
Motivation: Vision Transformers have strong potential in medical imaging but face challenges with high computational demands and overfitting on small datasets, limiting their real-world clinical applicability.Method: CoMViT integrates convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation for improved performance and generalization through systematic architectural optimization.
Result: CoMViT matches or outperforms deeper CNN and ViT variants across 12 MedMNIST datasets while maintaining lightweight design. Grad-CAM analyses show it consistently attends to clinically relevant regions despite compact size.
Conclusion: Principled ViT redesign enables efficient and interpretable models for low-resource medical imaging settings, demonstrating CoMViT’s potential for real-world clinical applications.
Abstract: Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.
[142] From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Jianwen Sun, Fanrui Zhang, Yukang Feng, Chuanhao Li, Zizhen Li, Jiaxin Ai, Yifan Chang, Yu Dai, Kaipeng Zhang
Main category: cs.CV
TL;DR: VisPainter is a multi-agent framework that generates editable vector graphics for scientific illustrations, addressing limitations of current raster image generation and code-based methods by providing element-level control and intuitive manipulation.
Details
Motivation: Current generative models for scientific illustrations have two major limitations: raster images lack semantic structure for editing, while code-based methods are cumbersome and not intuitive. Neither approach meets the needs for efficient, intuitive, and iterative scientific creation.Method: VisPainter uses a multi-agent framework with three specialized modules (Manager, Designer, and Toolbox) built on model context protocol to collaboratively produce vector graphics diagrams. It also introduces VisBench, a benchmark with seven-dimensional metrics to evaluate illustration quality.
Result: The framework enables true element-level control where any element can be added and modified later. Extensive ablation experiments verified the architecture’s rationality and evaluation methods’ reliability. Various vision-language models were evaluated with fair rankings and capability comparisons.
Conclusion: VisPainter successfully bridges the gap between raster image generation and code-based methods by providing an intuitive, efficient system for creating editable scientific illustrations with element-level control, supported by a comprehensive evaluation benchmark.
Abstract: Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations: Frist, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of “writing-compiling-reviewing” and lack the intuitiveness of manipulation. Neither of these two approaches can well meet the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules-a Manager, a Designer, and a Toolbox-to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design allows each element to be explicitly represented and manipulated, enabling true element-level control and any element can be added and modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. To this end, we conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control,and description on the quality of illustrations.
[143] A Multi-tiered Human-in-the-loop Approach for Interactive School Mapping Using Earth Observation and Machine Learning
Casper Fibaek, Abi Riley, Kelsey Doerksen, Do-Hyung Kim, Rochelle Schneider
Main category: cs.CV
TL;DR: A multi-tiered human-in-the-loop framework for interactive school mapping using machine learning and satellite imagery to improve educational facility records in developing regions.
Details
Motivation: To improve accuracy and completeness of educational facility records in developing regions where data is scarce and infrequently updated.Method: Three-tier approach: 1) ML analysis of population density, land cover, and infrastructure to identify gaps; 2) Medium-resolution satellite imagery (later removed); 3) Very high-resolution imagery with deep learning models for detailed candidate locations, combined with human operator review interface.
Result: Preliminary evaluations show the framework provides scalable and cost-effective solution for educational infrastructure mapping.
Conclusion: The multi-tiered human-in-the-loop strategy effectively supports planning and resource allocation for educational infrastructure in developing regions.
Abstract: This paper presents a multi-tiered human-in-the-loop framework for interactive school mapping designed to improve the accuracy and completeness of educational facility records, particularly in developing regions where such data may be scarce and infrequently updated. The first tier involves a machine learning based analysis of population density, land cover, and existing infrastructure compared with known school locations. The first tier identifies potential gaps and “mislabelled” schools. In subsequent tiers, medium-resolution satellite imagery (Sentinel-2) is investigated to pinpoint regions with a high likelihood of school presence, followed by the application of very high-resolution (VHR) imagery and deep learning models to generate detailed candidate locations for schools within these prioritised areas. The medium-resolution approach was later removed due to insignificant improvements. The medium and VHR resolution models build upon global pre-trained steps to improve generalisation. A key component of the proposed approach is an interactive interface to allow human operators to iteratively review, validate, and refine the mapping results. Preliminary evaluations indicate that the multi-tiered strategy provides a scalable and cost-effective solution for educational infrastructure mapping to support planning and resource allocation.
[144] NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Wei Xu, Cheng Wang, Dingkang Liang, Zongchuang Zhao, Xingyu Jiang, Peng Zhang, Xiang Bai
Main category: cs.CV
TL;DR: The paper introduces NautData, a 1.45M image-text pair dataset for underwater scene understanding, and NAUTILUS, an underwater LMM with a vision feature enhancement module that improves robustness against underwater image degradation.
Details
Motivation: Underwater exploration is important for resource exploration and national security, but lacks large-scale multi-task instruction-tuning datasets and suffers from image degradation issues that hinder automated underwater scene understanding.Method: Constructed NautData dataset with 1.45M image-text pairs supporting 8 tasks, and developed a plug-and-play vision feature enhancement (VFE) module using physical priors from underwater imaging models to restore clear information. Integrated VFE into LLaVA-1.5 and Qwen2.5-VL to create NAUTILUS.
Result: Experiments on NautData and public datasets show VFE consistently improves baseline performance on most tasks, demonstrating NAUTILUS’s superiority in underwater scene understanding.
Conclusion: The proposed VFE module effectively enhances underwater scene understanding by addressing image degradation, and NAUTILUS with NautData enables comprehensive development and evaluation of underwater perception models.
Abstract: Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.
[145] ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng
Main category: cs.CV
TL;DR: ThinkMorph is a unified multimodal reasoning model that generates complementary text-image reasoning chains, achieving significant performance gains on vision tasks and demonstrating emergent multimodal intelligence.
Details
Motivation: To address the unclear nature of meaningful interleaved multimodal reasoning chains and explore how text and vision should function as complementary rather than isomorphic modalities.Method: Fine-tuned a unified model on 24K high-quality interleaved reasoning traces across tasks with varying visual engagement, learning to generate progressive text-image reasoning steps that manipulate visual content while maintaining verbal logic.
Result: Achieved 34.7% average improvement over base model on vision-centric benchmarks, matched or surpassed larger proprietary VLMs on out-of-domain tasks, and exhibited emergent multimodal intelligence including visual manipulation skills and adaptive reasoning mode switching.
Conclusion: ThinkMorph demonstrates promising directions for characterizing emergent capabilities in unified multimodal reasoning models through complementary text-image reasoning chains.
Abstract: Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts.These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
[146] Context-Gated Cross-Modal Perception with Visual Mamba for PET-CT Lung Tumor Segmentation
Elena Mulero Ayllón, Linlin Shen, Pierangelo Veltri, Fabrizia Gelardi, Arturo Chiti, Paolo Soda, Matteo Tortora
Main category: cs.CV
TL;DR: vMambaX is a lightweight multimodal framework that integrates PET and CT scans using a Context-Gated Cross-Modal Perception Module for improved lung tumor segmentation, achieving better performance with lower computational complexity.
Details
Motivation: Accurate lung tumor segmentation is crucial for diagnosis and treatment planning, but effectively combining anatomical (CT) and functional (PET) information remains challenging.Method: Built on Visual Mamba architecture, vMambaX uses a Context-Gated Cross-Modal Perception Module to adaptively enhance inter-modality feature interaction, emphasizing informative regions while suppressing noise.
Result: Evaluated on PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity.
Conclusion: The results demonstrate the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and show vMambaX’s potential as an efficient and scalable framework for lung cancer analysis.
Abstract: Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining anatomical and functional information from PET and CT remains a major challenge. In this study, we propose vMambaX, a lightweight multimodal framework integrating PET and CT scan images through a Context-Gated Cross-Modal Perception Module (CGM). Built on the Visual Mamba architecture, vMambaX adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise. Evaluated on the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity. These results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and demonstrate the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis. The code is available at https://github.com/arco-group/vMambaX.
[147] Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds
Khandoker Ashik Uz Zaman, Mohammad Zahangir Alam, Mohammed N. M. Ali, Mahdi H. Miraz
Main category: cs.CV
TL;DR: A robust deep neural watermarking framework for 3D point cloud copyright protection that embeds watermarks using SVD and extracts them using PointNet++, achieving superior performance against geometric and non-geometric attacks compared to traditional methods.
Details
Motivation: 3D point clouds are vulnerable to geometric and non-geometric attacks that degrade conventional watermarks, requiring robust copyright protection solutions for digital 3D content.Method: Embeds binary watermarks into singular values of 3D point cloud blocks using SVD, and uses PointNet++ neural network for robust watermark extraction under various attacks including rotation, scaling, noise, cropping, and signal distortions.
Result: Deep learning-based extraction significantly outperforms traditional SVD methods, achieving bitwise accuracy up to 0.83 and IoU of 0.80 compared to SVD’s 0.58 accuracy and 0.26 IoU for severe crop attacks.
Conclusion: The proposed deep learning framework demonstrates superior watermark recovery and maintains high fidelity under severe distortions, providing effective copyright protection for 3D point clouds.
Abstract: The protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using spectral decomposition, i.e. Singular Value Decomposition (SVD), and leverages the extraction capabilities of Deep Learning using PointNet++ neural network architecture. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method using the publicly available ModelNet40 dataset, demonstrating that deep learning-based extraction significantly outperforms traditional SVD-based techniques under challenging conditions. Our experimental evaluation demonstrates that the deep learning-based extraction approach significantly outperforms existing SVD-based methods with deep learning achieving bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared to SVD achieving a bitwise accuracy of 0.58 and IoU of 0.26 for the Crop (70%) attack, which is the most severe geometric distortion in our experiment. This demonstrates our method’s ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.
[148] MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series
Xue Xia, Randall Balestriero, Tao Zhang, Yixin Zhou, Andrew Ding, Dev Saini, Lorenz Hurni
Main category: cs.CV
TL;DR: MapSAM2 is a unified framework for segmenting historical map images and time series by treating them as videos, enabling improved segmentation accuracy with few-shot fine-tuning and reduced annotation costs.
Details
Motivation: Historical maps are valuable archives but challenging to analyze automatically due to stylistic variability and scarce annotated data. Creating linked spatio-temporal datasets from map time series is labor-intensive but essential for applications like dating buildings and studying environmental changes.Method: Built on a visual foundation model, MapSAM2 treats historical map images and time series as videos. For images, it processes tiles as videos to leverage memory attention for contextual cues. For time series, it introduces the Siegfried Building Time Series Dataset and generates pseudo time series from single-year maps by simulating temporal transformations.
Result: Experimental results show MapSAM2 effectively learns temporal associations and accurately segments and links buildings in time series under limited supervision or using pseudo videos.
Conclusion: MapSAM2 provides an effective solution for automated historical map analysis with reduced annotation requirements, and the authors will release both dataset and code to support future research.
Abstract: Historical maps are unique and valuable archives that document geographic features across different time periods. However, automated analysis of historical map images remains a significant challenge due to their wide stylistic variability and the scarcity of annotated training data. Constructing linked spatio-temporal datasets from historical map time series is even more time-consuming and labor-intensive, as it requires synthesizing information from multiple maps. Such datasets are essential for applications such as dating buildings, analyzing the development of road networks and settlements, studying environmental changes etc. We present MapSAM2, a unified framework for automatically segmenting both historical map images and time series. Built on a visual foundation model, MapSAM2 adapts to diverse segmentation tasks with few-shot fine-tuning. Our key innovation is to treat both historical map images and time series as videos. For images, we process a set of tiles as a video, enabling the memory attention mechanism to incorporate contextual cues from similar tiles, leading to improved geometric accuracy, particularly for areal features. For time series, we introduce the annotated Siegfried Building Time Series Dataset and, to reduce annotation costs, propose generating pseudo time series from single-year maps by simulating common temporal transformations. Experimental results show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos. We will release both our dataset and code to support future research.
[149] Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Ilyass Moummad, Kawtar Zaher, Hervé Goëau, Alexis Joly
Main category: cs.CV
TL;DR: CroVCA introduces a simple unified principle for learning binary codes that remain consistent across semantically aligned views, achieving state-of-the-art hashing performance with high efficiency.
Details
Motivation: Foundation models provide powerful embeddings but high-dimensional nearest neighbor search is computationally expensive. Hashing offers efficient search but existing approaches have complex pipelines, multi-term objectives, and long training times.Method: Uses Cross-View Code Alignment (CroVCA) with single binary cross-entropy loss for alignment and coding-rate maximization as anti-collapse regularizer. Implements HashCoder - lightweight MLP with batch normalization for balanced codes. Can be used as probing head on frozen embeddings or via LoRA fine-tuning.
Result: Achieves state-of-the-art results in just 5 training epochs. Unsupervised hashing on COCO completes in under 2 minutes, supervised hashing on ImageNet100 in about 3 minutes on single GPU. Particularly effective at 16 bits.
Conclusion: CroVCA demonstrates high efficiency, adaptability, and broad applicability for large-scale retrieval with compact binary codes.
Abstract: Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA’s efficiency, adaptability, and broad applicability.
[150] ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Samarup Bhattacharya, Anubhab Bhattacharya, Abir Chakraborty
Main category: cs.CV
TL;DR: ANCHOR is a framework that uses supervised contrastive learning with hard positive mining to create robust neural networks against adversarial attacks by clustering embeddings of images, their augmentations, and perturbed versions together in the embedding space.
Details
Motivation: Neural networks are vulnerable to adversarial attacks where small, imperceptible changes to images can cause wrong predictions. The gradients that help models learn can also be exploited to create these attacks.Method: Leverages supervised contrastive learning with explicit hard positive mining to learn representations where embeddings for images, their augmentations, and perturbed versions cluster together by class while being separated from other classes.
Result: On CIFAR-10, achieves impressive results for both clean and robust accuracy under PGD-20 (epsilon = 0.031), outperforming standard adversarial training methods.
Conclusion: Combining adversarial guidance with hard-mined contrastive supervision helps models learn more structured and robust representations, narrowing the gap between accuracy and robustness.
Abstract: Neural networks have changed the way machines interpret the world. At their core, they learn by following gradients, adjusting their parameters step by step until they identify the most discriminant patterns in the data. This process gives them their strength, yet it also opens the door to a hidden flaw. The very gradients that help a model learn can also be used to produce small, imperceptible tweaks that cause the model to completely alter its decision. Such tweaks are called adversarial attacks. These attacks exploit this vulnerability by adding tiny, imperceptible changes to images that, while leaving them identical to the human eye, cause the model to make wrong predictions. In this work, we propose Adversarially-trained Contrastive Hard-mining for Optimized Robustness (ANCHOR), a framework that leverages the power of supervised contrastive learning with explicit hard positive mining to enable the model to learn representations for images such that the embeddings for the images, their augmentations, and their perturbed versions cluster together in the embedding space along with those for other images of the same class while being separated from images of other classes. This alignment helps the model focus on stable, meaningful patterns rather than fragile gradient cues. On CIFAR-10, our approach achieves impressive results for both clean and robust accuracy under PGD-20 (epsilon = 0.031), outperforming standard adversarial training methods. Our results indicate that combining adversarial guidance with hard-mined contrastive supervision helps models learn more structured and robust representations, narrowing the gap between accuracy and robustness.
[151] Who Made This? Fake Detection and Source Attribution with Diffusion Features
Simone Bonechi, Paolo Andreini, Barbara Toniella Corradini
Main category: cs.CV
TL;DR: FRIDA is a lightweight framework that uses diffusion model features for deepfake detection and source identification, achieving state-of-the-art cross-generator performance without fine-tuning.
Details
Motivation: Existing supervised detectors struggle to generalize across unseen generators and require extensive labeled data and frequent retraining, raising concerns about authenticity, copyright, and misinformation.Method: Leverages internal activations from pre-trained diffusion models with k-nearest-neighbor classifier for detection and a compact neural model for source attribution.
Result: Achieves state-of-the-art cross-generator performance without fine-tuning and enables accurate source generator attribution.
Conclusion: Diffusion representations inherently encode generator-specific patterns, providing a simple and interpretable foundation for synthetic image forensics.
Abstract: The rapid progress of generative diffusion models has enabled the creation of synthetic images that are increasingly difficult to distinguish from real ones, raising concerns about authenticity, copyright, and misinformation. Existing supervised detectors often struggle to generalize across unseen generators, requiring extensive labeled data and frequent retraining. We introduce FRIDA (Fake-image Recognition and source Identification via Diffusion-features Analysis), a lightweight framework that leverages internal activations from a pre-trained diffusion model for deepfake detection and source generator attribution. A k-nearest-neighbor classifier applied to diffusion features achieves state-of-the-art cross-generator performance without fine-tuning, while a compact neural model enables accurate source attribution. These results show that diffusion representations inherently encode generator-specific patterns, providing a simple and interpretable foundation for synthetic image forensics.
[152] Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
Main category: cs.CV
TL;DR: Spatial-SSRL introduces self-supervised reinforcement learning with verifiable rewards using five pretext tasks derived from ordinary images to improve spatial understanding in LVLMs without costly supervision.
Details
Motivation: Spatial understanding is a weakness in Large Vision-Language Models, and existing methods require costly supervision, specialized tools, or constrained environments that limit scalability.Method: Spatial-SSRL automatically formulates five pretext tasks: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction, which provide verifiable ground-truth answers without human annotation.
Result: Training on these tasks improves spatial reasoning while preserving general visual capabilities, achieving average accuracy gains of 4.63% (3B) and 3.89% (7B) over Qwen2.5-VL baselines across seven spatial understanding benchmarks in image and video settings.
Conclusion: Simple, intrinsic supervision enables reinforcement learning with verifiable rewards at scale and provides a practical route to stronger spatial intelligence in LVLMs.
Abstract: Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
[153] Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin
Main category: cs.CV
TL;DR: DUST is a dual-stream diffusion framework that enhances Vision-Language-Action models by handling modality conflicts through separate streams with cross-modal sharing, independent noise perturbations, and decoupled training, achieving performance gains in simulation and real-world robotics.
Details
Motivation: Address the challenge of jointly predicting next-state observations and action sequences in Vision-Language-Action models due to inherent modality differences, and improve robotic policy learning through world modeling.Method: Multimodal diffusion transformer with separate modality streams, independent noise perturbations for each modality, decoupled flow-matching loss, and joint sampling with test-time scaling where action and vision tokens evolve asynchronously.
Result: Achieves up to 6% gains over baselines on RoboCasa and GR-1 benchmarks, additional 2-5% boost from test-time scaling, 13% improvement on real-world Franka Research 3 tasks, and significant transfer gains from BridgeV2 pre-training.
Conclusion: DUST effectively handles modality conflicts in VLAs through dual-stream architecture and decoupled training, demonstrating strong performance improvements in both simulation and real-world robotics with potential for large-scale pre-training.
Abstract: Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST’s potential for large-scale VLA pretraining.
[154] Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation
Riccardo Brioschi, Aleksandr Alekseev, Emanuele Nevali, Berkay Döner, Omar El Malki, Blagoj Mitrevski, Leandro Kieliger, Mark Collier, Andrii Maksai, Jesse Berent, Claudiu Musat, Efi Kokiopoulou
Main category: cs.CV
TL;DR: The paper introduces a sketch-to-layout approach that uses user-provided sketches as intuitive constraints for graphic layout generation, outperforming state-of-the-art constraint-based methods while offering better usability.
Details
Motivation: Current constraint-based layout generation methods require complex specifications that reduce usability. The authors aim to provide a more intuitive design experience by using sketches as natural constraints.Method: Proposes a multimodal transformer-based solution that takes sketches and content assets as inputs. Introduces a novel method to synthetically generate training sketches at scale to avoid costly human annotation.
Result: The model outperforms state-of-the-art constraint-based methods on three public datasets (PubLayNet, DocLayNet, SlidesVQA) and releases O(200k) synthetically-generated sketches to facilitate future research.
Conclusion: Sketch-to-layout is established as a promising research direction, offering more intuitive design guidance while achieving superior performance compared to existing constraint-based approaches.
Abstract: Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at https://github.com/google-deepmind/sketch_to_layout.
[155] VessShape: Few-shot 2D blood vessel segmentation by leveraging shape priors from synthetic images
Cesar H. Comin, Wesley N. Galvão
Main category: cs.CV
TL;DR: VessShape introduces synthetic 2D datasets with procedural tubular geometries to instill shape bias in segmentation models, improving few-shot and zero-shot performance on real-world blood vessel datasets.
Details
Motivation: Overcome data scarcity and poor cross-domain generalization in blood vessel segmentation by addressing CNNs' tendency to learn texture-based features rather than shape cues.Method: Generate large-scale synthetic datasets with procedurally generated tubular geometries combined with varied textures, encouraging models to learn shape priors of vessel structures.
Result: Models pre-trained on VessShape achieve strong few-shot segmentation with only 4-10 samples for fine-tuning, and exhibit notable zero-shot capabilities on unseen domains without target-specific training.
Conclusion: Pre-training with strong shape bias is an effective strategy to overcome data scarcity and improve generalization in blood vessel segmentation tasks.
Abstract: Semantic segmentation of blood vessels is an important task in medical image analysis, but its progress is often hindered by the scarcity of large annotated datasets and the poor generalization of models across different imaging modalities. A key aspect is the tendency of Convolutional Neural Networks (CNNs) to learn texture-based features, which limits their performance when applied to new domains with different visual characteristics. We hypothesize that leveraging geometric priors of vessel shapes, such as their tubular and branching nature, can lead to more robust and data-efficient models. To investigate this, we introduce VessShape, a methodology for generating large-scale 2D synthetic datasets designed to instill a shape bias in segmentation models. VessShape images contain procedurally generated tubular geometries combined with a wide variety of foreground and background textures, encouraging models to learn shape cues rather than textures. We demonstrate that a model pre-trained on VessShape images achieves strong few-shot segmentation performance on two real-world datasets from different domains, requiring only four to ten samples for fine-tuning. Furthermore, the model exhibits notable zero-shot capabilities, effectively segmenting vessels in unseen domains without any target-specific training. Our results indicate that pre-training with a strong shape bias can be an effective strategy to overcome data scarcity and improve model generalization in blood vessel segmentation.
[156] NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Congzhang Shao, Quan Yuan, Guiyang Luo, Yue Hu, Danni Wang, Yilin Liu, Rui Pan, Bo Chen, Jinglin Li
Main category: cs.CV
TL;DR: NegoCollab proposes a negotiated common representation method for heterogeneous collaborative perception, using a negotiator to derive common representations from local agent features and multiple alignment losses to reduce domain gaps.
Details
Motivation: Immutable heterogeneity in collaborative perception causes domain gaps between different agents' features, degrading performance. Existing methods use one agent's representation as common, which fails for agents with large domain discrepancies.Method: Uses a negotiator to derive common representation from local representations, sender-receiver pairs for feature transformation, and three alignment losses (distribution, structural, pragmatic) for training.
Result: The method effectively reduces domain gaps between heterogeneous agents and enables better feature alignment compared to using one agent’s representation as common.
Conclusion: NegoCollab successfully addresses heterogeneous collaboration challenges through negotiated common representations and comprehensive alignment supervision, improving collaborative perception performance.
Abstract: Collaborative perception improves task performance by expanding the perception range through information sharing among agents. . Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on the negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality’s agent, effectively reducing the inherent domain gap with the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise the training. This enables the knowledge in the common representation to be fully distilled into the sender.
[157] Gaussian Combined Distance: A Generic Metric for Object Detection
Ziqian Guan, Xieyi Fu, Pengjun Huang, Hengyuan Zhang, Hubin Du, Yongtao Liu, Yinglin Wang, Qang Ma
Main category: cs.CV
TL;DR: GCD is a new similarity metric for object detection that addresses limitations of IoU and Wasserstein Distance, particularly for small objects. It offers scale invariance and joint optimization to improve detection performance.
Details
Motivation: IoU-based metrics perform poorly for small objects due to sensitivity to positional deviations, while Wasserstein Distance lacks scale invariance and has slow convergence when used as loss function.Method: Proposed Gaussian Combined Distance (GCD) as a new similarity metric that provides scale invariance and enables joint optimization of bounding box attributes through analytical gradient analysis.
Result: GCD achieves state-of-the-art performance on AI-TOD-v2 for tiny object detection, and outperforms Wasserstein Distance on MS-COCO-2017 and Visdrone-2019 datasets across various scales.
Conclusion: GCD is an effective alternative to IoU and Wasserstein Distance for object detection, particularly beneficial for small object detection with improved generalization and convergence properties.
Abstract: In object detection, a well-defined similarity metric can significantly enhance model performance. Currently, the IoU-based similarity metric is the most commonly preferred choice for detectors. However, detectors using IoU as a similarity metric often perform poorly when detecting small objects because of their sensitivity to minor positional deviations. To address this issue, recent studies have proposed the Wasserstein Distance as an alternative to IoU for measuring the similarity of Gaussian-distributed bounding boxes. However, we have observed that the Wasserstein Distance lacks scale invariance, which negatively impacts the model’s generalization capability. Additionally, when used as a loss function, its independent optimization of the center attributes leads to slow model convergence and unsatisfactory detection precision. To address these challenges, we introduce the Gaussian Combined Distance (GCD). Through analytical examination of GCD and its gradient, we demonstrate that GCD not only possesses scale invariance but also facilitates joint optimization, which enhances model localization performance. Extensive experiments on the AI-TOD-v2 dataset for tiny object detection show that GCD, as a bounding box regression loss function and label assignment metric, achieves state-of-the-art performance across various detectors. We further validated the generalizability of GCD on the MS-COCO-2017 and Visdrone-2019 datasets, where it outperforms the Wasserstein Distance across diverse scales of datasets. Code is available at https://github.com/MArKkwanGuan/mmdet-GCD.
[158] PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Danyal Maqbool, Changhee Lee, Zachary Huemann, Samuel D. Church, Matthew E. Larson, Scott B. Perlman, Tomas A. Romero, Joshua D. Warner, Meghan Lubner, Xin Tie, Jameson Merkow, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw
Main category: cs.CV
TL;DR: PETAR-4B is a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation in medical imaging.
Details
Motivation: Most medical vision-language applications are limited to 2D imaging, while 3D PET/CT imaging presents challenges with large volumetric data, small lesions, and lengthy reports.Method: Created a large-scale dataset with 11,000+ lesion descriptions and 3D segmentations from 5,000+ PET/CT exams using hybrid rule-based and LLM pipeline. Developed PETAR-4B model that bridges global contextual reasoning with fine-grained lesion awareness.
Result: PETAR substantially improves PET/CT report generation quality, producing clinically coherent and localized findings according to comprehensive automated and human evaluations.
Conclusion: The work advances 3D medical vision-language understanding by extending VLMs to handle 3D PET/CT imaging with spatially grounded report generation.
Abstract: Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
[159] Deep learning denoising unlocks quantitative insights in operando materials microscopy
Samuel Degnan-Morgenstern, Alexander E. Cohen, Rajeev Gopal, Megan Gober, George J. Nelson, Peng Bai, Martin Z. Bazant
Main category: cs.CV
TL;DR: A deep learning framework for denoising operando microscopy data that preserves physical fidelity and enables quantitative analysis across multiple imaging modalities.
Details
Motivation: Measurement noise limits resolution and undermines quantitative analysis in operando microscopy, which provides direct insight into dynamic chemical and physical processes in functional materials.Method: Unsupervised deep learning-based denoising integrated into quantitative microscopy workflows, validated using simulated data with PDE-constrained optimization and applied to multiple experimental techniques including STXM, optical microscopy, and neutron radiography.
Result: Denoising revealed nanoscale heterogeneity in STXM of LFP, enabled automated particle segmentation in optical microscopy of graphite electrodes, and reduced noise-induced variability by nearly 80% in neutron radiography to resolve heterogeneous lithium transport.
Conclusion: Deep denoising is established as a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques.
Abstract: Operando microscopy provides direct insight into the dynamic chemical and physical processes that govern functional materials, yet measurement noise limits the effective resolution and undermines quantitative analysis. Here, we present a general framework for integrating unsupervised deep learning-based denoising into quantitative microscopy workflows across modalities and length scales. Using simulated data, we demonstrate that deep denoising preserves physical fidelity, introduces minimal bias, and reduces uncertainty in model learning with partial differential equation (PDE)-constrained optimization. Applied to experiments, denoising reveals nanoscale chemical and structural heterogeneity in scanning transmission X-ray microscopy (STXM) of lithium iron phosphate (LFP), enables automated particle segmentation and phase classification in optical microscopy of graphite electrodes, and reduces noise-induced variability by nearly 80% in neutron radiography to resolve heterogeneous lithium transport. Collectively, these results establish deep denoising as a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques.
[160] Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes
Bo Li, Duyuan Zheng, Xinyang Liu, Qingwen Li, Hong Li, Hongyan Cui, Ge Gao, Chen Liu
Main category: cs.CV
TL;DR: Sh-ViT is a lightweight Vision Transformer model for occluded person re-identification that uses shuffling modules, scenario-adapted augmentation, and knowledge distillation to achieve robust performance in surveillance conditions.
Details
Motivation: Person re-identification in surveillance faces challenges from occlusion, viewpoint distortion, and poor image quality, while existing methods often rely on complex modules or only work well on clear frontal images.Method: Built on ViT-Base, Sh-ViT introduces: 1) Shuffle module in final Transformer layer to break spatial correlations, 2) Scenario-adapted augmentation (geometric transforms, erasing, blur, color adjustment), 3) DeiT-based knowledge distillation for limited labels. Also created MyTT dataset for real-world evaluation.
Result: Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT dataset, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods.
Conclusion: Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
Abstract: Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: First, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; Second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; Third, DeiT-based knowledge distillation to improve learning with limited labels.To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods.In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
[161] Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, Lei Yang
Main category: cs.CV
TL;DR: Phased DMD is a multi-step distillation framework that combines phase-wise distillation with Mixture-of-Experts to improve generative model performance while maintaining diversity, addressing limitations in one-step and traditional multi-step distillation methods.
Details
Motivation: One-step distilled models underperform on complex generative tasks due to limited model capacity, while direct multi-step distillation increases memory usage and computational depth, causing instability and reduced efficiency. Stochastic gradient truncation reduces generation diversity.Method: Phased DMD divides the SNR range into subintervals and uses progressive distribution matching with score matching within each subinterval. It combines phase-wise distillation with Mixture-of-Experts to reduce learning difficulty while enhancing model capacity.
Result: Experimental results show Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. It was validated by distilling state-of-the-art image and video generation models including Qwen-Image (20B parameters) and Wan2.2 (28B parameters).
Conclusion: Phased DMD effectively bridges phase-wise distillation with Mixture-of-Experts, providing a solution that maintains generation diversity while improving performance on complex generative tasks compared to existing distillation methods.
Abstract: Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure the training objective within each subinterval is accurate, we have conducted rigorous mathematical derivations. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.
[162] LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar
Soumitra Kundu, Gargi Panda, Saumik Bhattacharya, Aurobinda Routray, Rajlakshmi Guha
Main category: cs.CV
TL;DR: LifWavNet is a lifting wavelet network for non-contact ECG reconstruction from radar signals, using learnable wavelets and multi-resolution analysis to outperform existing methods in ECG reconstruction and vital sign estimation.
Details
Motivation: To enable unobtrusive cardiac monitoring through non-contact ECG reconstruction from radar signals, overcoming limitations of fixed wavelet approaches in prior models.Method: Uses learnable lifting wavelets with lifting and inverse lifting units in a multi-resolution analysis and synthesis model, combined with multi-resolution STFT loss for temporal and spectral consistency.
Result: Outperforms state-of-the-art methods on two public datasets in ECG reconstruction and downstream vital sign estimation (heart rate and heart rate variability), with interpretable multi-resolution feature visualization.
Conclusion: LifWavNet establishes a robust framework for radar-based non-contact ECG measurement with improved reconstruction fidelity and interpretability.
Abstract: Non-contact electrocardiogram (ECG) reconstruction from radar signals offers a promising approach for unobtrusive cardiac monitoring. We present LifWavNet, a lifting wavelet network based on a multi-resolution analysis and synthesis (MRAS) model for radar-to-ECG reconstruction. Unlike prior models that use fixed wavelet approaches, LifWavNet employs learnable lifting wavelets with lifting and inverse lifting units to adaptively capture radar signal features and synthesize physiologically meaningful ECG waveforms. To improve reconstruction fidelity, we introduce a multi-resolution short-time Fourier transform (STFT) loss, that enforces consistency with the ground-truth ECG in both temporal and spectral domains. Evaluations on two public datasets demonstrate that LifWavNet outperforms state-of-the-art methods in ECG reconstruction and downstream vital sign estimation (heart rate and heart rate variability). Furthermore, intermediate feature visualization highlights the interpretability of multi-resolution decomposition and synthesis in radar-to-ECG reconstruction. These results establish LifWavNet as a robust framework for radar-based non-contact ECG measurement.
[163] Continual Vision-and-Language Navigation
Seongjun Jeong, Gi-Cheon Kang, Seongho Choi, Joochan Kim, Byoung-Tak Zhang
Main category: cs.CV
TL;DR: Proposes Continual Vision-and-Language Navigation (CVLN) paradigm for agents to learn across multiple scene domains, with two setups and two effective replay-based baselines.
Details
Motivation: Traditional VLN assumes train-once-deploy-once strategy, which is unrealistic as agents continually encounter novel environments in real-world deployment.Method: Introduces CVLN with two setups: Initial-instruction based and Dialogue-based navigation. Proposes two replay-based methods: Perplexity Replay (PerpR) for difficult episodes and Episodic Self-Replay (ESR) for storing action logits.
Result: Existing continual learning methods perform poorly on CVLN, while PerpR and ESR achieve better performance by efficiently using replay memory.
Conclusion: CVLN addresses the need for continual learning in VLN, and the proposed replay-based methods effectively handle sequential decision-making across multiple scene domains.
Abstract: Developing Vision-and-Language Navigation (VLN) agents typically assumes a \textit{train-once-deploy-once} strategy, which is unrealistic as deployed agents continually encounter novel environments. To address this, we propose the Continual Vision-and-Language Navigation (CVLN) paradigm, where agents learn and adapt incrementally across multiple \textit{scene domains}. CVLN includes two setups: Initial-instruction based CVLN for instruction-following, and Dialogue-based CVLN for dialogue-guided navigation. We also introduce two simple yet effective baselines for sequential decision-making: Perplexity Replay (PerpR), which replays difficult episodes, and Episodic Self-Replay (ESR), which stores and revisits action logits during training. Experiments show that existing continual learning methods fall short for CVLN, while PerpR and ESR achieve better performance by efficiently utilizing replay memory.
[164] SRAGAN: Saliency Regularized and Attended Generative Adversarial Network for Chinese Ink-wash Painting Style Transfer
Xiang Gao, Yuqi Zhang
Main category: cs.CV
TL;DR: Proposes a saliency-guided GAN framework for Chinese ink-wash painting style transfer that preserves content structure using saliency detection, SIOU loss, SANorm, and saliency-attended discriminator.
Details
Motivation: Existing I2I translation methods for ink-wash painting style transfer often erase or corrupt source image content details when transferring style elements.Method: Incorporates saliency detection into unpaired I2I framework with SIOU loss for explicit content regularization, SANorm for implicit structure enhancement, and saliency-attended discriminator for focused adversarial training.
Result: Superior performance over advanced image stylization methods in both GAN and diffusion model paradigms, demonstrated through extensive qualitative and quantitative experiments.
Conclusion: Saliency-guided approach effectively preserves object content structure while generating vivid ink-wash painting styles, addressing the content corruption issue in traditional style transfer methods.
Abstract: Recent style transfer problems are still largely dominated by Generative Adversarial Network (GAN) from the perspective of cross-domain image-to-image (I2I) translation, where the pivotal issue is to learn and transfer target-domain style patterns onto source-domain content images. This paper handles the problem of translating real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though a wide range of I2I models tackle this problem, a notable challenge is that the content details of the source image could be easily erased or corrupted due to the transfer of ink-wash style elements. To remedy this issue, we propose to incorporate saliency detection into the unpaired I2I framework to regularize image content, where the detected saliency map is utilized from two aspects: (\romannumeral1) we propose saliency IOU (SIOU) loss to explicitly regularize object content structure by enforcing saliency consistency before and after image stylization; (\romannumeral2) we propose saliency adaptive normalization (SANorm) which implicitly enhances object structure integrity of the generated paintings by dynamically injecting image saliency information into the generator to guide stylization process. Besides, we also propose saliency attended discriminator which harnesses image saliency information to focus generative adversarial attention onto the drawn objects, contributing to generating more vivid and delicate brush strokes and ink-wash textures. Extensive qualitative and quantitative experiments demonstrate superiority of our approach over related advanced image stylization methods in both GAN and diffusion model paradigms.
[165] GASP: Gaussian Splatting for Physic-Based Simulations
Piotr Borycki, Weronika Smolak, Joanna Waczyńska, Marcin Mazur, Sławomir Tadeja, Przemysław Spurek
Main category: cs.CV
TL;DR: GASP introduces a physics simulation pipeline for Gaussian Splatting that uses parametrized flat Gaussian distributions and Gaussian grouping to enable efficient integration with physics engines without requiring additional meshing mechanisms.
Details
Motivation: Integrating physics simulation with state-of-the-art 3D scene rendering techniques like Gaussian Splatting is challenging, as existing approaches require additional meshing mechanisms or modify physics dynamics.Method: Uses parametrized flat Gaussian distributions to reduce physics modeling to working with 3D points, implements Gaussian grouping for hierarchical structuring and selective simulation, and provides rules for manipulating Gaussians and controlling their sizes.
Result: The pipeline demonstrates superior performance on diverse benchmark datasets for 3D object rendering and can be integrated into any physics engine as a black box.
Conclusion: GASP provides an effective solution for physics-based simulation with Gaussian Splatting, enabling efficient integration with physics engines while maintaining rendering quality.
Abstract: Physics simulation is paramount for modeling and utilizing 3D scenes in various real-world applications. However, integrating with state-of-the-art 3D scene rendering techniques such as Gaussian Splatting (GS) remains challenging. Existing models use additional meshing mechanisms, including triangle or tetrahedron meshing, marching cubes, or cage meshes. Alternatively, we can modify the physics-grounded Newtonian dynamics to align with 3D Gaussian components. Current models take the first-order approximation of a deformation map, which locally approximates the dynamics by linear transformations. In contrast, our GS for Physics-Based Simulations (GASP) pipeline uses parametrized flat Gaussian distributions. Consequently, the problem of modeling Gaussian components using the physics engine is reduced to working with 3D points. In our work, we present additional rules for manipulating Gaussians, demonstrating how to adapt the pipeline to incorporate meshes, control Gaussian sizes during simulations, and enhance simulation efficiency. This is achieved through the Gaussian grouping strategy, which implements hierarchical structuring and enables simulations to be performed exclusively on selected Gaussians. The resulting solution can be integrated into any physics engine that can be treated as a black box. As demonstrated in our studies, the proposed pipeline exhibits superior performance on a diverse range of benchmark datasets designed for 3D object rendering. The project webpage, which includes additional visualizations, can be found at https://waczjoan.github.io/GASP.
[166] EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting
Bohao Liao, Wei Zhai, Zengyu Wan, Zhixin Cheng, Wenfei Yang, Tianzhu Zhang, Yang Cao, Zheng-Jun Zha
Main category: cs.CV
TL;DR: EF-3DGS integrates event cameras with 3D Gaussian Splatting (3DGS) to improve scene reconstruction from casually captured videos, especially in high-speed scenarios where traditional methods fail.
Details
Motivation: Traditional camera-based scene reconstruction methods struggle with high-speed or low-frame-rate scenarios. Event cameras provide high temporal resolution and motion information during blind inter-frame intervals, making them valuable for improving reconstruction quality.Method: Three key components: 1) Event Generation Model to fuse events and frames for view supervision, 2) Contrast Maximization framework for motion extraction and pose calibration, 3) Photometric bundle adjustment for view consistency between events and frames.
Result: Evaluated on Tanks and Temples benchmark and RealEv-DAVIS dataset, showing improved performance in challenging high-speed scenarios.
Conclusion: EF-3DGS successfully integrates event camera advantages into 3DGS, enabling robust scene reconstruction from casually captured videos even in challenging high-speed conditions.
Abstract: Scene reconstruction from casually captured videos has wide applications in real-world scenarios. With recent advancements in differentiable rendering techniques, several methods have attempted to simultaneously optimize scene representations (NeRF or 3DGS) and camera poses. Despite recent progress, existing methods relying on traditional camera input tend to fail in high-speed (or equivalently low-frame-rate) scenarios. Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution, providing valuable scene and motion information in blind inter-frame intervals. In this paper, we introduce the event camera to aid scene construction from a casually captured video for the first time, and propose Event-Aided Free-Trajectory 3DGS, called EF-3DGS, which seamlessly integrates the advantages of event cameras into 3DGS through three key components. First, we leverage the Event Generation Model (EGM) to fuse events and frames, supervising the rendered views observed by the event stream. Second, we adopt the Contrast Maximization (CMax) framework in a piece-wise manner to extract motion information by maximizing the contrast of the Image of Warped Events (IWE), thereby calibrating the estimated poses. Besides, based on the Linear Event Generation Model (LEGM), the brightness information encoded in the IWE is also utilized to constrain the 3DGS in the gradient domain. Third, to mitigate the absence of color information of events, we introduce photometric bundle adjustment (PBA) to ensure view consistency across events and frames. We evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS. Our project page is https://lbh666.github.io/ef-3dgs/.
[167] PROFIT: A Specialized Optimizer for Deep Fine Tuning
Anirudh S Chakravarthy, Shuai Kyle Zheng, Xin Huang, Sachithra Hemachandra, Xiao Zhang, Yuning Chai, Zhao Chen
Main category: cs.CV
TL;DR: PROFIT is a novel optimizer designed for fine-tuning converged models that explicitly considers model convergence properties through temporal gradient orthogonalization, outperforming traditional methods across various tasks.
Details
Motivation: While much attention has been paid to efficient fine-tuning, there's been less focus on improving model performance through fine-tuning. The paper addresses this gap by developing an optimizer specifically for fine-tuning converged models.Method: PROFIT employs a temporal gradient-orthogonalization process that explicitly takes the properties of converged models into account to regularize optimization, unlike traditional optimizers like SGD or Adam.
Result: PROFIT outperforms fine-tuning methods in various tasks including image classification, multimodal language model training, and large-scale motion prediction.
Conclusion: PROFIT is an effective modular optimizer that can be easily integrated into any training pipeline with minimal engineering effort, providing improved performance for fine-tuning converged models.
Abstract: The fine-tuning of pre-trained models has become ubiquitous in generative AI, computer vision, and robotics. Although much attention has been paid to improving the efficiency of fine-tuning model, there has been less scholarship around fine-tuning specifically for improved model performance. To remedy this gap, we present PROFIT, one of the first optimizers designed to incrementally fine-tune converged models on new tasks and/or datasets. Unlike traditional optimizers such as SGD or Adam, which make minimal assumptions due to random initializations, PROFIT takes the properties of a converged model into account explicitly to regularize the optimization process. Employing a temporal gradient-orthogonalization process, PROFIT outperforms fine-tuning methods in various tasks, from image classification to multimodal language model training to large-scale motion prediction. Moreover, PROFIT is encapsulated as a modular optimizer, which makes it easy to integrate directly into any training pipeline with minimal engineering effort.
[168] MixedGaussianAvatar: Realistically and Geometrically Accurate Head Avatar via Mixed 2D-3D Gaussians
Peng Chen, Xiaobao Wei, Qingpo Wuwu, Xinyi Wang, Xingyu Xiao, Ming Lu
Main category: cs.CV
TL;DR: MixedGaussianAvatar combines 2D and 3D Gaussian representations for high-fidelity head avatar reconstruction, achieving both geometric accuracy and rendering quality while being animatable with FLAME parameters.
Details
Motivation: Existing methods using Neural Radiance Fields (NeRF) are slow for training and rendering, while 3D Gaussian Splatting (3DGS) methods are efficient but have poor geometric accuracy. 2DGS improves geometry but sacrifices rendering fidelity. The goal is to leverage benefits from both approaches.Method: Proposes a mixed 2D-3D Gaussian representation where 2D Gaussians reconstruct head surfaces for geometric accuracy (attached to FLAME mesh), and 3D Gaussians are connected where 2DGS rendering is inadequate. Uses progressive training strategy: first train 2D Gaussians, then fine-tune mixed representation. Integrates 2D images and 3D mesh with unified representation.
Result: The method achieves superior performance in both geometric accuracy and rendering fidelity compared to existing approaches. The avatars can be animated using FLAME parameters.
Conclusion: MixedGaussianAvatar successfully combines the strengths of 2DGS and 3DGS, providing a solution that is both geometrically accurate and realistically rendered for head avatar reconstruction, with efficient training and animation capabilities.
Abstract: Reconstructing high-fidelity 3D head avatars is crucial in various applications such as virtual reality. The pioneering methods reconstruct realistic head avatars with Neural Radiance Fields (NeRF), which have been limited by training and rendering speed. Recent methods based on 3D Gaussian Splatting (3DGS) significantly improve the efficiency of training and rendering. However, the surface inconsistency of 3DGS results in subpar geometric accuracy; later, 2DGS uses 2D surfels to enhance geometric accuracy at the expense of rendering fidelity. To leverage the benefits of both 2DGS and 3DGS, we propose a novel method named MixedGaussianAvatar for realistically and geometrically accurate head avatar reconstruction. Our main idea is to utilize 2D Gaussians to reconstruct the surface of the 3D head, ensuring geometric accuracy. We attach the 2D Gaussians to the triangular mesh of the FLAME model and connect additional 3D Gaussians to those 2D Gaussians where the rendering quality of 2DGS is inadequate, creating a mixed 2D-3D Gaussian representation. These 2D-3D Gaussians can then be animated using FLAME parameters. We further introduce a progressive training strategy that first trains the 2D Gaussians and then fine-tunes the mixed 2D-3D Gaussians. We use a unified mixed Gaussian representation to integrate the two modalities of 2D image and 3D mesh. Furthermore, the comprehensive experiments demonstrate the superiority of MixedGaussianAvatar. The code will be released.
[169] DPA: A one-stop metric to measure bias amplification in classification datasets
Bhanu Tokas, Rahul Nair, Hannah Kerner
Main category: cs.CV
TL;DR: DPA is a new predictability-based metric for measuring bias amplification in ML models that addresses limitations of existing metrics by being directional, working with both balanced and unbalanced datasets, and correctly identifying positive/negative bias amplification.
Details
Motivation: Existing bias amplification metrics have limitations - co-occurrence-based metrics fail with balanced datasets or negative bias amplification, while predictability-based metrics like LA lack directionality. There's a need for a comprehensive metric that addresses all these issues.Method: Proposed Directional Predictability Amplification (DPA), a predictability-based metric that measures relative changes in predictability of protected attributes from model predictions. It uses attacker functions but is less sensitive to their choice and reports bounded scores.
Result: Experiments on COMPAS, COCO, and ImSitu datasets show DPA is the most reliable metric for measuring bias amplification. It eliminates the need to use multiple metrics and improves over prior predictability-based metrics.
Conclusion: DPA provides a comprehensive solution for measuring bias amplification, addressing key limitations of existing metrics while being applicable to various dataset types and providing directional analysis of bias amplification.
Abstract: Most ML datasets today contain biases. When we train models on these datasets, they often not only learn these biases but can worsen them – a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification in classification datasets. They measure bias amplification between a protected attribute (e.g., gender) and a task (e.g., cooking). These metrics also support fine-grained bias analysis by identifying the direction in which a model amplifies biases. However, co-occurrence-based metrics have limitations – some fail to measure bias amplification in balanced datasets, while others fail to measure negative bias amplification. To solve these issues, recent work proposed a predictability-based metric called leakage amplification (LA). However, LA cannot identify the direction in which a model amplifies biases. We propose Directional Predictability Amplification (DPA), a predictability-based metric that is (1) directional, (2) works with balanced and unbalanced datasets, and (3) correctly identifies positive and negative bias amplification. DPA eliminates the need to evaluate models on multiple metrics to verify these three aspects. DPA also improves over prior predictability-based metrics like LA: it is less sensitive to the choice of attacker function (a hyperparameter in predictability-based metrics), reports scores within a bounded range, and accounts for dataset bias by measuring relative changes in predictability. Our experiments on well-known datasets like COMPAS (a tabular dataset), COCO, and ImSitu (image datasets) show that DPA is the most reliable metric to measure bias amplification in classification problems. To compare DPA with existing bias amplification metrics, we released a one-stop library of major bias amplification metrics at https://github.com/kerner-lab/Bias-Amplification.
[170] Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
Yunshan Zhong, Yuyao Zhou, Yuxin Zhang, Wanchen Sui, Shen Li, Yong Li, Fei Chao, Rongrong Ji
Main category: cs.CV
TL;DR: SARDFQ is a data-free quantization method for Vision Transformers that addresses semantic distortion and inadequacy through attention priors alignment and multi-semantic reinforcement, achieving significant accuracy improvements over existing methods.
Details
Motivation: Existing data-free quantization methods for Vision Transformers suffer from semantic distortion (synthetic images deviate from real semantics) and semantic inadequacy (oversimplified textures and limited content), leading to suboptimal quantization performance.Method: SARDFQ uses three key components: 1) Attention Priors Alignment (APA) to optimize synthetic images with random structure attention priors, 2) Multi-Semantic Reinforcement (MSR) with localized patch optimization to enhance semantic richness, and 3) Soft-Label Learning (SL) to facilitate learning of multi-semantic images.
Result: SARDFQ significantly outperforms existing methods, improving top-1 accuracy on ImageNet by 15.52% for W4A4 ViT-B quantization. Extensive experiments demonstrate the method’s effectiveness.
Conclusion: SARDFQ effectively addresses the limitations of semantic distortion and inadequacy in data-free quantization for Vision Transformers, achieving state-of-the-art performance through semantics alignment and reinforcement techniques.
Abstract: Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. However, existing DFQ methods exhibit two limitations: (1) semantic distortion, where the semantics of synthetic images deviate substantially from those of real images, and (2) semantic inadequacy, where synthetic images contain extensive regions with limited content and oversimplified textures, leading to suboptimal quantization performance. To address these limitations, we propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs. To address semantic distortion, SARDFQ incorporates Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors. To mitigate semantic inadequacy, SARDFQ introduces Multi-Semantic Reinforcement (MSR), leveraging localized patch optimization to enhance semantic richness across synthetic images. Furthermore, SARDFQ employs Soft-Label Learning (SL), wherein multiple semantic targets are adapted to facilitate the learning of multi-semantic images augmented by MSR. Extensive experiments demonstrate the effectiveness of SARDFQ, significantly surpassing existing methods. For example, SARDFQ improves top-1 accuracy on ImageNet by 15.52% for W4A4 ViT-B. The code is at https://github.com/zysxmu/SARDFQ.
[171] Manifold Learning for Hyperspectral Images
Fethi Harkat, Guillaume Gey, Valérie Perrier, Kévin Polisano, Tiphaine Deuberet
Main category: cs.CV
TL;DR: Proposes using UMAP to construct adjacency graphs that capture nonlinear correlations in XRT multi-energy images, improving ML performance for hyperspectral image classification.
Details
Motivation: Traditional feature extraction methods like PCA struggle to represent XRT multi-energy images, limiting neural network performance in decision-making.Method: Approximates dataset topology by constructing adjacency graphs using Uniform Manifold Approximation and Projection (UMAP) to capture nonlinear correlations.
Result: Significantly improves machine learning algorithm performance, particularly for hyperspectral images from X-ray transmission spectroscopy, enhancing feature separability.
Conclusion: The UMAP-based approach preserves global data structure and leads to more accurate and robust classification results compared to traditional methods.
Abstract: Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.
[172] AMD-Hummingbird: Towards an Efficient Text-to-Video Model
Takashi Isobe, He Cui, Dong Zhou, Mengmeng Ge, Dong Li, Emad Barsoum
Main category: cs.CV
TL;DR: Hummingbird is a lightweight Text-to-Video framework that prunes U-Net models from 1.4B to 0.7B parameters, achieving 31X speedup over VideoCrafter2 while maintaining high visual quality through visual feedback learning and enhanced data processing.
Details
Motivation: Existing T2V models struggle to balance computational efficiency and visual quality, especially on resource-limited devices like iGPUs and mobile phones. Most prior work prioritizes visual fidelity over deployment efficiency.Method: Prunes existing models and enhances visual quality through visual feedback learning. Introduces a data processing pipeline using LLMs and VQA models to improve text prompts and video data quality. Supports user-driven training and style customization.
Result: Achieves 31X speedup compared to VideoCrafter2, attains highest overall score on VBench, supports generation of up to 26 frames, and requires only 4 GPUs for training while maintaining competitive performance.
Conclusion: Hummingbird provides a practical and efficient T2V solution combining high performance, scalability, and flexibility for real-world applications, addressing the efficiency-quality tradeoff in video generation.
Abstract: Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g.,iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
[173] Face Spoofing Detection using Deep Learning
Najeebullah, Maaz Salman, Zar Nawab Khan Swati
Main category: cs.CV
TL;DR: This study compares three vision models (MobileNetV2, ResNET50, ViT) for digital image spoof detection in facial recognition systems, finding MobileNetV2 outperforms others with 91.59% accuracy on test data and demonstrates superior generalization.
Details
Motivation: Digital image spoofing poses significant security threats to biometric authentication systems, particularly facial recognition, necessitating effective spoof detection methods to enhance system security.Method: Evaluated three vision models (MobileNetV2, ResNET50, Vision Transformer) using a dataset of 150,986 images split into training (140,002), testing (10,984), and validation (39,574) sets, comparing performance through accuracy, precision, recall, and F1 score metrics.
Result: MobileNetV2 achieved best performance: 91.59% accuracy, 91.72% precision, 91.59% recall, and 91.58% F1 score on test data, outperforming ViT (86.54% accuracy). On validation data, MobileNetV2 reached 97.17% accuracy vs ViT’s 96.36%. MobileNetV2 showed faster convergence and better generalization despite overfitting signs.
Conclusion: MobileNetV2 demonstrates balanced performance and robustness, making it the preferred choice for spoof detection applications requiring reliability on new data, highlighting its practical suitability for real-world deployment in security-sensitive contexts.
Abstract: Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision based models, MobileNetV2, ResNET50, and Vision Transformer, ViT, for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training , 140,002, testing, 10,984, and validation ,39,574, sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2, and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2 balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security sensitive contexts and suggests MobileNetV2 as a practical solution for real world deployment.
[174] D$^2$USt3R: Enhancing 3D Reconstruction for Dynamic Scenes
Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira, Junyoung Seo, Kazumi Fukuda, Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim
Main category: cs.CV
TL;DR: The paper proposes D²USt3R, a method that regresses Static-Dynamic Aligned Pointmaps (SDAP) to address 3D reconstruction in dynamic scenes, overcoming limitations of previous static-focused methods like DUSt3R.
Details
Motivation: Existing 3D pointmap regression methods like DUSt3R perform well in static scenes but struggle with dynamic motions that disrupt camera pose-based alignment, leading to degraded reconstruction quality.Method: Proposed D²USt3R directly regresses Static-Dynamic Aligned Pointmaps (SDAP) that simultaneously capture both static and dynamic 3D scene geometry by explicitly incorporating spatial and temporal aspects.
Result: Extensive experiments show the approach consistently achieves superior 3D reconstruction performance across various datasets with complex motions, enhancing downstream tasks through improved 3D dense correspondence.
Conclusion: The proposed D²USt3R method successfully addresses dynamic scene reconstruction by explicitly modeling both static and dynamic components, outperforming previous static-only approaches in challenging motion scenarios.
Abstract: In this work, we address the task of 3D reconstruction in dynamic scenes, where object motions frequently degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, that are originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose $D^2USt3R$ that directly regresses Static-Dynamic Aligned Pointmaps (SDAP) that simultaneiously capture both static and dynamic 3D scene geometry. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates 3D dense correspondence to the proposed pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior 3D reconstruction performance across various datasets featuring complex motions.
[175] NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, Michael Qizhe Shieh
Main category: cs.CV
TL;DR: NoisyRollout is a data augmentation method that mixes training trajectories from clean and distorted images to enhance policy exploration and improve robustness in vision-language models, achieving state-of-the-art performance without additional training costs.
Details
Motivation: Vision-language models struggle with imperfect visual perception and limited policy exploration during test-time compute scaling, which affects reasoning capabilities.Method: Inject perceptual diversity by mixing training trajectories from both clean and moderately distorted images, using a noise annealing schedule that gradually reduces distortion strength during training.
Result: Achieves state-of-the-art performance across 5 out-of-domain reasoning and perception benchmarks, validated across different model sizes, data scales, and image augmentation types.
Conclusion: NoisyRollout is an effective, easy-to-adopt method that enhances policy exploration and robustness in vision-language models without requiring additional training costs or modifications to RL objectives.
Abstract: Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to better scale test-time compute remains largely underexplored. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. We introduce NoisyRollout, a simple yet effective data augmentation method that addresses these issues by mixing training trajectories from both clean and moderately distorted images. This approach injects perceptual diversity, encouraging better policy exploration and leading to more robust reasoning. A noise annealing schedule gradually reduces distortion strength, aiding exploration early in training while ensuring later stability. Crucially, our method is easy-to-adopt–requiring no additional training cost and no modifications to the RL objective. Extensive experiments on 2 distinct training datasets demonstrate that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across 5 out-of-domain reasoning and perception benchmarks. Furthermore, we validate the effectiveness of NoisyRollout across model sizes (7B and 32B), data scales (from 1K to 6K) and image augmentation types (Gaussion noise and rotation), highlighting its generalizability and scalability.
[176] AVA: Towards Agentic Video Analytics with Vision Language Models
Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu
Main category: cs.CV
TL;DR: AVA is a VLM-powered system for open-ended video analytics that addresses context window limitations through Event Knowledge Graphs and agentic retrieval-generation, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing video analytics systems are limited to predefined tasks and struggle with ultra-long videos due to VLM context window constraints, hindering open-ended analytical scenarios.Method: AVA uses two key innovations: (1) near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long videos, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex queries.
Result: AVA achieves 62.3% accuracy on LVBench, 64.1% on VideoMME-Long, and 75.8% on the new AVA-100 benchmark, significantly surpassing existing VLM and video RAG systems.
Conclusion: AVA demonstrates superior performance for open-ended video analytics, particularly in ultra-long video scenarios, and introduces the AVA-100 benchmark to advance evaluation in this domain.
Abstract: AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively-significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
[177] Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
Tobias Jan Wieczorek, Nathalie Daun, Mohammad Emtiyaz Khan, Marcus Rohrbach
Main category: cs.CV
TL;DR: Variational Bayes method improves selective prediction in VQA tasks, enhancing model reliability and reducing hallucinations in vision language models.
Details
Motivation: Vision language models suffer from overconfidence and hallucinations, and Bayesian methods could improve reliability through selective prediction, but are often considered costly for large models.Method: Proposed ‘Variational VQA’ - an extension of variational methods for deep learning, using variational Bayes for selective prediction with a new risk-averse selector that considers prediction variance.
Result: Significant gains in selective prediction on VQA and Visual Reasoning, especially at low error tolerance (≤1%). One posterior sample often outperforms AdamW-trained models. Improved calibration and reliability.
Conclusion: Variational learning is a viable option to make large VLMs safer and more trustworthy, providing compelling evidence for its effectiveness in multimodal applications.
Abstract: Despite remarkable progress in recent years, vision language models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve reliability by helping models selectively predict, that is, models respond only when they are sufficiently confident. Unfortunately, Bayesian methods are often assumed to be costly and ineffective for large models, and so far there exists little evidence to show otherwise, especially for multimodal applications. Here, we show the effectiveness and competitive edge of variational Bayes for selective prediction in VQA for the first time. We build on recent advances in variational methods for deep learning and propose an extension called “Variational VQA”. This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the error tolerance is low ($\leq 1%$). Often, just one posterior sample can yield more reliable answers than those obtained by models trained with AdamW. In addition, we propose a new risk-averse selector that outperforms standard sample averaging by considering the variance of predictions. Overall, we present compelling evidence that variational learning is a viable option to make large VLMs safer and more trustworthy.
[178] Rethinking Metrics and Benchmarks of Video Anomaly Detection
Zihao Liu, Xiaoyu Wu, Wenna Li, Linlin Yang, Shengjin Wang
Main category: cs.CV
TL;DR: This paper identifies limitations in current Video Anomaly Detection (VAD) evaluation methods and proposes three novel evaluation approaches: Prob-AUC/AP metrics to address single annotation bias, Latency-aware AP to reward early detection, and hard normal benchmarks to evaluate scene overfitting.
Details
Motivation: Existing VAD research focuses heavily on model architectures and training strategies while neglecting evaluation metrics and benchmarks, leading to three critical limitations: single annotation bias, lack of early detection rewards, and inability to evaluate scene overfitting.Method: The authors propose three evaluation methods: 1) Probabilistic AUC/AP metrics using multi-round annotations, 2) Latency-aware Average Precision metric for early detection, and 3) Two hard normal benchmarks (UCF-HN, MSAD-HN) specifically designed to test scene overfitting.
Result: The paper reports performance comparisons of ten state-of-the-art VAD approaches using the proposed evaluation methods, providing new perspectives for future VAD model development.
Conclusion: The proposed evaluation framework addresses critical limitations in current VAD assessment practices and offers more comprehensive evaluation methods that better reflect real-world requirements for anomaly detection systems.
Abstract: Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation methods through comprehensive analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting of fully/weakly-supervised algorithms. To address these limitations, we propose three novel evaluation methods: first, we establish probabilistic AUC/AP (Prob-AUC/AP) metrics utlizing multi-round annotations to mitigate single annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development. We release our data and code in https://github.com/Kamino666/RethinkingVAD.
[179] StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool
Main category: cs.CV
TL;DR: StateSpaceDiffuser combines diffusion models with state-space models to enable long-term memory in world models, maintaining temporal coherence for significantly more steps than diffusion-only approaches.
Details
Motivation: Current diffusion-based world models lose long-term context and temporal coherence after a few steps due to lack of lasting environment state, causing generated scenes to drift from previous observations.Method: Integrates features from a state-space model (representing entire interaction history) with a diffusion model, restoring long-term memory while preserving high-fidelity synthesis capabilities.
Result: Significantly outperforms diffusion-only baseline, maintaining coherent visual context for an order of magnitude more steps in both 2D maze navigation and complex 3D environments.
Conclusion: Combining state-space representations with diffusion models is highly effective for achieving both visual detail and long-term memory in world models.
Abstract: World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state-of-the-art world models, which are diffusion-based, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model, representing the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model’s ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in demonstrating both visual details and long-term memory. Project page: https://insait-institute.github.io/StateSpaceDiffuser/.
[180] DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim
Main category: cs.CV
TL;DR: The paper proposes DeepVideo-R1, a video large language model trained with Reg-GRPO (a regression-based reformulation of Group Relative Policy Optimization) and difficulty-aware data augmentation to improve video reasoning capabilities.
Details
Motivation: While RL-based post-training methods like GRPO have shown success in enhancing reasoning capabilities of LLMs, their effectiveness in VideoLLMs remains understudied, with identified problems including reliance on safeguards and vanishing advantage.Method: Proposes Reg-GRPO which reformulates GRPO loss as a regression task to directly predict advantages, eliminating need for safeguards. Also introduces difficulty-aware data augmentation to locate sample difficulty at solvable levels for diverse reward signals.
Result: Experimental results show significant improvement in video reasoning performance across multiple benchmarks.
Conclusion: The proposed Reg-GRPO with difficulty-aware data augmentation effectively addresses limitations of standard GRPO in VideoLLMs, leading to enhanced video reasoning capabilities.
Abstract: Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has still been less studyed. In this paper, we explore GRPO and identify two problems that deteriorate the effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better ones. The difficulty-aware data augmentation strategy augments input prompts/videos to locate the difficulty of samples at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
[181] On the Theory of Conditional Feature Alignment for Unsupervised Domain-Adaptive Counting
Zhuonan Liang, Dongnan Liu, Jianan Fan, Yaxuan Song, Qiang Qu, Runnan Chen, Yu Yao, Peng Fu, Weidong Cai
Main category: cs.CV
TL;DR: Proposes conditional feature alignment framework for cross-domain object counting to handle density shifts, with theoretical analysis and empirical validation showing superior performance over existing methods.
Details
Motivation: Object counting models fail across domains with different density distributions because density shifts are task-relevant and violate standard domain adaptation assumptions.Method: Theoretical framework of conditional feature alignment that explicitly preserves density-related features across domains, formalizing conditional divergence by partitioning domains into subsets and measuring divergences per condition.
Result: Extensive experiments on multiple counting datasets show the method outperforms existing unsupervised domain adaptation approaches, empirically validating the theoretical insights.
Conclusion: Conditional feature alignment provides tighter error bounds and better cross-domain generalization for object counting compared to unconditional alignment, effectively handling density distribution shifts.
Abstract: Object counting models suffer when deployed across domains with differing density variety, since density shifts are inherently task-relevant and violate standard domain adaptation assumptions. To address this, we propose a theoretical framework of conditional feature alignment and provide a straightforward implementation. By theoretical analysis, our framework is feasible to achieve superior cross-domain generalization for counting. In the presented network, the features related to density are explicitly preserved across domains. Theoretically, we formalize the notion of conditional divergence by partitioning each domain into subsets and measuring divergences per condition. We then derive a joint error bound showing that, under discrete label spaces treated as condition sets, aligning distributions conditionally leads to tighter bounds on the combined source-target decision error than unconditional alignment. Empirically, we demonstrate the effectiveness of our approach through extensive experiments on multiple counting datasets with varying density distributions. The results show that our method outperforms existing unsupervised domain adaptation methods, empirically validating the theoretical insights on conditional feature alignment.
[182] SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding
Zihao Sheng, Zilin Huang, Yansong Qu, Jiancong Chen, Yuhao Luo, Yen-Jung Chen, Yue Leng, Sikai Chen
Main category: cs.CV
TL;DR: SafePLUG is a multimodal large language model framework that enables pixel-level understanding and temporal grounding for comprehensive traffic accident analysis, addressing limitations of existing MLLMs in handling fine-grained visual details and localized scene components.
Details
Motivation: Existing MLLMs for traffic accident understanding primarily focus on coarse-grained image/video-level comprehension and struggle with fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios.Method: Proposed SafePLUG framework that supports arbitrary-shaped visual prompts for region-aware question answering, pixel-level segmentation based on language instructions, and recognition of temporally anchored events in traffic accident scenarios. Also curated a new multimodal dataset with detailed pixel-level annotations and temporal event boundaries.
Result: SafePLUG achieves strong performance on multiple tasks including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding.
Conclusion: The framework lays a foundation for fine-grained understanding of complex traffic scenes, with potential to improve driving safety and enhance situational awareness in smart transportation systems.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG
[183] $\mathtt{M^3VIR}$: A Large-Scale Multi-Modality Multi-View Synthesized Benchmark Dataset for Image Restoration and Content Creation
Yuanzhi Li, Lebin Zhou, Nam Ling, Zhenghao Chen, Wei Wang, Wei Jiang
Main category: cs.CV
TL;DR: M3VIR is a large-scale multi-modal, multi-view dataset for gaming content that addresses limitations of existing datasets by providing authentic ground-truth LR-HR paired frames and multi-view frames across 80 scenes in 8 categories, enabling research on super-resolution, novel view synthesis, and controlled video generation.
Details
Motivation: Existing datasets for gaming and entertainment AI are limited to specific domains or use artificial degradations that don't accurately capture gaming content characteristics. There are no benchmarks for controllable video generation, hindering progress in AI-powered gaming experiences.Method: Created M3VIR dataset using Unreal Engine 5 to render diverse, high-fidelity gaming content. Includes two subsets: M3VIR_MR for super-resolution, novel view synthesis, and combined tasks; and M3VIR_MS as the first multi-style, object-level ground-truth set for controlled video generation research.
Result: The dataset provides authentic ground-truth LR-HR paired and multi-view frames across 80 scenes in 8 categories. Benchmarking of state-of-the-art SR and NVS methods established performance baselines. The dataset enables research in areas where no existing approaches currently exist.
Conclusion: M3VIR addresses critical gaps in gaming AI research by providing the first comprehensive dataset for authentic gaming content, establishing benchmarks for SR and NVS tasks, and enabling future research in controllable video generation for next-generation cloud gaming and entertainment.
Abstract: The gaming and entertainment industry is rapidly evolving, driven by immersive experiences and the integration of generative AI (GAI) technologies. Training such models effectively requires large-scale datasets that capture the diversity and context of gaming environments. However, existing datasets are often limited to specific domains or rely on artificial degradations, which do not accurately capture the unique characteristics of gaming content. Moreover, benchmarks for controllable video generation remain absent. To address these limitations, we introduce $\mathtt{M^3VIR}$, a large-scale, multi-modal, multi-view dataset specifically designed to overcome the shortcomings of current resources. Unlike existing datasets, $\mathtt{M^3VIR}$ provides diverse, high-fidelity gaming content rendered with Unreal Engine 5, offering authentic ground-truth LR-HR paired and multi-view frames across 80 scenes in 8 categories. It includes $\mathtt{M^3VIR_MR}$ for super-resolution (SR), novel view synthesis (NVS), and combined NVS+SR tasks, and $\mathtt{M^3VIR_{MS}}$, the first multi-style, object-level ground-truth set enabling research on controlled video generation. Additionally, we benchmark several state-of-the-art SR and NVS methods to establish performance baselines. While no existing approaches directly handle controlled video generation, $\mathtt{M^3VIR}$ provides a benchmark for advancing this area. By releasing the dataset, we aim to facilitate research in AI-powered restoration, compression, and controllable content generation for next-generation cloud gaming and entertainment.
[184] FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
Yixiang Dai, Fan Jiang, Chiyu Wang, Mu Xu, Yonggang Qi
Main category: cs.CV
TL;DR: FantasyWorld is a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch to enable joint modeling of video latents and implicit 3D fields, improving spatial consistency and 3D reasoning capabilities.
Details
Motivation: Current video foundation models lack explicit 3D grounding capabilities, limiting their spatial consistency and utility for downstream 3D reasoning tasks, despite having strong imaginative priors.Method: Introduces a framework with cross-branch supervision where geometry cues guide video generation and video priors regularize 3D prediction, enabling joint modeling of video latents and implicit 3D fields in a single forward pass.
Result: Outperforms recent geometry-consistent baselines in multi-view coherence and style consistency, and the geometric branch latents can serve as versatile representations for downstream 3D tasks without per-scene optimization.
Conclusion: FantasyWorld effectively bridges video imagination and 3D perception through unified backbone and cross-branch information exchange, enabling consistent and generalizable 3D-aware video representations.
Abstract: High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite the established strong imaginative priors, current video foundation models lack explicit 3D grounding capabilities, thus being limited in both spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.
[185] BALR-SAM: Boundary-Aware Low-Rank Adaptation of SAM for Resource-Efficient Medical Image Segmentation
Zelin Liu, Sicheng Dong, Bocheng Li, Yixuan Yang, Jiacheng Ruan, Chenxu Zhou, Suncheng Xiang
Main category: cs.CV
TL;DR: BALR-SAM is a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging by combining complementary detail enhancement, low-rank adapters, and efficient attention mechanisms, achieving state-of-the-art performance while updating only 1.8% of parameters.
Details
Motivation: Vision foundation models like SAM struggle with medical image segmentation due to lack of domain-specific adaptation, and there's a need for efficient fine-tuning with minimal resource demands while maintaining strong performance in clinical practice.Method: Combines three components: (1) Complementary Detail Enhancement Network with depthwise separable convolutions for boundary-sensitive features, (2) low-rank adapters in Vision Transformer blocks to optimize medical feature representation, and (3) low-rank tensor attention mechanism in mask decoder to reduce memory usage.
Result: Outperforms several state-of-the-art methods including fully fine-tuned MedSAM on standard medical segmentation datasets, while updating only 1.8% (11.7M) of parameters, cutting memory usage by 75% and boosting inference speed.
Conclusion: BALR-SAM provides an efficient and effective solution for adapting vision foundation models to medical imaging tasks, achieving superior performance with minimal parameter updates and significant resource savings.
Abstract: Vision foundation models like the Segment Anything Model (SAM), pretrained on large-scale natural image datasets, often struggle in medical image segmentation due to a lack of domain-specific adaptation. In clinical practice, fine-tuning such models efficiently for medical downstream tasks with minimal resource demands, while maintaining strong performance, is challenging. To address these issues, we propose BALR-SAM, a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging. It combines three tailored components: (1) a Complementary Detail Enhancement Network (CDEN) using depthwise separable convolutions and multi-scale fusion to capture boundary-sensitive features essential for accurate segmentation; (2) low-rank adapters integrated into SAM’s Vision Transformer blocks to optimize feature representation and attention for medical contexts, while simultaneously significantly reducing the parameter space; and (3) a low-rank tensor attention mechanism in the mask decoder, cutting memory usage by 75% and boosting inference speed. Experiments on standard medical segmentation datasets show that BALR-SAM, without requiring prompts, outperforms several state-of-the-art (SOTA) methods, including fully fine-tuned MedSAM, while updating just 1.8% (11.7M) of its parameters.
[186] Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos
Xuankai Zhang, Junjin Xiao, Qing Zhang
Main category: cs.CV
TL;DR: A unified framework for high-quality dynamic Gaussian Splatting from defocused and motion-blurred monocular videos, using blur prediction network and dynamic Gaussian densification.
Details
Motivation: Existing methods are tailored for either defocus blur or motion blur, lacking ability to handle both simultaneously. Joint modeling as blur kernel convolution is limited by difficulty in estimating accurate blur kernels.Method: Proposes per-pixel blur kernel estimation using blur prediction network with blur-aware sparsity constraint, dynamic Gaussian densification for incomplete regions, and incorporation of unseen view information for scene optimization.
Result: Outperforms state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos.
Conclusion: The method successfully addresses the challenge of handling both defocus and motion blur in dynamic Gaussian Splatting through reliable blur kernel estimation and optimization strategies.
Abstract: This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Due to the significant difference between the formation processes of defocus blur and motion blur, existing methods are tailored for either one of them, lacking the ability to simultaneously deal with both of them. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty in estimating accurate blur kernels greatly limits the progress in this direction. In this work, we go a step further towards this direction. Particularly, we propose to estimate per-pixel reliable blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. Besides, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians for incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms the state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos. Our code is available at https://github.com/hhhddddddd/dydeblur.
[187] Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
Jian Lan, Zhicheng Liu, Udo Schlegel, Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich, Thomas Seidl
Main category: cs.CV
TL;DR: HaDola is a human uncertainty-aware framework that improves VLM training by selectively using high-quality samples and automatic labeling, reducing reliance on costly human annotations while achieving better performance and calibration.
Details
Motivation: Standard supervised fine-tuning ignores human uncertainty in annotations, leading to under-calibrated models and inefficient training. High-uncertainty samples can degrade performance, but current methods simply use the most frequent label.Method: HaDola uses a four-stage framework: discriminate (identify harmful samples), self-annotate (automatic labeling), error trigger (prioritize informative samples), and training. It bootstraps from a small seed set (5% of data) to iteratively improve model performance.
Result: Extensive experiments on VQAv2 and VizWiz show HaDola matches or outperforms state-of-the-art baselines with less training data. Models become more accurate and better calibrated to human uncertainty distributions.
Conclusion: Explicitly modeling human uncertainty in fine-tuning is more effective than scaling dataset size. Better utilization of human uncertainty annotations leads to more efficient and calibrated VLM training.
Abstract: Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) – variation in human confidence across annotations – but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages – discriminate, self-annotate, error trigger, and training – to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.
[188] CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams
Junhao Zhao, Zishuai Liu, Ruili Fang, Jin Lu, Linghan Zhang, Fei Dou
Main category: cs.CV
TL;DR: CARE is a novel framework for ADL recognition that uses contrastive alignment between sequence and image representations to overcome limitations of existing methods, achieving state-of-the-art performance on CASAS datasets.
Details
Motivation: Existing ADL recognition methods have limitations: sequence-based approaches are sensitive to noise and lack spatial awareness, while image-based approaches lose fine-grained temporal dynamics. Naive fusion methods fail to properly align these complementary representations.Method: Proposes CARE framework with Sequence-Image Contrastive Alignment (SICA) that jointly optimizes representation learning and classification. Integrates time-aware sequence encoding with spatially-informed image representations using a joint contrastive-classification objective.
Result: Achieves state-of-the-art performance on three CASAS datasets: 89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7. Demonstrates robustness to sensor malfunctions and layout variability.
Conclusion: CARE effectively addresses representation-level limitations in ADL recognition by enforcing alignment between sequence and image views, showing potential for reliable ADL recognition in smart home environments.
Abstract: The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fail to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.
[189] ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Sheida Rahnamai Kordasiabi, Damian Dalle Nogare, Florian Jug
Main category: cs.CV
TL;DR: ε-Seg is a hierarchical variational autoencoder method for semantic segmentation of EM images that works with very sparse labels (0.05% or less), using center-region masking, contrastive learning, and direct MLP prediction instead of clustering.
Details
Motivation: Semantic segmentation of EM images is challenging due to biological complexity, and existing methods struggle with sparse labeling requirements.Method: Uses hierarchical VAEs with center-region masking, sparse label contrastive learning, GMM prior, and MLP segmentation head instead of clustering for direct label prediction.
Result: Achieves competitive sparsely-supervised segmentation on complex biological image data with limited training labels, demonstrated on EM and fluorescence microscopy datasets.
Conclusion: ε-Seg provides effective semantic segmentation for complex biological images even with extremely sparse labeling, outperforming baseline methods.
Abstract: Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
[190] How Should One Evaluate Monocular Depth Estimation?
Siyang Wu, Jack Nugent, Willow Yang, Jia Deng
Main category: cs.CV
TL;DR: The paper analyzes evaluation metrics for monocular depth estimation, revealing their limitations and proposing new metrics and visualization tools for better human alignment.
Details
Motivation: There is a lack of standardization in evaluating monocular depth estimation, with many metrics available but their trade-offs and behaviors not well understood, especially in comparison to human judgment.Method: Conducted quantitative analysis of existing metrics’ sensitivity to ground truth perturbations, introduced a new metric based on relative surface normals, developed new depth visualization tools, and created a principled method for composite metrics.
Result: Analysis revealed existing metrics are severely under-sensitive to curvature perturbations (like making flat surfaces wavy). The proposed new metric and composite metrics show better alignment with human judgment.
Conclusion: The paper provides improved evaluation methodology for monocular depth estimation with new metrics and tools that better capture human perceptual judgments, addressing limitations in current evaluation practices.
Abstract: Monocular depth estimation is an important task with rapid progress, but how to evaluate it remains an open question, as evidenced by a lack of standardization in existing literature and a large selection of evaluation metrics whose trade-offs and behaviors are not well understood. This paper contributes a novel, quantitative analysis of existing metrics in terms of their sensitivity to various types of perturbations of ground truth, emphasizing comparison to human judgment. Our analysis reveals that existing metrics are severely under-sensitive to curvature perturbation such as making flat surfaces wavy. To remedy this, we introduce a new metric based on relative surface normals, along with new depth visualization tools and a principled method to create composite metrics with better human alignment. Code and data are available at: https://github.com/princeton-vl/evalmde.
[191] IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, Ziwei Liu
Main category: cs.CV
TL;DR: IGGT is an end-to-end transformer that unifies 3D spatial reconstruction and instance-level understanding through 3D-consistent contrastive learning, using only 2D visual inputs to create coherent 3D scenes with distinct object instances.
Details
Motivation: Prior approaches treat 3D reconstruction and spatial understanding separately, limiting generalization and downstream task performance. Simple alignment methods restrict perception to aligned models' capacity.Method: Propose InstanceGrounded Geometry Transformer (IGGT) with 3D-Consistent Contrastive Learning strategy to encode unified representations from 2D inputs, enabling consistent lifting to 3D scenes with distinct instances. Create InsScene-15K dataset with comprehensive annotations.
Result: The method supports consistent lifting of 2D visual inputs into coherent 3D scenes with explicitly distinct object instances.
Conclusion: IGGT successfully unifies geometric structure and semantic understanding for improved 3D scene analysis, addressing limitations of prior isolated approaches.
Abstract: Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model’s capacity and limiting adaptability to downstream tasks. In this paper, we propose InstanceGrounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
[192] Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling
Ruoyu Wang, Beier Zhu, Junzhi Li, Liangyu Yuan, Chi Zhang
Main category: cs.CV
TL;DR: AdaSDE is a novel single-step SDE solver that dynamically regulates error correction strength to accelerate diffusion sampling, achieving state-of-the-art performance with minimal function evaluations.
Details
Motivation: To address the complementary weaknesses of ODE solvers (irreducible gradient error) and SDE solvers (amplified discretization errors) in diffusion-based generative processes, aiming to unify ODE efficiency with SDE error resilience.Method: Introduces AdaSDE with a single per-step learnable coefficient estimated via lightweight distillation, which dynamically regulates error correction strength. The framework can be integrated with existing solvers.
Result: Achieves state-of-the-art performance: FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ, and 6.96 on LSUN Bedroom at only 5 NFE (number of function evaluations).
Conclusion: AdaSDE successfully bridges the gap between ODE and SDE methods, providing an efficient and robust solution for accelerated diffusion sampling with superior sample quality.
Abstract: Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom. Codes are available in https://github.com/WLU-wry02/AdaSDE.
[193] DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
Malaisree P, Youwai S, Kitkobsin T, Janrungautai S, Amorndechaphon D, Rojanavasu P
Main category: cs.CV
TL;DR: DINO-YOLO combines YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient object detection in civil engineering, achieving significant performance improvements while maintaining real-time inference.
Details
Motivation: Object detection in civil engineering applications faces challenges due to limited annotated data in specialized domains, requiring data-efficient solutions.Method: Hybrid architecture integrating YOLOv12 with DINOv3 self-supervised vision transformers, with strategic feature integration at input preprocessing (P0) and mid-backbone enhancement (P3).
Result: Substantial improvements across datasets: Tunnel Segment Crack detection (12.4% improvement), Construction PPE (13.7% gain), KITTI (88.6% improvement), with real-time inference (30-47 FPS) and 2-4x inference overhead.
Conclusion: DINO-YOLO establishes state-of-the-art performance for civil engineering datasets with limited data while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection.
Abstract: Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
[194] LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Main category: cs.CV
TL;DR: LangHOPS is the first MLLM-based framework for open-vocabulary object-part instance segmentation that grounds object-part hierarchies in language space rather than visual grouping.
Details
Motivation: To address the limitations of prior approaches that rely on heuristic or learnable visual grouping for object-part segmentation, by leveraging MLLM's rich knowledge and reasoning capabilities.Method: Integrates MLLM into object-part parsing pipeline to ground hierarchies in language space and link multi-granularity concepts, using language-grounded hierarchy and MLLM-driven part query refinement.
Result: Achieves state-of-the-art results: 5.5% AP improvement (in-domain) and 4.8% AP (cross-dataset) on PartImageNet, and 2.5% mIOU improvement on unseen object parts in ADE20K (zero-shot).
Conclusion: The framework effectively leverages MLLM capabilities for hierarchical object-part segmentation, with ablation studies validating the language-grounded hierarchy and part query refinement strategy.
Abstract: We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.
[195] MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Runxi Huang, Mingxuan Yu, Mingyu Tsoi, Xiaomin Ouyang
Main category: cs.CV
TL;DR: MMEdge is a real-time multimodal inference framework for edge devices that uses pipelined sensing and encoding to reduce latency while maintaining accuracy through temporal aggregation and adaptive optimization.
Details
Motivation: Real-time multimodal inference on resource-constrained edge devices is essential for applications like autonomous driving and human-computer interaction, but prior work overlooks the coupling between sensing dynamics and model execution, as well as inter-modality dependencies.Method: MMEdge decomposes inference into fine-grained sensing and encoding units for incremental computation, uses temporal aggregation to capture dynamics, includes adaptive configuration optimizer for optimal settings under latency constraints, and employs cross-modal speculative skipping to bypass slower modalities when early predictions are confident.
Result: Evaluation on two public multimodal datasets and deployment on a UAV testbed shows MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
Conclusion: MMEdge provides an effective solution for real-time multimodal inference on edge devices through its pipelined design, temporal aggregation, and adaptive optimization mechanisms.
Abstract: Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
[196] Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement
Xinhua Wang, Caibo Feng, Xiangjun Fu, Chunxiao Liu
Main category: cs.CV
TL;DR: Enhanced Mamba framework with Hilbert Selective Scan mechanism increases Hausdorff dimension for better feature space exploration, improving low-light image enhancement performance while reducing computational costs.
Details
Motivation: To address information inconsistencies and improve spatial locality in Mamba-based methods while maintaining long-range dependency handling capabilities.Method: Proposed Hilbert Selective Scan mechanism that increases the Hausdorff dimension of Mamba’s scanning pattern for more effective feature space exploration.
Result: Significantly improved quantitative metrics and qualitative visual fidelity on low-light image enhancement benchmarks, with reduced computational resource consumption and shorter inference time.
Conclusion: The refined strategy advances state-of-the-art in low-light image enhancement and shows promise for broader applications in Mamba-based techniques.
Abstract: We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model’s ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.
cs.AI
[197] CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions
Lingyue Fu, Xin Ding, Yaoming Zhu, Shao Zhang, Lin Qiu, Weiwen Liu, Weinan Zhang, Xuezhi Cao, Xunliang Cai, Jiaxin Ding, Yong Yu
Main category: cs.AI
TL;DR: CATArena is a tournament-style evaluation platform using board/card games to assess LLM agents’ learning abilities through competitive peer-learning, addressing score saturation in current benchmarks.
Details
Motivation: Current benchmarks for LLM agents focus on end-to-end performance in fixed scenarios, suffering from score saturation and limited skill assessment as agent capabilities improve.Method: Proposed iterative competitive peer-learning framework where agents refine strategies through repeated interactions and feedback, implemented via CATArena platform with four diverse board/card games featuring open-ended scoring.
Result: Experimental results show CATArena provides reliable, stable, and scalable benchmarking for core agent abilities, particularly learning ability and strategy coding, with both minimal and commercial code agents.
Conclusion: CATArena enables continuous dynamic evaluation of rapidly advancing agent capabilities by addressing score saturation through open-ended tasks without explicit upper limits.
Abstract: Large Language Model (LLM) agents have evolved from basic text generation to autonomously completing complex tasks through interaction with external tools. However, current benchmarks mainly assess end-to-end performance in fixed scenarios, restricting evaluation to specific skills and suffering from score saturation and growing dependence on expert annotation as agent capabilities improve. In this work, we emphasize the importance of learning ability, including both self-improvement and peer-learning, as a core driver for agent evolution toward human-level intelligence. We propose an iterative, competitive peer-learning framework, which allows agents to refine and optimize their strategies through repeated interactions and feedback, thereby systematically evaluating their learning capabilities. To address the score saturation issue in current benchmarks, we introduce CATArena, a tournament-style evaluation platform featuring four diverse board and card games with open-ended scoring. By providing tasks without explicit upper score limits, CATArena enables continuous and dynamic evaluation of rapidly advancing agent capabilities. Experimental results and analyses involving both minimal and commercial code agents demonstrate that CATArena provides reliable, stable, and scalable benchmarking for core agent abilities, particularly learning ability and strategy coding.
[198] Inverse Knowledge Search over Verifiable Reasoning: Synthesizing a Scientific Encyclopedia from a Long Chains-of-Thought Knowledge Base
Yu Li, Yuan Huang, Tao Wang, Caiyu Fan, Xiansheng Cai, Sihan Hu, Xinzijian Liu, Cheng Shi, Mingjun Xu, Zhen Wang, Yan Wang, Xiangqi Jin, Tianhan Zhang, Linfeng Zhang, Lei Wang, Youjin Deng, Pan Zhang, Weijie Sun, Xingyu Li, Weinan E, Linfeng Zhang, Zhiyuan Yao, Kun Chen
Main category: cs.AI
TL;DR: A framework that decompresses scientific reasoning by creating a verifiable Long Chain-of-Thought knowledge base, which powers SciencePedia - an emergent encyclopedia with 200,000 entries across multiple disciplines.
Details
Motivation: Scientific materials compress reasoning by presenting conclusions while omitting derivational chains, which hinders verification and inhibits cross-domain links by collapsing logical and causal connections between concepts.Method: Uses a Socratic agent guided by 200 courses to generate 3 million first-principles questions. Multiple solver models generate LCoTs, filtered by prompt sanitization and cross-model consensus. The Brainstorm Search Engine retrieves derivations, feeding the Plato synthesizer that narrates chains into coherent articles.
Result: SciencePedia comprises 200,000 fine-grained entries across mathematics, physics, chemistry, biology, engineering, and computation. Plato-synthesized articles show higher knowledge-point density and lower factual error rates than baseline without retrieval.
Conclusion: This reasoning-centric approach enables trustworthy, cross-domain scientific synthesis at scale and establishes the foundation for an ever-expanding encyclopedia built on verifiable LCoT knowledge base.
Abstract: Most scientific materials compress reasoning, presenting conclusions while omitting the derivational chains that justify them. This compression hinders verification by lacking explicit, step-wise justifications and inhibits cross-domain links by collapsing the very pathways that establish the logical and causal connections between concepts. We introduce a scalable framework that decompresses scientific reasoning, constructing a verifiable Long Chain-of-Thought (LCoT) knowledge base and projecting it into an emergent encyclopedia, SciencePedia. Our pipeline operationalizes an endpoint-driven, reductionist strategy: a Socratic agent, guided by a curriculum of around 200 courses, generates approximately 3 million first-principles questions. To ensure high fidelity, multiple independent solver models generate LCoTs, which are then rigorously filtered by prompt sanitization and cross-model answer consensus, retaining only those with verifiable endpoints. This verified corpus powers the Brainstorm Search Engine, which performs inverse knowledge search – retrieving diverse, first-principles derivations that culminate in a target concept. This engine, in turn, feeds the Plato synthesizer, which narrates these verified chains into coherent articles. The initial SciencePedia comprises approximately 200,000 fine-grained entries spanning mathematics, physics, chemistry, biology, engineering, and computation. In evaluations across six disciplines, Plato-synthesized articles (conditioned on retrieved LCoTs) exhibit substantially higher knowledge-point density and significantly lower factual error rates than an equally-prompted baseline without retrieval (as judged by an external LLM). Built on this verifiable LCoT knowledge base, this reasoning-centric approach enables trustworthy, cross-domain scientific synthesis at scale and establishes the foundation for an ever-expanding encyclopedia.
[199] LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources
Haichao Ji, Zibo Wang, Cheng Pan, Meng Han, Yifei Zhu, Dan Wang, Zhu Han
Main category: cs.AI
TL;DR: LAFA is the first system that integrates LLM-agent-based data analytics with federated analytics (FA), enabling privacy-preserving computation with natural language queries.
Details
Motivation: Existing LLM-agent analytics frameworks lack privacy protection by assuming centralized data access, while federated analytics enables privacy but lacks natural language support.Method: LAFA uses a hierarchical multi-agent architecture with coarse-grained query decomposition, fine-grained mapping to FA operation DAGs, and optimization to eliminate redundant operations.
Result: LAFA outperforms baseline prompting strategies with higher execution plan success rates and significantly reduces resource-intensive FA operations.
Conclusion: LAFA establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in federated analytics settings.
Abstract: Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAFA, the first system that integrates LLM-agent-based data analytics with FA. LAFA introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. A coarse-grained planner first decomposes complex queries into sub-queries, while a fine-grained planner maps each subquery into a Directed Acyclic Graph of FA operations using prior structural knowledge. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communicational overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the FA setting.
[200] The Denario project: Deep knowledge AI agents for scientific discovery
Francisco Villaescusa-Navarro, Boris Bolliet, Pablo Villanueva-Domingo, Adrian E. Bayer, Aidan Acquah, Chetana Amancharla, Almog Barzilay-Siegal, Pablo Bermejo, Camille Bilodeau, Pablo Cárdenas Ramírez, Miles Cranmer, Urbano L. França, ChangHoon Hahn, Yan-Fei Jiang, Raul Jimenez, Jun-Young Lee, Antonio Lerario, Osman Mamun, Thomas Meier, Anupam A. Ojha, Pavlos Protopapas, Shimanto Roy, David N. Spergel, Pedro Tarancón-Álvarez, Ujjwal Tiwari, Matteo Viel, Digvijay Wadekar, Chi Wang, Bonny Y. Wang, Licong Xu, Yossi Yovel, Shuwen Yue, Wen-Han Zhou, Qiyao Zhu, Jiajun Zou, Íñigo Zubeldia
Main category: cs.AI
TL;DR: Denario is an AI multi-agent system that serves as a scientific research assistant, capable of performing various research tasks from idea generation to paper writing across multiple scientific disciplines.
Details
Motivation: To create an AI assistant that can automate and assist in scientific research processes, enabling comprehensive end-to-end scientific analysis and interdisciplinary research.Method: Uses a modular multi-agent architecture with Cmbagent as a deep-research backend, allowing it to handle specific tasks or complete end-to-end scientific workflows.
Result: Successfully generated multiple AI-written papers across various scientific disciplines, with evaluations from domain experts showing both strengths and limitations of the system.
Conclusion: Denario demonstrates promising capabilities for AI-driven research but has current limitations; the work also discusses ethical implications and philosophical considerations of AI in science.
Abstract: We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe in detail Denario and its modules, and illustrate its capabilities by presenting multiple AI-generated papers generated by it in many different scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at https://github.com/AstroPilot-AI/Denario. A Denario demo can also be run directly on the web at https://huggingface.co/spaces/astropilot-ai/Denario, and the full app will be deployed on the cloud.
[201] Cognition Envelopes for Bounded AI Reasoning in Autonomous UAS Operations
Pedro Antonio Alarcón Granadeno, Arturo Miguel Bernal Russell, Sofia Nelson, Demetrius Hernandez, Maureen Petterson, Michael Murphy, Walter J. Scheirer, Jane Cleland-Huang
Main category: cs.AI
TL;DR: Introduces Cognition Envelopes to constrain AI decisions in cyber-physical systems using foundational models, addressing errors like hallucinations and overgeneralizations.
Details
Motivation: Foundational Models (LLMs/VLMs) in cyber-physical systems introduce new error types like hallucinations and context misalignments, leading to flawed decisions that need containment.Method: Proposes Cognition Envelopes as reasoning boundaries to constrain AI-generated decisions, complementing meta-cognition and traditional safety envelopes.
Result: Conceptual framework established for Cognition Envelopes, highlighting the need for systematic processes for their definition, validation, and assurance.
Conclusion: Cognition Envelopes provide a necessary approach to manage risks from foundational models in cyber-physical systems, requiring structured guidelines similar to safety envelopes.
Abstract: Cyber-physical systems increasingly rely on Foundational Models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, overgeneralizations, and context misalignments, resulting in incorrect and flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance.
[202] SUSTAINABLE Platform: Seamless Smart Farming Integration Towards Agronomy Automation
Agorakis Bompotas, Konstantinos Koutras, Nikitas Rigas Kalogeropoulos, Panagiotis Kechagias, Dimitra Gariza, Athanasios P. Kalogeras, Christos Alexakos
Main category: cs.AI
TL;DR: SUSTAINABLE is a smart farming platform that integrates IoT, AI, satellite imaging, and role-based task orchestration to enable efficient, traceable, and sustainable agriculture, with a pilot use case in viticulture.
Details
Motivation: The global agricultural sector is undergoing a transformative shift driven by increasing food demands, climate variability, and the need for sustainable practices.Method: The platform integrates IoT, AI, satellite imaging, and role-based task orchestration, with features including satellite index integration, real-time environmental data, and role-aware task management tailored to Mediterranean vineyards.
Result: The paper presents a comparative evaluation of current smart agriculture solutions and introduces SUSTAINABLE’s key features.
Conclusion: SUSTAINABLE enables efficient, traceable, and sustainable agriculture through its integrated smart farming approach.
Abstract: The global agricultural sector is undergoing a transformative shift, driven by increasing food demands, climate variability and the need for sustainable practices. SUSTAINABLE is a smart farming platform designed to integrate IoT, AI, satellite imaging, and role-based task orchestration to enable efficient, traceable, and sustainable agriculture with a pilot usecase in viticulture. This paper explores current smart agriculture solutions, presents a comparative evaluation, and introduces SUSTAINABLE’s key features, including satellite index integration, real-time environmental data, and role-aware task management tailored to Mediterranean vineyards.
[203] Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models
Jared Junkin, Samuel Nathanson
Main category: cs.AI
TL;DR: Causal masking on spatial chess board data produces stronger playing models than sequential move-based training, challenging traditional assumptions about causal masking’s appropriateness for nonsequential data.
Details
Motivation: To investigate whether causal masking can be effectively applied to spatial data (chess boards) despite traditional views that it's inappropriate for nonsequential domains, and compare performance against sequential representations.Method: Train language models with bidirectional and causal self-attention on both spatial (board-based) and sequential (move-based) chess data, comparing their playing strength.
Result: Models trained on spatial board states with causal masking consistently achieve stronger playing strength than models trained on sequential data.
Conclusion: Applying causal masking to spatial data is viable for training unimodal LLMs and can be preferable to sequentialization in some domains like chess.
Abstract: Language models are traditionally designed around causal masking. In domains with spatial or relational structure, causal masking is often viewed as inappropriate, and sequential linearizations are instead used. Yet the question of whether it is viable to accept the information loss introduced by causal masking on nonsequential data has received little direct study, in part because few domains offer both spatial and sequential representations of the same dataset. In this work, we investigate this issue in the domain of chess, which naturally supports both representations. We train language models with bidirectional and causal self-attention mechanisms on both spatial (board-based) and sequential (move-based) data. Our results show that models trained on spatial board states - \textit{even with causal masking} - consistently achieve stronger playing strength than models trained on sequential data. While our experiments are conducted on chess, our results are methodological and may have broader implications: applying causal masking to spatial data is a viable procedure for training unimodal LLMs on spatial data, and in some domains is even preferable to sequentialization.
[204] e1: Learning Adaptive Control of Reasoning Effort
Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, Stefano Soatto
Main category: cs.AI
TL;DR: Adaptive Effort Control enables users to control AI reasoning effort via a continuous parameter, allowing dynamic trade-off between accuracy and cost without requiring prior knowledge of problem difficulty.
Details
Motivation: Users need fine-grained control over reasoning effort to balance output quality versus latency/cost, but existing methods require specifying absolute token counts which demands knowing problem difficulty beforehand.Method: A self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query.
Result: Across model scales (1.5B to 32B parameters), the approach enables ~3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.
Conclusion: The method eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves, with models automatically learning to allocate resources proportionally to task difficulty.
Abstract: Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. To leverage this tradeoff effectively, users need fine-grained control over the amount of thinking used for a particular query, but few approaches enable such control. Existing methods require users to specify the absolute number of desired tokens, but this requires knowing the difficulty of the problem beforehand to appropriately set the token budget for a query. To address these issues, we propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query. This approach eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves compared to standard methods. Users can dynamically adjust the cost-accuracy trade-off through a continuous effort parameter specified at inference time. We observe that the model automatically learns to allocate resources proportionally to the task difficulty and, across model scales ranging from 1.5B to 32B parameters, our approach enables approximately 3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.
[205] Adaptive Data Flywheel: Applying MAPE Control Loops to AI Agent Improvement
Aaditya Shukla, Sidney Knowles, Meenakshi Madugula, Dave Farris, Ryan Angilly, Santiago Pombo, Anbang Xu, Lu An, Abhinav Balasubramanian, Tan Yu, Jiaxiang Ren, Rama Akkiraju
Main category: cs.AI
TL;DR: NVIDIA implemented a MAPE-driven data flywheel in their NVInfo AI enterprise assistant, using human feedback to identify and fix RAG pipeline failures through targeted fine-tuning, achieving significant improvements in accuracy and latency.
Details
Motivation: Enterprise AI agents need continuous adaptation to maintain accuracy, reduce latency, and stay aligned with user needs in production environments.Method: Implemented a MAPE-driven data flywheel with closed-loop system that collects user feedback, identifies RAG pipeline failures (routing and query rephrasal errors), and uses NVIDIA NeMo microservices for targeted fine-tuning of smaller models.
Result: Reduced routing model from 70B to 8B with 96% accuracy (10x size reduction, 70% latency improvement). Query rephrasal fine-tuning achieved 3.7% accuracy gain and 40% latency reduction. Collected 495 negative samples over 3 months.
Conclusion: Human-in-the-loop feedback structured within a data flywheel enables enterprise AI agents to become self-improving systems, providing a repeatable blueprint for building robust, adaptive AI agents that learn from real-world usage.
Abstract: Enterprise AI agents must continuously adapt to maintain accuracy, reduce latency, and remain aligned with user needs. We present a practical implementation of a data flywheel in NVInfo AI, NVIDIA’s Mixture-of-Experts (MoE) Knowledge Assistant serving over 30,000 employees. By operationalizing a MAPE-driven data flywheel, we built a closed-loop system that systematically addresses failures in retrieval-augmented generation (RAG) pipelines and enables continuous learning. Over a 3-month post-deployment period, we monitored feedback and collected 495 negative samples. Analysis revealed two major failure modes: routing errors (5.25%) and query rephrasal errors (3.2%). Using NVIDIA NeMo microservices, we implemented targeted improvements through fine-tuning. For routing, we replaced a Llama 3.1 70B model with a fine-tuned 8B variant, achieving 96% accuracy, a 10x reduction in model size, and 70% latency improvement. For query rephrasal, fine-tuning yielded a 3.7% gain in accuracy and a 40% latency reduction. Our approach demonstrates how human-in-the-loop (HITL) feedback, when structured within a data flywheel, transforms enterprise AI agents into self-improving systems. Key learnings include approaches to ensure agent robustness despite limited user feedback, navigating privacy constraints, and executing staged rollouts in production. This work offers a repeatable blueprint for building robust, adaptive enterprise AI agents capable of learning from real-world usage at scale.
[206] CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning
Hamed Mahdavi, Pouria Mahdavinia, Alireza Farhadi, Pegah Mohammadipour, Samira Malek, Majid Daliri, Pedram Mohammadipour, Alireza Hashemi, Amir Khasahmadi, Vasant Honavar
Main category: cs.AI
TL;DR: LLMs can detect proof errors but struggle with partial credit grading. The paper introduces agentic workflows that analyze reference solutions to create problem-specific rubrics, improving agreement with human grades.
Details
Motivation: Given that SOTA LLMs can now solve most IMO problems, the research assesses whether they can effectively grade proofs by detecting errors, judging severity, and assigning fair scores beyond binary correctness.Method: Used a corpus of 90 Gemini 2.5 Pro-generated solutions graded on 1-4 scale with error annotations, and MathArena solution sets for IMO/USAMO 2025 scored 0-7. Introduced agentic workflows that extract reference solutions and automatically derive problem-specific rubrics for multi-step grading.
Result: Models can reliably flag incorrect solutions but exhibit calibration gaps in partial credit assignment. The proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics.
Conclusion: Agentic workflows with problem-specific rubrics improve LLM-based proof grading, addressing calibration gaps in partial credit assignment while maintaining reliable error detection capabilities.
Abstract: State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
[207] Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, Hari Balakrishnan
Main category: cs.AI
TL;DR: Glia is an AI architecture that uses LLMs in a multi-agent workflow to autonomously design computer system mechanisms at human-expert levels, producing interpretable designs and novel insights.
Details
Motivation: To explore whether AI can autonomously design computer system mechanisms with creativity and reasoning comparable to human experts, moving beyond black-box optimization approaches.Method: Uses large language models in a human-inspired multi-agent workflow where specialized agents handle reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback.
Result: When applied to distributed GPU clusters for LLM inference, Glia produced new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior.
Conclusion: By combining reasoning LLMs with structured experimentation, AI can produce creative and understandable designs for complex systems problems.
Abstract: Can an AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired, multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning process. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.
[208] From product to system network challenges in system of systems lifecycle management
Vahid Salehi, Josef Vilsmeier, Shirui Wang
Main category: cs.AI
TL;DR: The paper proposes a practical framework for System of Systems (SoS) lifecycle management that integrates MBSE, PLM, CAD-CAE, digital thread, and digital twin to address challenges in networked product development.
Details
Motivation: Traditional linear life cycle models are insufficient for modern networked systems where interoperability, variant management, traceability, and cross-organizational governance are critical factors.Method: Proposes a reference framework combining MBSE as semantic backbone, PLM for governance, CAD-CAE as model-derived domains, and digital thread/twin for continuous feedback. Identifies four principles and a three-step roadmap for transition.
Result: The approach enables increased change robustness, shorter throughput times, improved reuse, and better sustainability decisions through measurable value contributions.
Conclusion: The framework helps decision-makers manage complexity and design scalable SoS value streams by transitioning from product-centric to network-centric development.
Abstract: Today, products are no longer isolated artifacts, but nodes in networked systems. This means that traditional, linearly conceived life cycle models are reaching their limits: Interoperability across disciplines, variant and configuration management, traceability, and governance across organizational boundaries are becoming key factors. This collective contribution classifies the state of the art and proposes a practical frame of reference for SoS lifecycle management, model-based systems engineering (MBSE) as the semantic backbone, product lifecycle management (PLM) as the governance and configuration level, CAD-CAE as model-derived domains, and digital thread and digital twin as continuous feedback. Based on current literature and industry experience, mobility, healthcare, and the public sector, we identify four principles: (1) referenced architecture and data models, (2) end-to-end configuration sovereignty instead of tool silos, (3) curated models with clear review gates, and (4) measurable value contributions along time, quality, cost, and sustainability. A three-step roadmap shows the transition from product- to network- centric development: piloting with reference architecture, scaling across variant and supply chain spaces, organizational anchoring (roles, training, compliance). The results are increased change robustness, shorter throughput times, improved reuse, and informed sustainability decisions. This article is aimed at decision-makers and practitioners who want to make complexity manageable and design SoS value streams to be scalable.
[209] Fints: Efficient Inference-Time Personalization for LLMs with Fine-Grained Instance-Tailored Steering
Kounianhua Du, Jianxing Liu, Kangning Zhang, Wenxiang Jiao, Yuan Lu, Jiarui Jin, Weiwen Liu, Yong Yu, Weinan Zhang
Main category: cs.AI
TL;DR: A fine-grained steering framework for LLM personalization that dynamically generates sample-level interference vectors from user data and injects them during forward pass, addressing limitations of existing methods in handling dynamic user patterns and data sparsity.
Details
Motivation: Current parametric adaptation methods for LLM personalization struggle with dynamic user patterns and high data sparsity due to low adaptability and data efficiency, creating a need for more flexible and efficient personalization techniques.Method: Proposes a fine-grained steering framework with two innovations: 1) fine-grained steering component that hooks activations from attention and MLP layers to capture nuanced signals, and 2) input-aware aggregation module that synthesizes these signals into contextually relevant enhancements. The method operates as a plug-in component compatible with different personalization techniques.
Result: Extensive experiments across diverse scenarios (short-to-long text generation, web function calling) show the method significantly enhances personalization performance in fast-shifting environments while maintaining robustness across varying interaction modes and context lengths.
Conclusion: The proposed fine-grained steering framework effectively addresses dynamic user patterns and data sparsity challenges in LLM personalization, demonstrating high flexibility, data efficiency, and compatibility with existing methods.
Abstract: The rapid evolution of large language models (LLMs) has intensified the demand for effective personalization techniques that can adapt model behavior to individual user preferences. Despite the non-parametric methods utilizing the in-context learning ability of LLMs, recent parametric adaptation methods, including personalized parameter-efficient fine-tuning and reward modeling emerge. However, these methods face limitations in handling dynamic user patterns and high data sparsity scenarios, due to low adaptability and data efficiency. To address these challenges, we propose a fine-grained and instance-tailored steering framework that dynamically generates sample-level interference vectors from user data and injects them into the model’s forward pass for personalized adaptation. Our approach introduces two key technical innovations: a fine-grained steering component that captures nuanced signals by hooking activations from attention and MLP layers, and an input-aware aggregation module that synthesizes these signals into contextually relevant enhancements. The method demonstrates high flexibility and data efficiency, excelling in fast-changing distribution and high data sparsity scenarios. In addition, the proposed method is orthogonal to existing methods and operates as a plug-in component compatible with different personalization techniques. Extensive experiments across diverse scenarios–including short-to-long text generation, and web function calling–validate the effectiveness and compatibility of our approach. Results show that our method significantly enhances personalization performance in fast-shifting environments while maintaining robustness across varying interaction modes and context lengths. Implementation is available at https://github.com/KounianhuaDu/Fints.
[210] A Framework for Objective-Driven Dynamical Stochastic Fields
Yibo Jacky Zhang, Sanmi Koyejo
Main category: cs.AI
TL;DR: The paper proposes a theoretical framework for intelligent fields - complex dynamical systems with goal-directed behaviors - based on three principles: complete configuration, locality, and purposefulness.
Details
Motivation: Complex dynamical and stochastic systems with goal-directed behaviors (intelligent fields) lack formal theoretical descriptions and practical applications due to their inherent complexity.Method: Proposes three fundamental principles: complete configuration, locality, and purposefulness to establish a theoretical framework, and explores AI-based methodologies for designing such fields.
Result: Establishes initial theoretical groundwork for understanding intelligent fields as objective-driven dynamical stochastic systems.
Conclusion: This foundational work aims to enable future theoretical developments and practical applications for harnessing the potential of intelligent fields.
Abstract: Fields offer a versatile approach for describing complex systems composed of interacting and dynamic components. In particular, some of these dynamical and stochastic systems may exhibit goal-directed behaviors aimed at achieving specific objectives, which we refer to as $\textit{intelligent fields}$. However, due to their inherent complexity, it remains challenging to develop a formal theoretical description of such systems and to effectively translate these descriptions into practical applications. In this paper, we propose three fundamental principles to establish a theoretical framework for understanding intelligent fields: complete configuration, locality, and purposefulness. Moreover, we explore methodologies for designing such fields from the perspective of artificial intelligence applications. This initial investigation aims to lay the groundwork for future theoretical developments and practical advances in understanding and harnessing the potential of such objective-driven dynamical stochastic fields.
[211] GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation
Tao Liu, Chongyu Wang, Rongjie Li, Yingchen Yu, Xuming He, Bai Song
Main category: cs.AI
TL;DR: GUI-Rise is a reasoning-enhanced framework for GUI navigation agents that combines structured reasoning, action prediction, and history summarization to improve cross-domain generalization and history utilization.
Details
Motivation: Current MLLM-based GUI navigation agents face limitations in cross-domain generalization and effective history utilization, requiring a more systematic approach to reasoning and memory management.Method: A framework with structured reasoning generating Chain-of-Thought analyses, action prediction, and history summarization. GUI-Rise is trained through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with GRPO using specialized rewards including history-aware objectives.
Result: State-of-the-art performance on standard benchmarks under identical training data conditions, with particularly strong results in out-of-domain scenarios, demonstrating robust reasoning and generalization across diverse GUI navigation tasks.
Conclusion: The reasoning-enhanced framework effectively addresses cross-domain generalization and history utilization challenges in GUI navigation, validating the approach’s ability to maintain robust performance across diverse tasks.
Abstract: While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, \textbf{GUI-Rise}, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework’s ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at https://leon022.github.io/GUI-Rise.
[212] Reinforcement Learning for Long-Horizon Unordered Tasks: From Boolean to Coupled Reward Machines
Kristina Levina, Nikolaos Pappas, Athanasios Karapantelakis, Aneta Vulgarakis Feljan, Jendrik Seipp
Main category: cs.AI
TL;DR: The paper introduces three generalizations of reward machines (numeric RMs, agenda RMs, and coupled RMs) and a new compositional learning algorithm (CoRM) to address scalability issues in long-horizon problems with unordered subtasks.
Details
Motivation: Standard reward machines are inefficient for long-horizon problems where subtasks can be completed in any order, as the learning complexity grows exponentially with the number of unordered subtasks.Method: Proposed three RM generalizations: numeric RMs for compact task expression, agenda RMs with state-associated agendas tracking remaining subtasks, and coupled RMs with states coupled to subtasks. Also introduced CoRM - Q-learning with coupled RMs as a compositional learning algorithm.
Result: Experiments show that CoRM scales better than state-of-the-art RM algorithms for long-horizon problems with unordered subtasks.
Conclusion: The proposed coupled RMs and CoRM algorithm effectively address the scalability limitations of traditional reward machines in complex long-horizon environments with unordered subtasks.
Abstract: Reward machines (RMs) inform reinforcement learning agents about the reward structure of the environment. This is particularly advantageous for complex non-Markovian tasks because agents with access to RMs can learn more efficiently from fewer samples. However, learning with RMs is ill-suited for long-horizon problems in which a set of subtasks can be executed in any order. In such cases, the amount of information to learn increases exponentially with the number of unordered subtasks. In this work, we address this limitation by introducing three generalisations of RMs: (1) Numeric RMs allow users to express complex tasks in a compact form. (2) In Agenda RMs, states are associated with an agenda that tracks the remaining subtasks to complete. (3) Coupled RMs have coupled states associated with each subtask in the agenda. Furthermore, we introduce a new compositional learning algorithm that leverages coupled RMs: Q-learning with coupled RMs (CoRM). Our experiments show that CoRM scales better than state-of-the-art RM algorithms for long-horizon problems with unordered subtasks.
[213] Discriminative Rule Learning for Outcome-Guided Process Model Discovery
Ali Norouzifar, Wil van der Aalst
Main category: cs.AI
TL;DR: The paper presents an outcome-aware process discovery approach that distinguishes between desirable and undesirable process executions to create focused, interpretable models that reveal key behavioral differences.
Details
Motivation: Traditional process discovery creates single models that fail to capture critical differences between desirable (efficient/compliant) and undesirable (inefficient/violating) executions, limiting effectiveness for conformance checking and performance analysis.Method: Learn interpretable discriminative rules over control-flow features to group traces by desirability, then apply process discovery separately within each group to create focused models.
Result: The approach effectively isolates and visualizes critical process patterns, as demonstrated through evaluation on multiple real-life event logs using a publicly available implementation.
Conclusion: Outcome-aware process discovery with separate modeling for desirable and undesirable executions produces more interpretable and useful models that better capture behavioral distinctions for process analysis and improvement.
Abstract: Event logs extracted from information systems offer a rich foundation for understanding and improving business processes. In many real-world applications, it is possible to distinguish between desirable and undesirable process executions, where desirable traces reflect efficient or compliant behavior, and undesirable ones may involve inefficiencies, rule violations, delays, or resource waste. This distinction presents an opportunity to guide process discovery in a more outcome-aware manner. Discovering a single process model without considering outcomes can yield representations poorly suited for conformance checking and performance analysis, as they fail to capture critical behavioral differences. Moreover, prioritizing one behavior over the other may obscure structural distinctions vital for understanding process outcomes. By learning interpretable discriminative rules over control-flow features, we group traces with similar desirability profiles and apply process discovery separately within each group. This results in focused and interpretable models that reveal the drivers of both desirable and undesirable executions. The approach is implemented as a publicly available tool and it is evaluated on multiple real-life event logs, demonstrating its effectiveness in isolating and visualizing critical process patterns.
[214] An In-depth Study of LLM Contributions to the Bin Packing Problem
Julien Herrmann, Guillaume Pallez
Main category: cs.AI
TL;DR: LLM-generated heuristics for bin packing are opaque and not truly novel; simpler, more interpretable algorithms perform better, questioning LLMs’ claimed mathematical contributions.
Details
Motivation: To reassess claims that LLMs provide meaningful mathematical insights by analyzing their generated heuristics for online bin packing problems.Method: Detailed analysis of LLM-generated heuristics, followed by development of new tailored algorithms for specific bin packing instances.
Result: LLM heuristics are largely opaque even to experts; newly developed algorithms are simpler, more efficient, interpretable, and generalizable.
Conclusion: LLMs’ claimed mathematical contributions are overstated; rigorous validation is needed when assessing LLM-generated scientific outputs.
Abstract: Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs’ contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.
[215] ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use
Mengjie Deng, Guanting Dong, Zhicheng Dou
Main category: cs.AI
TL;DR: ToolScope is an agentic framework that enables multimodal large language models to effectively use external tools for complex visual question answering tasks through global planning and local perception integration.
Details
Motivation: Multimodal LLMs struggle with flexibly and efficiently utilizing external tools during reasoning due to the complex nature of multimodal information, particularly in long-horizon VQA tasks where visual context degradation occurs.Method: ToolScope uses a three-component framework: Global Navigator for high-level strategic guidance, Agentic Executor for iterative local perception with external tools (Search, Code, Perceive), and Response Synthesizer for coherent output generation.
Result: ToolScope achieved an average performance improvement of +6.69% across four VQA benchmarks (VQA 2.0, ScienceQA, MAT-Search, MathVista), demonstrating strong generalization capabilities.
Conclusion: The framework successfully addresses visual context degradation in long-horizon VQA tasks by unifying global planning with local multimodal perception through specialized tool integration.
Abstract: Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigates visual context degradation in long-horizon VQA task. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a “telescope”, offering high-level strategic guidance. The Agentic Executor operates iteratively to augment MLLM with local perception through the integration of external tools-Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains, including VQA 2.0, ScienceQA, MAT-Search and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets.
[216] Realistic pedestrian-driver interaction modelling using multi-agent RL with human perceptual-motor constraints
Yueyang Wang, Mehmet Dogar, Gustav Markkula
Main category: cs.AI
TL;DR: A multi-agent reinforcement learning framework with visual and motor constraints outperforms other models in simulating realistic pedestrian-driver interactions at unsignalized crossings.
Details
Motivation: Existing models lack flexibility and overlook sensory/motor constraints that shape how pedestrians and drivers perceive and act in interactive scenarios.Method: Multi-agent reinforcement learning framework integrating visual and motor constraints, evaluated using real-world dataset from unsignalized pedestrian crossing with four model variants.
Result: Combined model with both visual and motor constraints performed best, producing smoother movements and more cautious behavior. Outperformed supervised behavioral cloning model in data-limited setting.
Conclusion: Multi-agent RL with human constraints is a promising approach for simulating realistic road user interactions and accounts for individual differences through population-level parameter distributions.
Abstract: Modelling pedestrian-driver interactions is critical for understanding human road user behaviour and developing safe autonomous vehicle systems. Existing approaches often rely on rule-based logic, game-theoretic models, or ‘black-box’ machine learning methods. However, these models typically lack flexibility or overlook the underlying mechanisms, such as sensory and motor constraints, which shape how pedestrians and drivers perceive and act in interactive scenarios. In this study, we propose a multi-agent reinforcement learning (RL) framework that integrates both visual and motor constraints of pedestrian and driver agents. Using a real-world dataset from an unsignalised pedestrian crossing, we evaluate four model variants, one without constraints, two with either motor or visual constraints, and one with both, across behavioural metrics of interaction realism. Results show that the combined model with both visual and motor constraints performs best. Motor constraints lead to smoother movements that resemble human speed adjustments during crossing interactions. The addition of visual constraints introduces perceptual uncertainty and field-of-view limitations, leading the agents to exhibit more cautious and variable behaviour, such as less abrupt deceleration. In this data-limited setting, our model outperforms a supervised behavioural cloning model, demonstrating that our approach can be effective without large training datasets. Finally, our framework accounts for individual differences by modelling parameters controlling the human constraints as population-level distributions, a perspective that has not been explored in previous work on pedestrian-vehicle interaction modelling. Overall, our work demonstrates that multi-agent RL with human constraints is a promising modelling approach for simulating realistic road user interactions.
[217] Dialogue as Discovery: Navigating Human Intent Through Principled Inquiry
Jianwen Sun, Yukang Feng, Yifan Chang, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yu Dai, Kaipeng Zhang
Main category: cs.AI
TL;DR: Proposes Nous, an AI agent that actively probes users to resolve intent uncertainty using information theory principles, achieving efficient collaboration without human preference annotations.
Details
Motivation: Address the 'intention expression gap' where humans struggle to convey complex thoughts to AI, leading to inefficient trial-and-error loops, especially problematic with diverse user expertise levels.Method: Training framework based on information theory, using information gain from dialogue as intrinsic reward (equivalent to Shannon entropy reduction), avoiding human preference annotations. Uses automated simulation pipeline for dataset generation.
Result: Nous achieves leading efficiency and output quality in scientific diagram generation, robust to varying user expertise. Shows generalization beyond diagram generation with principled, scalable performance.
Conclusion: Offers a principled, scalable, and adaptive paradigm for resolving user intent uncertainty in complex human-AI collaboration through Socratic questioning approach.
Abstract: A fundamental bottleneck in human-AI collaboration is the “intention expression gap,” the difficulty for humans to effectively convey complex, high-dimensional thoughts to AI. This challenge often traps users in inefficient trial-and-error loops and is exacerbated by the diverse expertise levels of users. We reframe this problem from passive instruction following to a Socratic collaboration paradigm, proposing an agent that actively probes for information to resolve its uncertainty about user intent. we name the proposed agent Nous, trained to acquire proficiency in this inquiry policy. The core mechanism of Nous is a training framework grounded in the first principles of information theory. Within this framework, we define the information gain from dialogue as an intrinsic reward signal, which is fundamentally equivalent to the reduction of Shannon entropy over a structured task space. This reward design enables us to avoid reliance on costly human preference annotations or external reward models. To validate our framework, we develop an automated simulation pipeline to generate a large-scale, preference-based dataset for the challenging task of scientific diagram generation. Comprehensive experiments, including ablations, subjective and objective evaluations, and tests across user expertise levels, demonstrate the effectiveness of our proposed framework. Nous achieves leading efficiency and output quality, while remaining robust to varying user expertise. Moreover, its design is domain-agnostic, and we show evidence of generalization beyond diagram generation. Experimental results prove that our work offers a principled, scalable, and adaptive paradigm for resolving uncertainty about user intent in complex human-AI collaboration.
[218] DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
Tian Liang, Wenxiang Jiao, Zhiwei He, Jiahao Xu, Haitao Mi, Dong Yu
Main category: cs.AI
TL;DR: DeepCompress is a framework that improves both accuracy and efficiency of Large Reasoning Models by adaptively adjusting reasoning path lengths based on problem difficulty.
Details
Motivation: Current methods improve efficiency but sacrifice accuracy. LRMs suffer from cognitive inefficiencies - overthinking simple problems and underthinking complex ones.Method: Uses adaptive length reward mechanism that dynamically classifies problems as Simple or Hard, encouraging shorter reasoning for simple problems and longer exploratory thought chains for hard problems.
Result: Outperforms baseline methods on mathematical benchmarks, achieving superior accuracy while significantly improving token efficiency.
Conclusion: DeepCompress enables LRMs to autonomously adjust CoT length, compressing reasoning for mastered problems and extending it for challenging ones, achieving better balance between accuracy and efficiency.
Abstract: Large Reasoning Models (LRMs) have demonstrated impressive capabilities but
suffer from cognitive inefficiencies like overthinking'' simple problems and underthinking’’ complex ones. While existing methods that use supervised
fine-tuning~(SFT) or reinforcement learning~(RL) with token-length rewards can
improve efficiency, they often do so at the cost of accuracy. This paper
introduces \textbf{DeepCompress}, a novel framework that simultaneously
enhances both the accuracy and efficiency of LRMs. We challenge the prevailing
approach of consistently favoring shorter reasoning paths, showing that longer
responses can contain a broader range of correct solutions for difficult
problems. DeepCompress employs an adaptive length reward mechanism that
dynamically classifies problems as Simple'' or Hard’’ in real-time based on
the model’s evolving capability. It encourages shorter, more efficient
reasoning for Simple'' problems while promoting longer, more exploratory thought chains for Hard’’ problems. This dual-reward strategy enables the
model to autonomously adjust its Chain-of-Thought (CoT) length, compressing
reasoning for well-mastered problems and extending it for those it finds
challenging. Experimental results on challenging mathematical benchmarks show
that DeepCompress consistently outperforms baseline methods, achieving superior
accuracy while significantly improving token efficiency.
[219] GeoFM: Enhancing Geometric Reasoning of MLLMs via Synthetic Data Generation through Formal Language
Yuhao Zhang, Dingxin Hu, Tinghao Yu, Hao Liu, Yiting Liu
Main category: cs.AI
TL;DR: GeoFM is a novel method for synthesizing diverse and high-fidelity geometric data using formal languages and symbolic engines, significantly improving MLLMs’ performance on geometric reasoning tasks.
Details
Motivation: MLLMs struggle with mathematical geometric reasoning due to scarce high-quality geometric data, and existing synthetic data methods produce limited diversity, noisy data, and unrealistic geometric diagrams.Method: GeoFM uses formal languages to explore condition combinations in metric space and generates correct geometric problems through a symbolic engine, creating diverse and authentic geometric data.
Result: Models trained with GeoFM data outperform GPT-4o by 18.7% on MathVista and 16.5% on GeoQA, and exceed leading open-source models by 5.7% on MathVista and 2.7% on GeoQA.
Conclusion: GeoFM effectively addresses geometric data scarcity by generating high-quality synthetic data that significantly enhances MLLMs’ geometric reasoning capabilities across multiple benchmarks.
Abstract: Multi-modal Large Language Models (MLLMs) have gained significant attention in both academia and industry for their capabilities in handling multi-modal tasks. However, these models face challenges in mathematical geometric reasoning due to the scarcity of high-quality geometric data. To address this issue, synthetic geometric data has become an essential strategy. Current methods for generating synthetic geometric data involve rephrasing or expanding existing problems and utilizing predefined rules and templates to create geometric images and problems. However, these approaches often produce data that lacks diversity or is prone to noise. Additionally, the geometric images synthesized by existing methods tend to exhibit limited variation and deviate significantly from authentic geometric diagrams. To overcome these limitations, we propose GeoFM, a novel method for synthesizing geometric data. GeoFM uses formal languages to explore combinations of conditions within metric space, generating high-fidelity geometric problems that differ from the originals while ensuring correctness through a symbolic engine. Experimental results show that our synthetic data significantly outperforms existing methods. The model trained with our data surpass the proprietary GPT-4o model by 18.7% on geometry problem-solving tasks in MathVista and by 16.5% on GeoQA. Additionally, it exceeds the performance of a leading open-source model by 5.7% on MathVista and by 2.7% on GeoQA.
[220] Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance
Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito
Main category: cs.AI
TL;DR: TempoBench is a formally verifiable diagnostic benchmark that evaluates LLM reasoning through temporal trace evaluation (TTE) and temporal causal evaluation (TCE), revealing significant performance drops as reasoning complexity increases.
Details
Motivation: Current LLM reasoning evaluation methods have limitations - ad-hoc datasets may contain bias and lack verifiability, while formal proof systems like Lean don't capture real-world agentic decision chains, creating a gap in reasoning benchmarks.Method: TempoBench uses two evaluation benchmarks: TTE tests LLM’s ability to understand and simulate multi-step reasoning system execution, and TCE tests multi-step causal reasoning and cause-effect relation extraction from complex systems.
Result: Models scored 65.6% on TCE-normal and only 7.5% on TCE-hard, showing that while state-of-the-art LLMs understand the TCE task, their performance dramatically decreases as system complexity increases.
Conclusion: TempoBench provides a formally grounded benchmark that systematically analyzes LLM reasoning capabilities, revealing significant limitations in handling complex multi-step causal reasoning despite understanding the basic tasks.
Abstract: Large Language Models (LLMs) are increasingly excelling and outpacing human performance on many tasks. However, to improve LLM reasoning, researchers either rely on ad-hoc generated datasets or formal mathematical proof systems such as the Lean proof assistant. Whilst ad-hoc generated methods can capture the decision chains of real-world reasoning processes, they may encode some inadvertent bias in the space of reasoning they cover; they also cannot be formally verified. On the other hand, systems like Lean can guarantee verifiability, but are not well-suited to capture the nature of agentic decision chain-based tasks. This creates a gap both in performance for functions such as business agents or code assistants, and in the usefulness of LLM reasoning benchmarks, whereby these fall short in reasoning structure or real-world alignment. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze how LLMs perform reasoning. TempoBench uses two evaluation benchmarks to break down reasoning ability. First, temporal trace evaluation (TTE) tests the ability of an LLM to understand and simulate the execution of a given multi-step reasoning system. Subsequently, temporal causal evaluation (TCE) tests an LLM’s ability to perform multi-step causal reasoning and to distill cause-and-effect relations from complex systems. We find that models score 65.6% on TCE-normal, and 7.5% on TCE-hard. This shows that state-of-the-art LLMs clearly understand the TCE task but perform poorly as system complexity increases. Our code is available at our \href{https://github.com/nik-hz/tempobench}{GitHub repository}.
[221] SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning
Ali Asgarov, Umid Suleymanov, Aadyant Khatri
Main category: cs.AI
TL;DR: SIGMA framework uses multi-agent system with specialized agents for reasoning, targeted searches, and synthesis to improve mathematical reasoning performance by 7.4% on challenging benchmarks.
Details
Motivation: Current retrieval-augmented models have limitations: single perspective approach, inflexible search strategies, and inability to effectively combine information from multiple sources for mathematical reasoning.Method: SIGMA framework orchestrates specialized agents that independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Agents generate hypothetical passages to optimize retrieval for their analytic perspective.
Result: Outperforms both open- and closed-source systems on MATH500, AIME, and PhD-level science QA GPQA benchmarks, achieving 7.4% absolute performance improvement.
Conclusion: Multi-agent, on-demand knowledge integration significantly enhances reasoning accuracy and efficiency, offering scalable approach for complex, knowledge-intensive problem-solving.
Abstract: Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.
[222] InnovatorBench: Evaluating Agents’ Ability to Conduct Innovative LLM Research
Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang, Keyu Li, Shijie Xia, Jie Sun, Tianze Xu, Xiangkun Hu, Pengrui Lu, Xiaojie Cai, Lyumanshan Ye, Wenhong Zhu, Yang Xiao, Pengfei Liu
Main category: cs.AI
TL;DR: InnovatorBench is a benchmark-platform pair for realistic assessment of AI agents in LLM research, featuring 20 tasks across various research components and requiring runnable artifacts with comprehensive evaluation metrics.
Details
Motivation: Existing benchmarks for AI agents in scientific discovery are narrow and simplified, failing to capture the complexity of end-to-end research processes. There's a need for more realistic assessment of agents' capabilities in automating scientific research.Method: Developed InnovatorBench with 20 research tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction. Created ResearchGym environment with rich action spaces, distributed execution, and monitoring. Implemented a lightweight ReAct agent using frontier models like Claude-4, GPT-5, GLM-4.5, and Kimi-K2.
Result: Frontier models show promise in code-driven research tasks but struggle with fragile algorithm-related tasks and long-horizon decision making (impatience, poor resource management, template-based reasoning). Agents require over 11 hours to achieve best performance, demonstrating the benchmark’s difficulty.
Conclusion: InnovatorBench represents a significant advancement in realistic AI agent evaluation for scientific research and has potential to become the next generation of code-based research benchmarks, highlighting both the capabilities and limitations of current frontier models in complex research scenarios.
Abstract: AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark’s difficulty and showing the potential of InnovatorBench to be the next generation of code-based research benchmark.
[223] VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation
Heng Ping, Arijit Bhattacharjee, Peiyu Zhang, Shixuan Li, Wei Yang, Anzhe Cheng, Xiaole Zhang, Jesse Thomason, Ali Jannesari, Nesreen Ahmed, Paul Bogdan
Main category: cs.AI
TL;DR: VeriMoA is a training-free mixture-of-agents framework that improves HDL generation through quality-guided caching and multi-path generation strategies, achieving 15-30% improvements in Pass@1 metrics.
Details
Motivation: Current multi-agent approaches for HDL generation suffer from noise propagation and constrained reasoning space exploration, while existing methods like prompt engineering and fine-tuning have limitations in knowledge coverage and training costs.Method: Proposes VeriMoA with two innovations: 1) quality-guided caching mechanism to maintain and rank all intermediate HDL outputs, and 2) multi-path generation strategy using C++ and Python as intermediate representations for two-stage specification-to-HDL translation.
Result: Achieves 15-30% improvements in Pass@1 across VerilogEval 2.0 and RTLLM 2.0 benchmarks, enabling smaller models to match larger models and fine-tuned alternatives without costly training.
Conclusion: VeriMoA provides an effective training-free paradigm for HDL generation that enhances reasoning through collaborative generation while overcoming limitations of current multi-agent approaches.
Abstract: Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain-specific constraints. While prompt engineering and fine-tuning have limitations in knowledge coverage and training costs, multi-agent architectures offer a training-free paradigm to enhance reasoning through collaborative generation. However, current multi-agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained reasoning space exploration. We propose VeriMoA, a training-free mixture-of-agents (MoA) framework with two synergistic innovations. First, a quality-guided caching mechanism to maintain all intermediate HDL outputs and enables quality-based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a multi-path generation strategy that leverages C++ and Python as intermediate representations, decomposing specification-to-HDL translation into two-stage processes that exploit LLM fluency in high-resource languages while promoting solution diversity. Comprehensive experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15–30% improvements in Pass@1 across diverse LLM backbones, especially enabling smaller models to match larger models and fine-tuned alternatives without requiring costly training.
[224] Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, Daniel Kang
Main category: cs.AI
TL;DR: BEAT is a framework that injects visual backdoor attacks into MLLM-based embodied agents using environmental objects as triggers, achieving up to 80% attack success while maintaining normal task performance.
Details
Motivation: MLLM-based embodied agents create new security vulnerabilities through visual backdoor attacks, where agents behave normally until a visual trigger appears, then persistently execute attacker-specified malicious policies.Method: BEAT uses a two-stage training: supervised fine-tuning (SFT) followed by Contrastive Trigger Learning (CTL), which formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs to sharpen decision boundaries.
Result: BEAT achieves up to 80% attack success rate across various benchmarks and MLLMs, maintains strong benign task performance, and generalizes to out-of-distribution trigger placements. CTL boosts backdoor activation accuracy by up to 39% compared to naive SFT.
Conclusion: The research exposes critical security risks in MLLM-based embodied agents, highlighting the urgent need for robust defenses before real-world deployment of these systems.
Abstract: Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.
[225] Validity Is What You Need
Sebastian Benthall, Andrew Clark
Main category: cs.AI
TL;DR: Agentic AI is defined as a software delivery mechanism that autonomously operates applications in complex enterprise settings, with success depending on user validation rather than just foundation models like LLMs.
Details
Motivation: To provide a realist definition of Agentic AI and emphasize that its success depends on validation by end users and stakeholders, not just on advanced foundation models.Method: Proposes a new definition of Agentic AI as a software delivery mechanism comparable to SaaS, and discusses the importance of validation tools and techniques for end users.
Result: The paper argues that with proper validation measures, simpler and more interpretable models can often replace foundation models like LLMs in Agentic AI systems.
Conclusion: Validity is the key requirement for Agentic AI, and LLMs are just one possible option to achieve it, with simpler models often being preferable when good validation is in place.
Abstract: While AI agents have long been discussed and studied in computer science, today’s Agentic AI systems are something new. We consider other definitions of Agentic AI and propose a new realist definition. Agentic AI is a software delivery mechanism, comparable to software as a service (SaaS), which puts an application to work autonomously in a complex enterprise setting. Recent advances in large language models (LLMs) as foundation models have driven excitement in Agentic AI. We note, however, that Agentic AI systems are primarily applications, not foundations, and so their success depends on validation by end users and principal stakeholders. The tools and techniques needed by the principal users to validate their applications are quite different from the tools and techniques used to evaluate foundation models. Ironically, with good validation measures in place, in many cases the foundation models can be replaced with much simpler, faster, and more interpretable models that handle core logic. When it comes to Agentic AI, validity is what you need. LLMs are one option that might achieve it.
[226] Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training
Dayuan Fu, Yunze Wu, Xiaojie Cai, Lyumanshan Ye, Shijie Xia, Zhen Huang, Weiye Si, Tianze Xu, Jie Sun, Keyu Li, Mohan Jiang, Junfei Wang, Qishuo Hua, Pengrui Lu, Yang Xiao, Pengfei Liu
Main category: cs.AI
TL;DR: Apollo is a sampling framework that integrates asynchronous human guidance with action-level data filtering to train LLM agents on long-horizon, domain-specialized tasks more efficiently than existing methods.
Details
Motivation: Training LLM agents for long-horizon, domain-specialized tasks is challenging due to the high cost of dense human annotations in behavior cloning and the rarity of positive trajectories in outcome-driven sampling.Method: Apollo allows human annotators to intervene only when agents drift from promising trajectories, providing prior knowledge and strategic advice. It then applies supervision control to filter sub-optimal actions and prevent error propagation.
Result: When applied to train GLM-4.5 on InnovatorBench, Apollo achieved over 50% improvement over the untrained baseline and 28% improvement over a variant trained without human interaction.
Conclusion: Apollo demonstrates the critical role of human-in-the-loop sampling and provides a robust framework for handling long-horizon, domain-specialized tasks effectively.
Abstract: Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo’s design in handling long-horizon, domain-specialized tasks.
[227] MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design
Wei Zhang, Zekun Guo, Yingce Xia, Peiran Jin, Shufang Xie, Tao Qin, Xiang-Yang Li
Main category: cs.AI
TL;DR: MolChord is a novel SBDD approach that aligns protein and molecule structures with textual descriptions using NatureLM and diffusion models, while guiding molecule generation toward desired properties through DPO optimization.
Details
Motivation: To address the challenge of effectively aligning protein structural representations with molecular representations and ensuring alignment between generated drugs and their pharmacological properties in structure-based drug design.Method: Integrates two techniques: (1) uses NatureLM (autoregressive model unifying text, molecules, and proteins) with diffusion-based structure encoder to align structures with textual descriptions; (2) curates property-aware dataset and refines alignment using Direct Preference Optimization (DPO).
Result: Achieves state-of-the-art performance on CrossDocked2020 benchmark, demonstrating superior performance on key evaluation metrics.
Conclusion: MolChord shows strong potential as a practical tool for structure-based drug design by effectively addressing representation alignment and property guidance challenges.
Abstract: Structure-based drug design (SBDD), which maps target proteins to candidate molecular ligands, is a fundamental task in drug discovery. Effectively aligning protein structural representations with molecular representations, and ensuring alignment between generated drugs and their pharmacological properties, remains a critical challenge. To address these challenges, we propose MolChord, which integrates two key techniques: (1) to align protein and molecule structures with their textual descriptions and sequential representations (e.g., FASTA for proteins and SMILES for molecules), we leverage NatureLM, an autoregressive model unifying text, small molecules, and proteins, as the molecule generator, alongside a diffusion-based structure encoder; and (2) to guide molecules toward desired properties, we curate a property-aware dataset by integrating preference data and refine the alignment process using Direct Preference Optimization (DPO). Experimental results on CrossDocked2020 demonstrate that our approach achieves state-of-the-art performance on key evaluation metrics, highlighting its potential as a practical tool for SBDD.
[228] A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods
Yang Zhang, Hanlei Jin, Dan Meng, Jun Wang, Jinghua Tan
Main category: cs.AI
TL;DR: This survey provides a comprehensive overview of Automatic Text Summarization (ATS) from a practical “Process-Oriented Schema” perspective, specifically focusing on LLM-based methods to bridge a two-year gap in literature.
Details
Motivation: Previous ATS surveys lack practicality for real-world implementations and don't adequately address the impact of Large Language Models (LLMs) on conventional ATS methods.Method: The survey adopts a “Process-Oriented Schema” perspective aligned with real-world implementations and comprehensively reviews the latest LLM-based ATS works.
Result: This is the first survey to specifically investigate LLM-based ATS methods, providing an up-to-date overview that bridges the two-year gap in ATS literature.
Conclusion: The survey successfully delivers a comprehensive and practical overview of ATS that addresses the limitations of previous theoretical approaches and incorporates the transformative impact of LLMs on the field.
Abstract: Automatic Text Summarization (ATS), utilizing Natural Language Processing (NLP) algorithms, aims to create concise and accurate summaries, thereby significantly reducing the human effort required in processing large volumes of text. ATS has drawn considerable interest in both academic and industrial circles. Many studies have been conducted in the past to survey ATS methods; however, they generally lack practicality for real-world implementations, as they often categorize previous methods from a theoretical standpoint. Moreover, the advent of Large Language Models (LLMs) has altered conventional ATS methods. In this survey, we aim to 1) provide a comprehensive overview of ATS from a ``Process-Oriented Schema’’ perspective, which is best aligned with real-world implementations; 2) comprehensively review the latest LLM-based ATS works; and 3) deliver an up-to-date survey of ATS, bridging the two-year gap in the literature. To the best of our knowledge, this is the first survey to specifically investigate LLM-based ATS methods.
[229] VRoPE: Rotary Position Embedding for Video Large Language Models
Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, Jing Liu
Main category: cs.AI
TL;DR: VRoPE is a novel positional encoding method for Video-LLMs that addresses limitations of existing RoPE adaptations by providing balanced spatial encoding and smooth video-text transitions, achieving superior performance in video understanding tasks.
Details
Motivation: Existing RoPE adaptations for video suffer from positional bias in attention distribution and disruptions in video-text transitions, limiting their effectiveness in handling the complex spatiotemporal structure of video frames.Method: VRoPE introduces a balanced encoding strategy to mitigate attention biases and restructures positional indices to ensure smooth transitions between video and text tokens, specifically designed for Video-LLMs.
Result: Extensive experiments show VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks across different models.
Conclusion: VRoPE effectively addresses the limitations of existing positional encoding methods for video, providing a more robust solution for Video-LLMs that handles spatiotemporal complexity and video-text transitions better than previous approaches.
Abstract: Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code is available at https://github.com/johncaged/VRoPE.
[230] Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
Zhaoxin Li, Zhang Xi-Jia, Batuhan Altundas, Letian Chen, Rohan Paleja, Matthew Gombolay
Main category: cs.AI
TL;DR: iTRACE is an automated framework that uses vision-language models for semantic feature extraction and trains interpretable tree-based reinforcement learning policies, eliminating the need for manual human annotation while maintaining performance comparable to black-box models.
Details
Motivation: Traditional RL lacks semantic interpretability due to reliance on manually specified features and black-box policies, which hinders transparency and verifiability of decision-making.Method: Uses pre-trained vision-language models for automated concept extraction, distills VLM outputs into lightweight models, and trains interpretable tree-based policies via reinforcement learning.
Result: iTRACE outperforms other interpretable policy baselines and matches black-box policy performance across Atari games, grid-world navigation, and driving domains.
Conclusion: The framework successfully automates interpretable RL by leveraging VLMs, addressing limitations of both manual annotation and standalone VLMs while maintaining competitive performance.
Abstract: Semantic interpretability in Reinforcement Learning (RL) enables transparency and verifiability of decision-making. Achieving semantic interpretability in reinforcement learning requires (1) a feature space composed of human-understandable concepts and (2) a policy that is interpretable and verifiable. However, constructing such a feature space has traditionally relied on manual human specification, which often fails to generalize to unseen environments. Moreover, even when interpretable features are available, most reinforcement learning algorithms employ black-box models as policies, thereby hindering transparency. We introduce interpretable Tree-based Reinforcement learning via Automated Concept Extraction (iTRACE), an automated framework that leverages pre-trained vision-language models (VLM) for semantic feature extraction and train a interpretable tree-based model via RL. To address the impracticality of running VLMs in RL loops, we distill their outputs into a lightweight model. By leveraging Vision-Language Models (VLMs) to automate tree-based reinforcement learning, iTRACE loosens the reliance the need for human annotation that is traditionally required by interpretable models. In addition, it addresses key limitations of VLMs alone, such as their lack of grounding in action spaces and their inability to directly optimize policies. We evaluate iTRACE across three domains: Atari games, grid-world navigation, and driving. The results show that iTRACE outperforms other interpretable policy baselines and matches the performance of black-box policies on the same interpretable feature space.
[231] Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning
Minwu Kim, Anubhav Shrestha, Safal Shrestha, Aadim Nepal, Keith Ross
Main category: cs.AI
TL;DR: RLVR improves accuracy but not capability in LLM reasoning tasks, while distillation can improve both. RLVR focuses on easier questions at the expense of harder ones, and capability improvement requires introducing new knowledge rather than just distilling reasoning patterns.
Details
Motivation: To understand why reinforcement learning with verifiable rewards (RLVR) enhances accuracy but fails to improve capability in LLM reasoning tasks, while distillation can improve both, and to investigate the underlying mechanisms.Method: Analyzed RLVR’s impact on question difficulty distribution, examined response quality changes in small model settings, and conducted distillation experiments comparing knowledge introduction vs. reasoning pattern distillation.
Result: RLVR improves easier questions but harms hardest questions; produces new quality responses not in original distribution; capability improves only when distillation introduces new knowledge, not just reasoning patterns.
Conclusion: RLVR and reasoning pattern distillation sacrifice performance on difficult questions while improving accuracy on easier ones. True capability improvement requires introducing new knowledge, providing clearer understanding of how these methods shape LLM reasoning behavior.
Abstract: Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy (pass@1) but often fails to improve capability (pass@k) of LLMs in reasoning tasks, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR struggles to improve capability as it focuses on improving the accuracy of the easier questions to the detriment of the accuracy of the most difficult questions. Second, we show that RLVR does not merely increase the success probability for the easier questions, but in our small model settings, produces quality responses that were absent in its original output distribution. In addition, we show these responses are neither noticeably longer nor feature more reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, from the experiment distilling teacher responses to in-distribution problems, we find that capability does not always improve with distillation. We conjecture that capability improves only when new knowledge is introduced, whereas distilling reasoning patterns only improves accuracy but not capability, sacrificing performance on the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in LLMs
[232] Building Trustworthy AI by Addressing its 16+2 Desiderata with Goal-Directed Commonsense Reasoning
Alexis R. Tudor, Yankai Zeng, Huaduo Wang, Joaquin Arias, Gopal Gupta
Main category: cs.AI
TL;DR: The paper proposes s(CASP), a goal-directed constraint-based answer set programming reasoner, as a middle ground between sub-symbolic ML (like LLMs) and complex rule-based systems (like Cyc) to achieve trustworthy AI with human-style commonsense reasoning.
Details
Motivation: Current AI systems face trustworthiness issues - LLMs hallucinate and lack explainability, while rule-based systems are complex with many reasoners. There's a need for reliable, explainable commonsense reasoning for trustworthy AI.Method: Uses s(CASP), a goal-directed constraint-based answer set programming reasoner that employs a small number of mechanisms to emulate human-style commonsense reasoning. Supports 16 desiderata for trustworthy AI plus two additional ones: inconsistency detection and assumption of alternative worlds.
Result: Demonstrates feasibility through diverse applications including a conversational chatbot and a virtually embodied reasoner, showing synergies of the approach.
Conclusion: s(CASP) provides a viable middle ground approach for trustworthy AI that combines the benefits of both sub-symbolic and rule-based systems while addressing their limitations.
Abstract: Current advances in AI and its applicability have highlighted the need to ensure its trustworthiness for legal, ethical, and even commercial reasons. Sub-symbolic machine learning algorithms, such as the LLMs, simulate reasoning but hallucinate and their decisions cannot be explained or audited (crucial aspects for trustworthiness). On the other hand, rule-based reasoners, such as Cyc, are able to provide the chain of reasoning steps but are complex and use a large number of reasoners. We propose a middle ground using s(CASP), a goal-directed constraint-based answer set programming reasoner that employs a small number of mechanisms to emulate reliable and explainable human-style commonsense reasoning. In this paper, we explain how s(CASP) supports the 16 desiderata for trustworthy AI introduced by Doug Lenat and Gary Marcus (2023), and two additional ones: inconsistency detection and the assumption of alternative worlds. To illustrate the feasibility and synergies of s(CASP), we present a range of diverse applications, including a conversational chatbot and a virtually embodied reasoner.
[233] Don’t throw the baby out with the bathwater: How and why deep learning for ARC
Jack Cole, Mohamed Osman
Main category: cs.AI
TL;DR: The paper proposes using deep learning with test-time fine-tuning and augmentation techniques to achieve state-of-the-art performance on the Abstraction and Reasoning Corpus (ARC-AGI), demonstrating significant accuracy improvements.
Details
Motivation: Despite low performance on ARC-AGI, deep learning remains the most effective paradigm for training neural networks across diverse domains. The authors aim to fully leverage deep learning's capacity to acquire novel abstractions for ARC reasoning.Method: Incorporates on-the-fly neural network training at test time, treating both the network and optimizer as integral to inference. Uses Test-Time Fine-Tuning (TTFT) and Augment Inference Reverse-Augmentation and Vote (AIRV) techniques, starting from pretrained LLMs.
Result: Achieved up to 260% accuracy boost with AIRV and additional 300% boost with TTFT. Won first place in 2023 ARCathon competition and achieved current best score (58%) on ARC private test-set.
Conclusion: Deep learning can be effectively used for ARC reasoning, highlighting key ingredients for robust reasoning systems in unfamiliar domains and mechanisms that improve broad perceptual reasoning.
Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems. Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language etc. The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains. Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time. We demonstrate that fully committing to deep learning’s capacity to acquire novel abstractions yields state-of-the-art performance on ARC. Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks. Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques. We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT. An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%). Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning.
[234] NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration
Yan Jiang, Hao Zhou, LiZhong GU, Ai Han, TianLong Li
Main category: cs.AI
TL;DR: NaviAgent is a bilevel architecture that decouples task planning from tool execution using graph-based modeling to address limitations of step-by-step tool calling in LLM agents, enabling scalable navigation of large tool ecosystems.
Details
Motivation: Existing LLM agents call tools step by step without global task structure view, leading to error accumulation and limited scalability when dealing with thousands of interdependent tools.Method: Proposes NaviAgent with two levels: task-planning level where LLM decides response strategy, and execution level with Tool World Navigation Model (TWNM) that encodes tool relations to guide robust invocation sequences. Supports closed-loop optimization through real tool interaction feedback.
Result: Achieves best task success rates across models and tasks. Integrating TWNM boosts performance by up to 17 points on complex tasks, demonstrating effective toolchain orchestration.
Conclusion: NaviAgent moves beyond simple tool calling toward adaptive navigation of large-scale tool ecosystems through bilevel architecture and graph-based modeling, enabling scalable and robust tool interactions.
Abstract: Large language models (LLMs) have recently demonstrated the ability to act as function call agents by invoking external tools, enabling them to solve tasks beyond their static knowledge. However, existing agents typically call tools step by step at a time without a global view of task structure. As tools depend on each other, this leads to error accumulation and limited scalability, particularly when scaling to thousands of tools. To address these limitations, we propose NaviAgent, a novel bilevel architecture that decouples task planning from tool execution through graph-based modeling of the tool ecosystem. At the task-planning level, the LLM-based agent decides whether to respond directly, clarify user intent, invoke a toolchain, or execute tool outputs, ensuring broad coverage of interaction scenarios independent of inter-tool complexity. At the execution level, a continuously evolving Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, guiding the agent to generate scalable and robust invocation sequences. By incorporating feedback from real tool interactions, NaviAgent supports closed-loop optimization of planning and execution, moving beyond tool calling toward adaptive navigation of large-scale tool ecosystems. Experiments show that NaviAgent achieves the best task success rates across models and tasks, and integrating TWMN further boosts performance by up to 17 points on complex tasks, underscoring its key role in toolchain orchestration.
[235] HiRA: A Hierarchical Reasoning Framework for Decoupled Planning and Execution in Deep Search
Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yang Zhao, Hongjin Qian, Zhicheng Dou
Main category: cs.AI
TL;DR: HiRA introduces a hierarchical framework that separates strategic planning from specialized execution for complex search tasks, outperforming state-of-the-art RAG and agent-based systems.
Details
Motivation: Traditional RAG pipelines struggle with complex information needs requiring deep reasoning across diverse sources, and current reasoning approaches inefficiently use a single model for both planning and execution.Method: Decomposes complex search tasks into focused subtasks, assigns each to domain-specific agents with external tools and reasoning capabilities, and coordinates results through structured integration.
Result: Significantly outperforms state-of-the-art systems on four complex cross-modal deep search benchmarks, showing improvements in both answer quality and system efficiency.
Conclusion: The hierarchical decoupling of planning and execution is effective for multi-step information seeking tasks, enabling specialized expertise while maintaining strategic coherence.
Abstract: Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at https://github.com/ignorejjj/HiRA.
[236] Red Teaming AI Red Teaming
Subhabrata Majumdar, Brian Pendleton, Abhishek Gupta
Main category: cs.AI
TL;DR: AI red teaming has shifted from critical thinking to narrow model-level flaw detection, overlooking broader sociotechnical systems. A new framework proposes macro-level system red teaming and micro-level model red teaming with six recommendations for examining emergent risks and systemic vulnerabilities.
Details
Motivation: Current AI red teaming focuses too narrowly on individual model vulnerabilities while ignoring broader sociotechnical systems and emergent behaviors from complex interactions between models, users, and environments.Method: Proposes a comprehensive framework with two levels: macro-level system red teaming spanning the entire AI development lifecycle, and micro-level model red teaming. Draws on cybersecurity experience and systems theory with six recommendations.
Result: A framework that addresses the deficiency in current AI red teaming by examining emergent risks, systemic vulnerabilities, and the interplay between technical and social factors through multifunctional teams.
Conclusion: Effective AI red teaming requires moving beyond narrow model-level focus to examine broader sociotechnical systems, emergent behaviors, and systemic vulnerabilities through comprehensive frameworks and multifunctional teams.
Abstract: Red teaming has evolved from its origins in military applications to become a widely adopted methodology in cybersecurity and AI. In this paper, we take a critical look at the practice of AI red teaming. We argue that despite its current popularity in AI governance, there exists a significant gap between red teaming’s original intent as a critical thinking exercise and its narrow focus on discovering model-level flaws in the context of generative AI. Current AI red teaming efforts focus predominantly on individual model vulnerabilities while overlooking the broader sociotechnical systems and emergent behaviors that arise from complex interactions between models, users, and environments. To address this deficiency, we propose a comprehensive framework operationalizing red teaming in AI systems at two levels: macro-level system red teaming spanning the entire AI development lifecycle, and micro-level model red teaming. Drawing on cybersecurity experience and systems theory, we further propose a set of six recommendations. In these, we emphasize that effective AI red teaming requires multifunctional teams that examine emergent risks, systemic vulnerabilities, and the interplay between technical and social factors.
[237] Why Isn’t Relational Learning Taking Over the World?
David Poole
Main category: cs.AI
TL;DR: Relational learning should be more prominent in AI since real-world data is often relational (spreadsheets, databases) rather than just text/images, but it hasn’t achieved widespread adoption due to limitations with complex relations.
Details
Motivation: Current AI focuses too much on modeling pixels, words, and phonemes, while the real world consists of entities with properties and relations. Most valuable company data exists in relational formats like spreadsheets and databases.Method: The paper analyzes why relational learning hasn’t achieved widespread adoption despite its potential, examining cases where it works (with restricted relations) and identifying what needs to be addressed.
Result: Relational learning is not taking over the world except in limited cases with restricted relations, indicating current limitations in handling complex relational data.
Conclusion: Specific improvements are needed to make relational learning achieve its rightful prominence in AI, moving beyond current limitations with complex relations.
Abstract: Artificial intelligence seems to be taking over the world with systems that model pixels, words, and phonemes. The world is arguably made up, not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that concentrating on modeling words and pixels is because all of the (valuable) data in the world is in terms of text and images. If you look into almost any company you will find their most valuable data is in spreadsheets, databases and other relational formats. These are not the form that are studied in introductory machine learning, but are full of product numbers, student numbers, transaction numbers and other identifiers that can’t be interpreted naively as numbers. The field that studies this sort of data has various names including relational learning, statistical relational AI, and many others. This paper explains why relational learning is not taking over the world – except in a few cases with restricted relations – and what needs to be done to bring it to it’s rightful prominence.
[238] Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind
Myung Ho Kim
Main category: cs.AI
TL;DR: The paper reports an unintentional structural convergence among four major cognitive theories (Kahneman, Friston, Minsky, Clark) within the Agentic Flow AI architecture, which achieves 95.8% task success compared to 62.3% for baseline LLMs.
Details
Motivation: To address limitations of large language models (LLMs) by developing a practical AI architecture that naturally incorporates principles from multiple cognitive theories without deliberate synthesis.Method: Developed Agentic Flow architecture with five interlocking modules (Retrieval, Cognition, Control, Memory, Action) organized into a repeatable cognitive loop, later formalized as the Structured Cognitive Loop (SCL).
Result: The structured agent achieved 95.8% task success versus 62.3% for baseline LLMs, demonstrating robust constraint adherence and reproducible reasoning, while revealing computational motifs from all four cognitive theories.
Conclusion: Intelligent architectures may naturally evolve toward shared structural patterns shaped by practical constraints, with Agentic Flow/SCL showing how unified cognitive forms can arise from real-world reasoning necessities rather than theoretical abstraction.
Abstract: We report a structural convergence among four influential theories of mind: Kahneman’s dual-system theory, Friston’s predictive processing, Minsky’s society of mind, and Clark’s extended mind, emerging unintentionally within a practical AI architecture known as Agentic Flow. Designed to address the limitations of large language models (LLMs), Agentic Flow comprises five interlocking modules: Retrieval, Cognition, Control, Memory, and Action, organized into a repeatable cognitive loop. Although originally inspired only by Minsky and Clark, subsequent analysis revealed that its structure echoes computational motifs from all four theories, suggesting that theoretical convergence can emerge naturally from implementation demands rather than deliberate synthesis. Controlled evaluations confirmed this: the structured agent achieved 95.8% task success versus 62.3% for baseline LLMs, demonstrating robust constraint adherence and reproducible reasoning. We describe this convergence under a broader descriptive meta-architecture called PEACE, highlighting recurring design patterns such as predictive modeling, associative recall, and error-sensitive control. Later formalized as the Structured Cognitive Loop (SCL), this framework generalizes the same principles as a foundation for behavioral intelligence in LLM-based agents. Rather than claiming theoretical unification, this paper proposes that intelligent architectures may evolve toward shared structural patterns shaped by practical constraints. As a position paper, it aims to frame this convergence as an interpretive reflection rather than a finalized theory, inviting further theoretical and experimental dialogue. Agentic Flow, or equivalently the Structured Cognitive Loop, thus offers a glimpse of how a unified cognitive form can arise not from abstraction, but from the necessities of real-world reasoning.
[239] Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents
Thassilo M. Schiepanski, Nicholas Piël
Main category: cs.AI
TL;DR: D2Snap is a DOM downsampling algorithm that enables web agents to use DOM snapshots instead of GUI screenshots, achieving comparable or better performance while maintaining manageable token sizes.
Details
Motivation: Current web agents rely on GUI snapshots (screenshots) due to LLMs' better vision capabilities, but DOM snapshots are structurally more suitable for LLMs. However, DOM snapshots require too many tokens for reliable implementation.Method: Proposed D2Snap algorithm for downsampling DOM snapshots to reduce token size while preserving essential UI hierarchy and structure. Evaluated using GPT-4o on Online-Mind2Web dataset tasks.
Result: D2Snap-downsampled DOM snapshots achieved 67% success rate, matching the GUI snapshot baseline (65%) while using similar token order (1e3). Best configurations outperformed baseline by 8% with slightly higher token count.
Conclusion: DOM snapshots with proper downsampling can match or exceed GUI snapshot performance for web agents, and DOM hierarchy provides strong UI features for LLMs.
Abstract: Frontier LLMs only recently enabled serviceable, autonomous web agents. At that, a model poses as an instantaneous domain model backend. Ought to suggest interaction, it is consulted with a web-based task and respective application state. The key problem lies in application state serialisation - referred to as snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues. Not least to resemble human perception, but for images representing relatively cheap means of model input. LLM vision still lag behind code interpretation capabilities. DOM snapshots, which structurally resemble HTML, impose a desired alternative. Vast model input token size, however, disables reliable implementation with web agents to date. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) - within the same input token order of magnitude (1e3). Our best evaluated configurations - one token order above, but within the model’s context window - outperform this baseline by 8%. Our evaluation, moreover, yields that DOM-inherent hierarchy embodies a strong UI feature for LLMs.
[240] Artificially intelligent agents in the social and behavioral sciences: A history and outlook
Petter Holme, Milena Tsvetkova
Main category: cs.AI
TL;DR: Historical review of AI agents in social and behavioral sciences from 1950s to present, covering social simulations, intelligent agents, big data, and generative AI applications.
Details
Motivation: To document the evolution and current trends of AI agents in social sciences, highlighting how technological advancements have transformed scientific understanding of human behavior.Method: Comprehensive literature review and historical analysis tracing developments from early social simulations to modern large language models.
Result: Identifies key milestones including early social simulations, rise of social systems science, intelligent game theory agents, big data era, and current generative AI applications.
Conclusion: AI technologies are deeply intertwined with how we understand human behavior, with each technological advancement bringing new perspectives and challenges to social science research.
Abstract: We review the historical development and current trends of artificially intelligent agents (agentic AI) in the social and behavioral sciences: from the first programmable computers, and social simulations soon thereafter, to today’s experiments with large language models. This overview emphasizes the role of AI in the scientific process and the changes brought about, both through technological advancements and the broader evolution of science from around 1950 to the present. Some of the specific points we cover include: the challenges of presenting the first social simulation studies to a world unaware of computers, the rise of social systems science, intelligent game theoretic agents, the age of big data and the epistemic upheaval in its wake, and the current enthusiasm around applications of generative AI, and many other topics. A pervasive theme is how deeply entwined we are with the technologies we use to understand ourselves.
[241] BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction
Tian Xia, Tianrun Gao, Wenhao Deng, Long Wei, Xiaowei Qian, Yixian Jiang, Chenglei Yu, Tailin Wu
Main category: cs.AI
TL;DR: BuildArena is the first physics-aligned interactive benchmark for evaluating LLMs’ capabilities in engineering construction automation, featuring customizable framework, extendable tasks, 3D computation library, and baseline agent workflow.
Details
Motivation: To address the gap in evaluating LLMs' construction competencies despite their broad knowledge and reasoning capabilities, as engineering construction automation requires complex reasoning under strict physical constraints.Method: Developed BuildArena benchmark with four components: customizable framework, extendable task design across static/dynamic mechanics, 3D Spatial Geometric Computation Library, and baseline LLM agentic workflow for comprehensive evaluation.
Result: Comprehensively evaluated eight frontier LLMs on their capabilities for language-driven and physics-grounded construction automation using the BuildArena benchmark.
Conclusion: BuildArena provides the first comprehensive benchmark for assessing LLMs in engineering construction automation, enabling systematic evaluation of language-driven construction capabilities under physical constraints.
Abstract: Engineering construction automation aims to transform natural language specifications into physically viable structures, requiring complex integrated reasoning under strict physical constraints. While modern LLMs possess broad knowledge and strong reasoning capabilities that make them promising candidates for this domain, their construction competencies remain largely unevaluated. To address this gap, we introduce BuildArena, the first physics-aligned interactive benchmark designed for language-driven engineering construction. It contributes to the community in four aspects: (1) a highly customizable benchmarking framework for in-depth comparison and analysis of LLMs; (2) an extendable task design strategy spanning static and dynamic mechanics across multiple difficulty tiers; (3) a 3D Spatial Geometric Computation Library for supporting construction based on language instructions; (4) a baseline LLM agentic workflow that effectively evaluates diverse model capabilities. On eight frontier LLMs, BuildArena comprehensively evaluates their capabilities for language-driven and physics-grounded construction automation. The project page is at https://build-arena.github.io/.
[242] RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs
Joe Meyer, Divyansha Lachi, Mahmoud Mohammadi, Roshan Reddy Upendra, Eva L. Dyer, Mark Li, Tom Palczewski
Main category: cs.AI
TL;DR: RELATE is a schema-agnostic feature encoder for heterogeneous temporal graphs that uses shared modality-specific encoders and cross-attention to create fixed-size node representations, achieving near-schema-specific performance with 5x parameter reduction.
Details
Motivation: Existing GNNs require schema-specific feature encoders with separate modules for each node type and feature column, which limits scalability and parameter sharing across different datasets and schemas.Method: RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into fixed-size, permutation-invariant node representations.
Result: On RelBench benchmark with ReLGNN and HGT, RELATE achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x.
Conclusion: RELATE enables varying schema support and multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
Abstract: Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
[243] Towards the Formalization of a Trustworthy AI for Mining Interpretable Models explOiting Sophisticated Algorithms
Riccardo Guidotti, Martina Cinquini, Marta Marchiori Manerba, Mattia Setzu, Francesco Spinnato
Main category: cs.AI
TL;DR: The MIMOSA framework formalizes interpretable-by-design models that balance interpretability with performance while embedding key ethical properties like causality, fairness, and privacy.
Details
Motivation: Interpretable models are crucial for trust, accountability, and safe adoption of automated decision-making in real-world applications.Method: Formalizes supervised learning across diverse data types, characterizes three families of interpretable models (feature importance, rule, and instance based), and embeds ethical properties through formal definitions, evaluation metrics, and verification procedures.
Result: Establishes theoretical foundations for developing AI systems that are accurate, interpretable, fair, privacy-preserving, and causally aware.
Conclusion: The framework enables trustworthy AI systems by evaluating ethical measures during model generation and addressing trade-offs between interpretability, performance, and ethical properties.
Abstract: Interpretable-by-design models are crucial for fostering trust, accountability, and safe adoption of automated decision-making models in real-world applications. In this paper we formalize the ground for the MIMOSA (Mining Interpretable Models explOiting Sophisticated Algorithms) framework, a comprehensive methodology for generating predictive models that balance interpretability with performance while embedding key ethical properties. We formally define here the supervised learning setting across diverse decision-making tasks and data types, including tabular data, time series, images, text, transactions, and trajectories. We characterize three major families of interpretable models: feature importance, rule, and instance based models. For each family, we analyze their interpretability dimensions, reasoning mechanisms, and complexity. Beyond interpretability, we formalize three critical ethical properties, namely causality, fairness, and privacy, providing formal definitions, evaluation metrics, and verification procedures for each. We then examine the inherent trade-offs between these properties and discuss how privacy requirements, fairness constraints, and causal reasoning can be embedded within interpretable pipelines. By evaluating ethical measures during model generation, this framework establishes the theoretical foundations for developing AI systems that are not only accurate and interpretable but also fair, privacy-preserving, and causally aware, i.e., trustworthy.
[244] A Survey of AI Scientists
Guiyao Tie, Pan Zhou, Lichao Sun
Main category: cs.AI
TL;DR: This survey introduces a unified six-stage framework to systematically analyze the evolution of AI scientists - autonomous systems that conduct end-to-end scientific workflows from hypothesis generation to paper writing.
Details
Motivation: The rapid proliferation of AI scientist systems has created a fragmented research landscape, obscuring methodological principles and developmental trends, necessitating a systematic synthesis of the field.Method: The paper proposes a six-stage methodological framework deconstructing the scientific process: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation.
Result: The survey charts the field’s evolution from early Foundational Modules (2022-2023) to integrated Closed-Loop Systems (2024), and to current focus on Scalability, Impact, and Human-AI Collaboration (2025-present).
Conclusion: The framework provides a critical roadmap for overcoming challenges in robustness and governance, guiding next-generation systems toward becoming trustworthy partners in human scientific inquiry.
Abstract: Artificial intelligence is undergoing a profound transition from a computational instrument to an autonomous originator of scientific knowledge. This emerging paradigm, the AI scientist, is architected to emulate the complete scientific workflow-from initial hypothesis generation to the final synthesis of publishable findings-thereby promising to fundamentally reshape the pace and scale of discovery. However, the rapid and unstructured proliferation of these systems has created a fragmented research landscape, obscuring overarching methodological principles and developmental trends. This survey provides a systematic and comprehensive synthesis of this domain by introducing a unified, six-stage methodological framework that deconstructs the end-to-end scientific process into: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation. Through this analytical lens, we chart the field’s evolution from early Foundational Modules (2022-2023) to integrated Closed-Loop Systems (2024), and finally to the current frontier of Scalability, Impact, and Human-AI Collaboration (2025-present). By rigorously synthesizing these developments, this survey not only clarifies the current state of autonomous science but also provides a critical roadmap for overcoming remaining challenges in robustness and governance, ultimately guiding the next generation of systems toward becoming trustworthy and indispensable partners in human scientific inquiry.
[245] Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives
Kentaro Ozeki, Risako Ando, Takanobu Morishita, Hirohiko Abe, Koji Mineshima, Mitsuhiro Okada
Main category: cs.AI
TL;DR: This paper evaluates LLMs’ normative reasoning capabilities by comparing them with epistemic reasoning, revealing inconsistencies and cognitive biases despite generally following valid patterns.
Details
Motivation: To systematically assess LLMs' ability to handle normative reasoning involving obligation and permission, which remains underexplored despite their strong performance in other reasoning tasks.Method: Created a new dataset covering formal reasoning patterns in both normative and epistemic domains, incorporating cognitive factors that influence human reasoning. Compared LLMs’ performance on normative vs epistemic modals.
Result: LLMs generally adhere to valid reasoning patterns but show notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those in human reasoning studies.
Conclusion: The findings highlight challenges in achieving logical consistency in LLMs’ normative reasoning and provide insights for enhancing their reliability in handling normative concepts.
Abstract: Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs’ reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs’ normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at https://github.com/kmineshima/NeuBAROCO.
cs.SD
[246] GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment
Jinting Wang, Chenxing Li, Li Liu
Main category: cs.SD
TL;DR: GACA-DiT is a diffusion transformer framework for dance-to-music generation that improves rhythmic alignment and temporal synchronization through genre-adaptive rhythm extraction and context-aware temporal alignment modules.
Details
Motivation: Existing dance-to-music generation methods use coarse rhythm embeddings that discard fine-grained motion cues, leading to weak rhythmic alignment and temporal mismatches from feature downsampling.Method: Proposes GACA-DiT with two novel modules: 1) genre-adaptive rhythm extraction using multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting, and 2) context-aware temporal alignment using learnable context queries to align music latents with dance rhythm features.
Result: Extensive experiments on AIST++ and TikTok datasets show GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation.
Conclusion: GACA-DiT effectively addresses rhythmic alignment and temporal synchronization issues in dance-to-music generation, achieving superior performance compared to existing approaches.
Abstract: Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements. Existing methods typically rely on coarse rhythm embeddings, such as global motion features or binarized joint-based rhythm values, which discard fine-grained motion cues and result in weak rhythmic alignment. Moreover, temporal mismatches introduced by feature downsampling further hinder precise synchronization between dance and music. To address these problems, we propose \textbf{GACA-DiT}, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation. First, a \textbf{genre-adaptive rhythm extraction} module combines multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting to capture fine-grained, genre-specific rhythm patterns. Second, a \textbf{context-aware temporal alignment} module resolves temporal mismatches using learnable context queries to align music latents with relevant dance rhythm features. Extensive experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation. Project page: https://beria-moon.github.io/GACA-DiT/.
[247] Oral Tradition-Encoded NanyinHGNN: Integrating Nanyin Music Preservation and Generation through a Pipa-Centric Dataset
Jianbing Xiahou, Weixi Zhai, Xu Cui
Main category: cs.SD
TL;DR: NanyinHGNN is a heterogeneous graph network model for generating Nanyin instrumental music, addressing challenges in preserving this UNESCO intangible cultural heritage by converting symbolic sequences into graph structures and generating ornamentations through rule-guided refinement.
Details
Motivation: Nanyin music faces preservation challenges due to its heterophonic tradition where core melodies are notated but ornamentations are orally transmitted, creating difficulties for both preservation and contemporary innovation in computational ethnomusicology.Method: Constructed a Pipa-Centric MIDI dataset, developed NanyinTok tokenization method, converted symbolic sequences into graph structures using Graph Converter, and used graph neural networks to generate melodic outlines optimized for ornamentations followed by rule-guided refinement based on Nanyin performance practices.
Result: The model successfully generates authentic heterophonic ensembles featuring four traditional instruments, demonstrating effective integration of domain-specific knowledge to address data scarcity in computational ethnomusicology.
Conclusion: Integrating domain-specific knowledge into model architecture can effectively mitigate data scarcity challenges in computational ethnomusicology, as validated by the successful generation of authentic Nanyin instrumental music.
Abstract: We propose NanyinHGNN, a heterogeneous graph network model for generating Nanyin instrumental music. As a UNESCO-recognized intangible cultural heritage, Nanyin follows a heterophonic tradition centered around the pipa, where core melodies are notated in traditional notation while ornamentations are passed down orally, presenting challenges for both preservation and contemporary innovation. To address this, we construct a Pipa-Centric MIDI dataset, develop NanyinTok as a specialized tokenization method, and convert symbolic sequences into graph structures using a Graph Converter to ensure that key musical features are preserved. Our key innovation reformulates ornamentation generation as the creation of ornamentation nodes within a heterogeneous graph. First, a graph neural network generates melodic outlines optimized for ornamentations. Then, a rule-guided system informed by Nanyin performance practices refines these outlines into complete ornamentations without requiring explicit ornamentation annotations during training. Experimental results demonstrate that our model successfully generates authentic heterophonic ensembles featuring four traditional instruments. These findings validate that integrating domain-specific knowledge into model architecture can effectively mitigate data scarcity challenges in computational ethnomusicology.
[248] Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
Jiarong Du, Zhan Jin, Peijun Yang, Juan Liu, Zhuo Li, Xin Liu, Ming Li
Main category: cs.SD
TL;DR: Proposed an effective audio-visual speech enhancement system with a ‘separation before dereverberation’ pipeline that performs well in complex acoustic environments, achieving first place in the AVSEC-4 challenge.
Details
Motivation: Most previous AVSE methods struggle with complex real-world acoustic environments containing interfering sounds and reverberation, resulting in poor perceptual quality of extracted speech.Method: Designed a ‘separation before dereverberation’ pipeline that can be extended to other AVSE networks, specifically developed for handling complex acoustic conditions.
Result: Achieved excellent results in three objective metrics on the AVSEC-4 competition leaderboard and secured first place in the human subjective listening test.
Conclusion: The proposed AVSE system effectively handles complex acoustic environments and demonstrates superior performance compared to previous methods.
Abstract: Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker’s speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a “separation before dereverberation” pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.
[249] Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features
Unzela Talpur, Zafi Sherhan Syed, Muhammad Shehram Shah Syed, Abbas Shah Syed
Main category: cs.SD
TL;DR: This paper investigates Urdu Speech Emotion Recognition in cross-corpus settings, revealing that self-corpus validation overestimates performance compared to more realistic cross-corpus evaluation.
Details
Motivation: Speech Emotion Recognition is challenging for low-resource languages like Urdu, and cross-corpus evaluation for Urdu SER remains largely unexplored, making it important to test model generalization across different datasets.Method: Used cross-corpus evaluation framework across three Urdu emotional speech datasets with eGeMAPS and ComParE acoustic feature sets, processed by Logistic Regression and Multilayer Perceptron classifiers, assessed using unweighted average recall.
Result: Self-corpus validation overestimates performance by up to 13% compared to cross-corpus evaluation, showing cross-corpus evaluation provides more realistic measure of model robustness.
Conclusion: Cross-corpus validation is crucial for Urdu SER and contributes to advancing affective computing research for underrepresented language communities.
Abstract: Speech Emotion Recognition (SER) is a key affective computing technology that enables emotionally intelligent artificial intelligence. While SER is challenging in general, it is particularly difficult for low-resource languages such as Urdu. This study investigates Urdu SER in a cross-corpus setting, an area that has remained largely unexplored. We employ a cross-corpus evaluation framework across three different Urdu emotional speech datasets to test model generalization. Two standard domain-knowledge based acoustic feature sets, eGeMAPS and ComParE, are used to represent speech signals as feature vectors which are then passed to Logistic Regression and Multilayer Perceptron classifiers. Classification performance is assessed using unweighted average recall (UAR) whilst considering class-label imbalance. Results show that Self-corpus validation often overestimates performance, with UAR exceeding cross-corpus evaluation by up to 13%, underscoring that cross-corpus evaluation offers a more realistic measure of model robustness. Overall, this work emphasizes the importance of cross-corpus validation for Urdu SER and its implications contribute to advancing affective computing research for underrepresented language communities.
[250] Expressive Range Characterization of Open Text-to-Audio Models
Jonathan Morse, Azadeh Naderi, Swen Gaudl, Mark Cartwright, Amy K. Hoover, Mark J. Nelson
Main category: cs.SD
TL;DR: This paper adapts expressive range analysis (ERA) from level generation to text-to-audio models, analyzing their output variability and fidelity for fixed prompts derived from environmental sounds.
Details
Motivation: Text-to-audio models are increasingly used in procedurally generated content but lack clear understanding of what they generate and with what variability. Audio is a broad output class that needs systematic evaluation.Method: Adapted expressive range analysis (ERA) to text-to-audio models by using fixed prompts from ESC-50 dataset and analyzing output along acoustic dimensions like pitch, loudness, and timbre.
Result: The paper provides a framework for ERA-based exploratory evaluation of generative audio models, characterizing their output space quantitatively.
Conclusion: ERA can be successfully adapted to evaluate text-to-audio models, offering a systematic way to analyze their expressive range and output characteristics for specific prompts.
Abstract: Text-to-audio models are a type of generative model that produces audio output in response to a given textual prompt. Although level generators and the properties of the functional content that they create (e.g., playability) dominate most discourse in procedurally generated content (PCG), games that emotionally resonate with players tend to weave together a range of creative and multimodal content (e.g., music, sounds, visuals, narrative tone), and multimodal models have begun seeing at least experimental use for this purpose. However, it remains unclear what exactly such models generate, and with what degree of variability and fidelity: audio is an extremely broad class of output for a generative system to target. Within the PCG community, expressive range analysis (ERA) has been used as a quantitative way to characterize generators’ output space, especially for level generators. This paper adapts ERA to text-to-audio models, making the analysis tractable by looking at the expressive range of outputs for specific, fixed prompts. Experiments are conducted by prompting the models with several standardized prompts derived from the Environmental Sound Classification (ESC-50) dataset. The resulting audio is analyzed along key acoustic dimensions (e.g., pitch, loudness, and timbre). More broadly, this paper offers a framework for ERA-based exploratory evaluation of generative audio models.
[251] Representing Classical Compositions through Implication-Realization Temporal-Gestalt Graphs
A. V. Bomediano, R. J. Conanan, L. D. Santuyo, A. Coronel
Main category: cs.SD
TL;DR: A graph-based computational approach that operationalizes cognitive music models (I-R model and Temporal Gestalt theory) to analyze musical structure through perceptual segmentation and melodic expectancy patterns.
Details
Motivation: To bridge the gap between traditional music analysis methods and cognitive models by developing a computational framework that captures how listeners perceive and anticipate musical structure.Method: Segments melodies into perceptual units, annotates with I-R patterns, uses Dynamic Time Warping for comparison, organizes into k-nearest neighbors graphs with nodes labeled by melodic expectancy values from Schellenberg’s two-factor model.
Result: Graph-based representations successfully distinguish intra- and inter-graph structures, capture structural and stylistic features beyond composer identity, and reflect perceptual similarity at segment level.
Conclusion: Graph-based methods provide a structured, cognitively informed framework for computational music analysis that enables nuanced understanding of musical structure through listener perception.
Abstract: Understanding the structural and cognitive underpinnings of musical compositions remains a key challenge in music theory and computational musicology. While traditional methods focus on harmony and rhythm, cognitive models such as the Implication-Realization (I-R) model and Temporal Gestalt theory offer insight into how listeners perceive and anticipate musical structure. This study presents a graph-based computational approach that operationalizes these models by segmenting melodies into perceptual units and annotating them with I-R patterns. These segments are compared using Dynamic Time Warping and organized into k-nearest neighbors graphs to model intra- and inter-segment relationships. Each segment is represented as a node in the graph, and nodes are further labeled with melodic expectancy values derived from Schellenberg’s two-factor I-R model-quantifying pitch proximity and pitch reversal at the segment level. This labeling enables the graphs to encode both structural and cognitive information, reflecting how listeners experience musical tension and resolution. To evaluate the expressiveness of these graphs, we apply the Weisfeiler-Lehman graph kernel to measure similarity between and within compositions. Results reveal statistically significant distinctions between intra- and inter-graph structures. Segment-level analysis via multidimensional scaling confirms that structural similarity at the graph level reflects perceptual similarity at the segment level. Graph2vec embeddings and clustering demonstrate that these representations capture stylistic and structural features that extend beyond composer identity. These findings highlight the potential of graph-based methods as a structured, cognitively informed framework for computational music analysis, enabling a more nuanced understanding of musical structure and style through the lens of listener perception.
[252] UTI-LLM: A Personalized Articulatory-Speech Therapy Assistance System Based on Multimodal Large Language Model
Yudong Yang, Xiaokang Liu, Shaofeng zhao, Rongfeng Su, Nan Yan, Lan Wang
Main category: cs.SD
TL;DR: An MLLM-based speech therapy system using ultrasound tongue imaging and speech signals for real-time articulatory feedback, addressing limitations of traditional methods through a specialized dataset and spatiotemporal fusion training.
Details
Motivation: Traditional speech therapy systems lack real-time accessibility and articulatory motion feedback. MLLMs show healthcare potential but face challenges in articulatory information processing and domain-specific data scarcity.Method: Proposed MLLM-based system using ultrasound tongue imaging and speech signals, built a domain-specific dataset with ultrasound-speech dialogue pairs, and developed spatiotemporal fusion training strategy for fine-grained articulatory analysis.
Result: Experimental results show the model effectively performs articulatory analysis and clinical assessment, demonstrating improved precision in articulatory feedback generation.
Conclusion: The proposed MLLM-based approach successfully addresses limitations in speech therapy by providing precise, interactive articulatory feedback through multimodal fusion and domain adaptation.
Abstract: Speech therapy is essential for rehabilitating speech disorders caused by neurological impairments such as stroke. However, traditional manual and computer-assisted systems are limited in real-time accessibility and articulatory motion feedback. Recent advances in multimodal large language models (MLLMs) have demonstrated significant potential in healthcare, especially through their adaptive assessment and therapeutic feedback capabilities. Nevertheless, challenges including insufficient acquisition and fusion of articulatory information, inadequate parsing of articulatory organ motion trajectories, and the scarcity of domain-specific datasets hinder the application of MLLMs in speech therapy. To address these limitations, we propose an MLLM-based speech rehabilitation assistance system that leverages ultrasound tongue imaging and speech signals to deliver precise, interactive articulatory feedback. We construct a high-quality domain-specific dataset comprising ultrasound-speech dialogue pairs. This dataset facilitates fine-tuning to enhance the model’s clinical adaptability. Furthermore, our method develops spatiotemporal fusion training strategy of ultrasound videos and speech signals, enabling fine-grained articulatory impairment analysis and ultimately generating actionable feedback. Experimental results demonstrate the effectiveness of our model in articulatory analysis and clinical assessment.
[253] ‘Studies for’: A Human-AI Co-Creative Sound Artwork Using a Real-time Multi-channel Sound Generation Model
Chihiro Nagashima, Akira Takahashi, Zhi Zhong, Shusuke Takahashi, Yuki Mitsufuji
Main category: cs.SD
TL;DR: This paper presents Studies for, a generative sound installation using AI to create a ’new form of archive’ that preserves an artist’s style while generating new sound elements in real-time.
Details
Motivation: To explore AI integration in artistic workflows and create a speculative archival system that preserves an artist's style while generating new creative outputs beyond their existing body of work.Method: Developed SpecMaskGIT, a lightweight sound generation AI model trained on 200+ hours of artist Evala’s past works, integrated into an eight-channel real-time sound installation with artist feedback loops.
Result: Successfully created an immersive auditory experience that generates new sounds while maintaining the artist’s artistic identity, demonstrating effective Human-AI co-creation in sound art.
Conclusion: Proposes a Human-AI co-creation framework for sound art that enables new possibilities for creating and archiving artistic works that extend beyond an artist’s physical existence.
Abstract: This paper explores the integration of AI technologies into the artistic workflow through the creation of Studies for, a generative sound installation developed in collaboration with sound artist Evala (https://www.ntticc.or.jp/en/archive/works/studies-for/). The installation employs SpecMaskGIT, a lightweight yet high-quality sound generation AI model, to generate and playback eight-channel sound in real-time, creating an immersive auditory experience over the course of a three-month exhibition. The work is grounded in the concept of a “new form of archive,” which aims to preserve the artistic style of an artist while expanding beyond artists’ past artworks by continued generation of new sound elements. This speculative approach to archival preservation is facilitated by training the AI model on a dataset consisting of over 200 hours of Evala’s past sound artworks. By addressing key requirements in the co-creation of art using AI, this study highlights the value of the following aspects: (1) the necessity of integrating artist feedback, (2) datasets derived from an artist’s past works, and (3) ensuring the inclusion of unexpected, novel outputs. In Studies for, the model was designed to reflect the artist’s artistic identity while generating new, previously unheard sounds, making it a fitting realization of the concept of “a new form of archive.” We propose a Human-AI co-creation framework for effectively incorporating sound generation AI models into the sound art creation process and suggest new possibilities for creating and archiving sound art that extend an artist’s work beyond their physical existence. Demo page: https://sony.github.io/studies-for/
cs.LG
[254] Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning
Svetlana Churina, Niranjan Chebrolu, Kokil Jaidka
Main category: cs.LG
TL;DR: LLMs exhibit human-like vulnerability to misinformation through repeated exposure during continual pre-training, showing persistent representational drift even with minimal poisoned data.
Details
Motivation: To investigate whether LLMs show similar vulnerability to the illusory truth effect in humans, where repeated exposure to falsehoods increases belief in their accuracy, particularly during continual pre-training.Method: Developed Layer of Truth framework and dataset to inject controlled poisoned data and probe intermediate representations across checkpoints, model scales, and question types.
Result: Even minimal exposure to false but confidently stated facts induces persistent representational drift in well-established facts, with susceptibility varying across layers and model sizes.
Conclusion: Continually updated LLMs can internalize misinformation analogously to humans, highlighting the need for robust monitoring of factual integrity during model updates.
Abstract: Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition - where repeated exposure to falsehoods increases belief in their accuracy - we ask whether LLMs exhibit a similar vulnerability. We investigate whether repeated exposure to false but confidently stated facts can shift a model’s internal representation away from the truth. We introduce Layer of Truth, a framework and dataset for probing belief dynamics in continually trained LLMs. By injecting controlled amounts of poisoned data and probing intermediate representations across checkpoints, model scales, and question types, we quantify when and how factual beliefs shift. Our findings reveal that even minimal exposure can induce persistent representational drift in well-established facts, with susceptibility varying across layers and model sizes. These results highlight an overlooked vulnerability of continually updated LLMs: their capacity to internalize misinformation analogously to humans, underscoring the need for robust monitoring of factual integrity during model updates.
[255] SmoothGuard: Defending Multimodal Large Language Models with Noise Perturbation and Clustering Aggregation
Guangzhi Su, Shuchang Huang, Yutong Ke, Zhuohang Liu, Long Qian, Kaizhu Huang
Main category: cs.LG
TL;DR: SmoothGuard is a lightweight defense framework that enhances multimodal LLM robustness against adversarial attacks through randomized noise injection and clustering-based prediction aggregation.
Details
Motivation: Multimodal LLMs are vulnerable to adversarial manipulations, raising safety and reliability concerns in deployment.Method: Perturbs continuous modalities with Gaussian noise, generates multiple candidate outputs, and applies embedding-based clustering to filter out adversarial predictions. Final answer is selected from majority cluster.
Result: Extensive experiments show SmoothGuard improves resilience to adversarial attacks while maintaining competitive utility on benchmarks like POPE, LLaVA-Bench, and MM-SafetyBench.
Conclusion: SmoothGuard provides effective defense against adversarial attacks with optimal noise range (0.1-0.2) balancing robustness and utility.
Abstract: Multimodal large language models (MLLMs) have achieved impressive performance across diverse tasks by jointly reasoning over textual and visual inputs. Despite their success, these models remain highly vulnerable to adversarial manipulations, raising concerns about their safety and reliability in deployment. In this work, we first generalize an approach for generating adversarial images within the HuggingFace ecosystem and then introduce SmoothGuard, a lightweight and model-agnostic defense framework that enhances the robustness of MLLMs through randomized noise injection and clustering-based prediction aggregation. Our method perturbs continuous modalities (e.g., images and audio) with Gaussian noise, generates multiple candidate outputs, and applies embedding-based clustering to filter out adversarially influenced predictions. The final answer is selected from the majority cluster, ensuring stable responses even under malicious perturbations. Extensive experiments on POPE, LLaVA-Bench (In-the-Wild), and MM-SafetyBench demonstrate that SmoothGuard improves resilience to adversarial attacks while maintaining competitive utility. Ablation studies further identify an optimal noise range (0.1-0.2) that balances robustness and utility.
[256] Accurate Target Privacy Preserving Federated Learning Balancing Fairness and Utility
Kangkang Sun, Jun Wu, Minyi Guo, Jianhua Li, Jianwei Huang
Main category: cs.LG
TL;DR: FedPF is a differentially private fair federated learning algorithm that transforms multi-objective optimization into a zero-sum game between fairness, privacy, and model utility, revealing an inherent tension between privacy protection and bias correction.
Details
Motivation: Federated Learning enables collaborative training without data sharing, but participants face challenges in simultaneously ensuring fairness across demographic groups while protecting sensitive client data.Method: Introduces FedPF algorithm that transforms multi-objective optimization into a zero-sum game where fairness and privacy constraints compete against model utility, with theoretical analysis revealing inverse relationships between objectives.
Result: Experimental validation shows up to 42.9% discrimination reduction across three datasets while maintaining competitive accuracy, but reveals that privacy-fairness tension is unavoidable.
Conclusion: Achieving both privacy and fairness objectives simultaneously requires carefully balanced compromises rather than optimization of either in isolation, as stricter privacy protection fundamentally limits the system’s ability to detect and correct demographic biases.
Abstract: Federated Learning (FL) enables collaborative model training without data sharing, yet participants face a fundamental challenge, e.g., simultaneously ensuring fairness across demographic groups while protecting sensitive client data. We introduce a differentially private fair FL algorithm (\textit{FedPF}) that transforms this multi-objective optimization into a zero-sum game where fairness and privacy constraints compete against model utility. Our theoretical analysis reveals a surprising inverse relationship, i.e., stricter privacy protection fundamentally limits the system’s ability to detect and correct demographic biases, creating an inherent tension between privacy and fairness. Counterintuitively, we prove that moderate fairness constraints initially improve model generalization before causing performance degradation, where a non-monotonic relationship that challenges conventional wisdom about fairness-utility tradeoffs. Experimental validation demonstrates up to 42.9 % discrimination reduction across three datasets while maintaining competitive accuracy, but more importantly, reveals that the privacy-fairness tension is unavoidable, i.e., achieving both objectives simultaneously requires carefully balanced compromises rather than optimization of either in isolation. The source code for our proposed algorithm is publicly accessible at https://github.com/szpsunkk/FedPF.
[257] CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
Zhiyuan Ning, Jiawei Shao, Ruge Xu, Xinfei Guo, Jun Zhang, Chi Zhang, Xuelong Li
Main category: cs.LG
TL;DR: CAS-Spec is a novel self-speculative decoding method that uses dynamically switchable inference acceleration strategies and a Dynamic Tree Cascade algorithm to achieve state-of-the-art acceleration for LLM inference without requiring specialized training.
Details
Motivation: Existing self-speculative methods fall short of specialized training methods' speed gains, while cascade methods are impractical due to high training costs of multiple models.Method: Uses dynamically switchable inference acceleration (layer sparsity, activation quantization) to create draft models and introduces Dynamic Tree Cascade algorithm for adaptive routing and draft length assignment.
Result: Achieves 1.1× to 2.3× speedup over autoregressive decoding across various LLMs and datasets, with DyTC improving speedup by 47% over cascade baseline and 48% over tree baseline.
Conclusion: CAS-Spec provides state-of-the-art acceleration that can be easily integrated into existing LLMs and has promising potential for further improvement as self-speculative techniques evolve.
Abstract: Speculative decoding has become a widely adopted as an effective technique for lossless inference acceleration when deploying large language models (LLMs). While on-the-fly self-speculative methods offer seamless integration and broad utility, they often fall short of the speed gains achieved by methods relying on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. In this paper, we propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method which constructs speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies, including layer sparsity and activation quantization. Furthermore, traditional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative decoding methods. We introduce a Dynamic Tree Cascade (DyTC) algorithm that adaptively routes the multi-level draft models and assigns the draft lengths, based on the heuristics of acceptance rates and latency prediction. Our CAS-Spec method achieves state-of-the-art acceleration compared to existing on-the-fly speculative decoding methods, with an average speedup from $1.1\times$ to $2.3\times$ over autoregressive decoding across various LLMs and datasets. DyTC improves the average speedup by $47$% and $48$% over cascade-based baseline and tree-based baseline algorithms, respectively. CAS-Spec can be easily integrated into most existing LLMs and holds promising potential for further acceleration as self-speculative decoding techniques continue to evolve.
[258] BI-DCGAN: A Theoretically Grounded Bayesian Framework for Efficient and Diverse GANs
Mahsa Valizadeh, Rui Tuo, James Caverlee
Main category: cs.LG
TL;DR: BI-DCGAN is a Bayesian extension of DCGAN that addresses mode collapse by incorporating model uncertainty, using Bayes by Backprop and variational inference to enhance sample diversity while maintaining computational efficiency.
Details
Motivation: GANs suffer from mode collapse, producing limited outputs that fail to capture full data distribution, which is problematic for real-world applications requiring diversity and uncertainty awareness.Method: Extends DCGAN with Bayesian modeling using Bayes by Backprop to learn weight distributions and mean-field variational inference to approximate posterior distributions during GAN training.
Result: Establishes theoretical proof that Bayesian modeling enhances GAN diversity through covariance matrix analysis, and experimental validation shows BI-DCGAN produces more diverse and robust outputs than conventional DCGANs while maintaining training efficiency.
Conclusion: BI-DCGAN provides a scalable solution for applications requiring diversity and uncertainty, positioning it as a practical alternative to resource-intensive methods like diffusion models.
Abstract: Generative Adversarial Networks (GANs) are proficient at generating synthetic data but continue to suffer from mode collapse, where the generator produces a narrow range of outputs that fool the discriminator but fail to capture the full data distribution. This limitation is particularly problematic, as generative models are increasingly deployed in real-world applications that demand both diversity and uncertainty awareness. In response, we introduce BI-DCGAN, a Bayesian extension of DCGAN that incorporates model uncertainty into the generative process while maintaining computational efficiency. BI-DCGAN integrates Bayes by Backprop to learn a distribution over network weights and employs mean-field variational inference to efficiently approximate the posterior distribution during GAN training. We establishes the first theoretical proof, based on covariance matrix analysis, that Bayesian modeling enhances sample diversity in GANs. We validate this theoretical result through extensive experiments on standard generative benchmarks, demonstrating that BI-DCGAN produces more diverse and robust outputs than conventional DCGANs, while maintaining training efficiency. These findings position BI-DCGAN as a scalable and timely solution for applications where both diversity and uncertainty are critical, and where modern alternatives like diffusion models remain too resource-intensive.
[259] Challenges in Credit Assignment for Multi-Agent Reinforcement Learning in Open Agent Systems
Alireza Saleh Abadi, Leen-Kiat Soh
Main category: cs.LG
TL;DR: This paper analyzes how openness in multi-agent reinforcement learning systems affects credit assignment, showing that dynamic agent populations, tasks, and agent types cause credit misattribution and performance degradation.
Details
Motivation: Understanding the dynamics of open systems in MARL is crucial, as traditional credit assignment methods assume static environments but real-world systems have dynamic agent populations, evolving tasks, and changing agent capabilities.Method: The authors conduct both conceptual analysis (introducing new sub-categories of openness) and empirical study using temporal and structural algorithms in open environments to examine how openness breaks traditional assumptions.
Result: The empirical results demonstrate that openness directly causes credit misattribution, evidenced by unstable loss functions and significant performance degradation in the tested algorithms.
Conclusion: Openness in MARL systems fundamentally challenges traditional credit assignment methods, leading to misattribution and performance issues that require new approaches designed specifically for dynamic environments.
Abstract: In the rapidly evolving field of multi-agent reinforcement learning (MARL), understanding the dynamics of open systems is crucial. Openness in MARL refers to the dynam-ic nature of agent populations, tasks, and agent types with-in a system. Specifically, there are three types of openness as reported in (Eck et al. 2023) [2]: agent openness, where agents can enter or leave the system at any time; task openness, where new tasks emerge, and existing ones evolve or disappear; and type openness, where the capabil-ities and behaviors of agents change over time. This report provides a conceptual and empirical review, focusing on the interplay between openness and the credit assignment problem (CAP). CAP involves determining the contribution of individual agents to the overall system performance, a task that becomes increasingly complex in open environ-ments. Traditional credit assignment (CA) methods often assume static agent populations, fixed and pre-defined tasks, and stationary types, making them inadequate for open systems. We first conduct a conceptual analysis, in-troducing new sub-categories of openness to detail how events like agent turnover or task cancellation break the assumptions of environmental stationarity and fixed team composition that underpin existing CAP methods. We then present an empirical study using representative temporal and structural algorithms in an open environment. The results demonstrate that openness directly causes credit misattribution, evidenced by unstable loss functions and significant performance degradation.
[260] Integrating Ontologies with Large Language Models for Enhanced Control Systems in Chemical Engineering
Crystal Su, Kuai Yu, Jingrui Zhang, Mingyuan Shao, Daniel Bauer
Main category: cs.LG
TL;DR: An ontology-integrated LLM framework for chemical engineering that combines structured domain knowledge with generative reasoning through COPE ontology alignment, constrained decoding, and evaluation metrics.
Details
Motivation: To create a transparent, auditable approach for applying LLMs to critical engineering contexts like process control and safety analysis by integrating symbolic structure with neural generation.Method: Pipeline with data acquisition, semantic preprocessing, information extraction, ontology mapping to produce templated QA pairs for fine-tuning, plus control-focused decoding and citation gate for factual grounding.
Result: Developed a framework that enforces syntactic and factual grounding by constraining outputs to ontology-linked terms, with metrics quantifying both linguistic quality and ontological accuracy.
Conclusion: The integration of symbolic structure and neural generation provides enhanced interpretability and reliability for applying LLMs to critical chemical engineering applications.
Abstract: This work presents an ontology-integrated large language model (LLM) framework for chemical engineering that unites structured domain knowledge with generative reasoning. The proposed pipeline aligns model training and inference with the COPE ontology through a sequence of data acquisition, semantic preprocessing, information extraction, and ontology mapping steps, producing templated question-answer pairs that guide fine-tuning. A control-focused decoding stage and citation gate enforce syntactic and factual grounding by constraining outputs to ontology-linked terms, while evaluation metrics quantify both linguistic quality and ontological accuracy. Feedback and future extensions, including semantic retrieval and iterative validation, further enhance the system’s interpretability and reliability. This integration of symbolic structure and neural generation provides a transparent, auditable approach for applying LLMs to process control, safety analysis, and other critical engineering contexts.
[261] Discovering EV Charging Site Archetypes Through Few Shot Forecasting: The First U.S.-Wide Study
Kshitij Nikhal, Luke Ackerknecht, Benjamin S. Riggan, Phil Stahlfeld
Main category: cs.LG
TL;DR: A framework combining clustering and few-shot forecasting to predict EV charging demand using large-scale data, enabling better infrastructure planning and grid resilience.
Details
Motivation: Existing EV charging behavior models are limited by small datasets, simple temporal modeling, and poor generalization to new sites with limited history.Method: Integrates clustering with few-shot forecasting to identify site archetypes using a novel large-scale charging demand dataset.
Result: Archetype-specific expert models outperform global baselines in forecasting demand at unseen sites.
Conclusion: The framework enables operators to lower costs, optimize energy/pricing strategies, and support grid resilience critical to climate goals through actionable infrastructure segmentation insights.
Abstract: The decarbonization of transportation relies on the widespread adoption of electric vehicles (EVs), which requires an accurate understanding of charging behavior to ensure cost-effective, grid-resilient infrastructure. Existing work is constrained by small-scale datasets, simple proximity-based modeling of temporal dependencies, and weak generalization to sites with limited operational history. To overcome these limitations, this work proposes a framework that integrates clustering with few-shot forecasting to uncover site archetypes using a novel large-scale dataset of charging demand. The results demonstrate that archetype-specific expert models outperform global baselines in forecasting demand at unseen sites. By establishing forecast performance as a basis for infrastructure segmentation, we generate actionable insights that enable operators to lower costs, optimize energy and pricing strategies, and support grid resilience critical to climate goals.
[262] MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Zimeng Huang, Jinxin Ke, Xiaoxuan Fan, Yufeng Yang, Yang Liu, Liu Zhonghan, Zedi Wang, Junteng Dai, Haoyi Jiang, Yuyu Zhou, Keze Wang, Ziliang Chen
Main category: cs.LG
TL;DR: MM-OPERA is a benchmark for evaluating association intelligence in Large Vision-Language Models through two open-ended tasks: Remote-Item Association and In-Context Association, using 11,497 instances and LLM-as-a-Judge evaluation strategies.
Details
Motivation: Current LVLMs show deficiencies in human-like intelligence like association reasoning, and existing benchmarks are limited to closed-ended tasks that don't capture the complexity of open-ended association reasoning needed for real-world applications.Method: Created MM-OPERA benchmark with 11,497 instances across RIA and ICA tasks, using free-form responses and explicit reasoning paths. Deployed tailored LLM-as-a-Judge strategies with process-reward-informed judgment to evaluate open-ended outputs.
Result: Extensive empirical studies on state-of-the-art LVLMs revealed comprehensive limitations in associative reasoning, including sensitivity analysis, validity analysis of evaluation strategies, and diversity analysis across abilities, domains, languages, and cultures.
Conclusion: The benchmark provides nuanced understanding of LVLM limitations in associative reasoning and paves the way for more human-like and general-purpose AI systems.
Abstract: Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks: Remote-Item Association (RIA) and In-Context Association (ICA), aligning association intelligence evaluation with human psychometric principles. It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, cultures, etc., provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at https://github.com/MM-OPERA-Bench/MM-OPERA.
[263] Mind the Gaps: Auditing and Reducing Group Inequity in Large-Scale Mobility Prediction
Ashwin Kumar, Hanyu Zhang, David A. Schweidel, William Yeoh
Main category: cs.LG
TL;DR: The paper audits mobility prediction models for demographic disparities and proposes FGIS, a fairness-guided sampling method that reduces performance gaps between racial groups by up to 40% with minimal accuracy loss.
Details
Motivation: To address hidden disparities in mobility prediction models based on user demographics and expose structural inequities in prediction pipelines.Method: Proposed Fairness-Guided Incremental Sampling (FGIS) with Size-Aware K-Means (SAKM) clustering to create proxy racial labels from census data, then incrementally samples users to reduce performance gaps while maintaining accuracy.
Result: Reduces total disparity between racial groups by up to 40% with minimal accuracy trade-offs, with most significant improvements in early sampling stages.
Conclusion: Lightweight, data-centric interventions can effectively improve fairness in mobility prediction with little added complexity, especially beneficial for low-data applications.
Abstract: Next location prediction underpins a growing number of mobility, retail, and public-health applications, yet its societal impacts remain largely unexplored. In this paper, we audit state-of-the-art mobility prediction models trained on a large-scale dataset, highlighting hidden disparities based on user demographics. Drawing from aggregate census data, we compute the difference in predictive performance on racial and ethnic user groups and show a systematic disparity resulting from the underlying dataset, resulting in large differences in accuracy based on location and user groups. To address this, we propose Fairness-Guided Incremental Sampling (FGIS), a group-aware sampling strategy designed for incremental data collection settings. Because individual-level demographic labels are unavailable, we introduce Size-Aware K-Means (SAKM), a clustering method that partitions users in latent mobility space while enforcing census-derived group proportions. This yields proxy racial labels for the four largest groups in the state: Asian, Black, Hispanic, and White. Built on these labels, our sampling algorithm prioritizes users based on expected performance gains and current group representation. This method incrementally constructs training datasets that reduce demographic performance gaps while preserving overall accuracy. Our method reduces total disparity between groups by up to 40% with minimal accuracy trade-offs, as evaluated on a state-of-art MetaPath2Vec model and a transformer-encoder model. Improvements are most significant in early sampling stages, highlighting the potential for fairness-aware strategies to deliver meaningful gains even in low-resource settings. Our findings expose structural inequities in mobility prediction pipelines and demonstrate how lightweight, data-centric interventions can improve fairness with little added complexity, especially for low-data applications.
[264] Can machines think efficiently?
Adam Winchell
Main category: cs.LG
TL;DR: Proposes an updated Turing Test that incorporates energy efficiency as an additional constraint to evaluate intelligence.
Details
Motivation: The original Turing Test is no longer adequate as AI systems can pass it, and there are growing ethical and environmental concerns about AI's resource consumption.Method: Expands the original imitation game by adding energy consumption as a constraint, evaluating intelligence through the lens of efficiency.
Result: Creates a test that connects abstract thinking to concrete resource limitations and provides a measurable, practical finish line for intelligence evaluation.
Conclusion: The energy-constrained Turing Test forces society to consider the trade-off between AI’s time-saving benefits and its total resource costs.
Abstract: The Turing Test is no longer adequate for distinguishing human and machine intelligence. With advanced artificial intelligence systems already passing the original Turing Test and contributing to serious ethical and environmental concerns, we urgently need to update the test. This work expands upon the original imitation game by accounting for an additional factor: the energy spent answering the questions. By adding the constraint of energy, the new test forces us to evaluate intelligence through the lens of efficiency, connecting the abstract problem of thinking to the concrete reality of finite resources. Further, this proposed new test ensures the evaluation of intelligence has a measurable, practical finish line that the original test lacks. This additional constraint compels society to weigh the time savings of using artificial intelligence against its total resource cost.
[265] Predicting Household Water Consumption Using Satellite and Street View Images in Two Indian Cities
Qiao Wang, Joseph George
Main category: cs.LG
TL;DR: Using publicly available imagery and geospatial data to predict household water consumption in India, achieving results comparable to traditional survey methods.
Details
Motivation: Traditional household water monitoring methods are costly and time-intensive, especially in rapidly urbanizing regions where timely data is crucial.Method: Four approaches compared: survey features (benchmark), CNN embeddings from satellite/GSV imagery, and GSV semantic maps with geospatial covariates like nightlight intensity and population density.
Result: GSV segmentation plus remote-sensing covariates achieved 0.55 accuracy for water use, approaching survey-based models (0.59 accuracy). High precision at consumption extremes but confusion in middle classes.
Conclusion: Open-access imagery with minimal geospatial data offers a promising alternative to surveys for reliable household water consumption estimates in urban analytics.
Abstract: Monitoring household water use in rapidly urbanizing regions is hampered by costly, time-intensive enumeration methods and surveys. We investigate whether publicly available imagery-satellite tiles, Google Street View (GSV) segmentation-and simple geospatial covariates (nightlight intensity, population density) can be utilized to predict household water consumption in Hubballi-Dharwad, India. We compare four approaches: survey features (benchmark), CNN embeddings (satellite, GSV, combined), and GSV semantic maps with auxiliary data. Under an ordinal classification framework, GSV segmentation plus remote-sensing covariates achieves 0.55 accuracy for water use, approaching survey-based models (0.59 accuracy). Error analysis shows high precision at extremes of the household water consumption distribution, but confusion among middle classes is due to overlapping visual proxies. We also compare and contrast our estimates for household water consumption to that of household subjective income. Our findings demonstrate that open-access imagery, coupled with minimal geospatial data, offers a promising alternative to obtaining reliable household water consumption estimates using surveys in urban analytics.
[266] Fine-Grained Iterative Adversarial Attacks with Limited Computation Budget
Zhichao Hou, Weizhi Gao, Xiaorui Liu
Main category: cs.LG
TL;DR: Proposes a fine-grained control mechanism for iterative adversarial attacks that selectively recomputes layer activations to maximize attack strength within constrained compute budgets.
Details
Motivation: Address the challenge of maximizing adversarial attack effectiveness under limited computational resources, where simply reducing iterations weakens attack performance.Method: Fine-grained control mechanism that selectively recomputes layer activations across both iteration-wise and layer-wise levels to optimize compute usage.
Result: Method consistently outperforms existing baselines at equal cost, and when integrated into adversarial training, achieves comparable performance with only 30% of original budget.
Conclusion: The proposed approach effectively maximizes adversarial attack strength within constrained compute budgets through selective activation recomputation.
Abstract: This work tackles a critical challenge in AI safety research under limited compute: given a fixed computation budget, how can one maximize the strength of iterative adversarial attacks? Coarsely reducing the number of attack iterations lowers cost but substantially weakens effectiveness. To fulfill the attainable attack efficacy within a constrained budget, we propose a fine-grained control mechanism that selectively recomputes layer activations across both iteration-wise and layer-wise levels. Extensive experiments show that our method consistently outperforms existing baselines at equal cost. Moreover, when integrated into adversarial training, it attains comparable performance with only 30% of the original budget.
[267] HADSF: Aspect Aware Semantic Control for Explainable Recommendation
Zheng Nie, Peijie Sun
Main category: cs.LG
TL;DR: HADSF is a two-stage framework that addresses LLM limitations in review-based recommendation by creating compact aspect vocabularies and performing constrained extraction, reducing hallucination and improving rating prediction.
Details
Motivation: Current LLM methods for review-based recommendation suffer from uncontrolled extraction producing noisy representations, lack principled hallucination metrics, and unexplored cost-quality trade-offs across model scales.Method: Two-stage approach: (1) adaptive selection for corpus-level aspect vocabulary, (2) vocabulary-guided constrained extraction of structured aspect-opinion triples. Introduces Aspect Drift Rate (ADR) and Opinion Fidelity Rate (OFR) metrics.
Result: Experiments on 3M reviews across 1.5B-70B parameter LLMs show consistent reductions in prediction error and enable smaller models to achieve competitive performance. Uncovers nonmonotonic relationship between hallucination severity and rating prediction error.
Conclusion: HADSF provides an effective framework for hallucination-aware LLM-enhanced explainable recommendation, with released code and metrics to support reproducible research.
Abstract: Recent advances in large language models (LLMs) promise more effective information extraction for review-based recommender systems, yet current methods still (i) mine free-form reviews without scope control, producing redundant and noisy representations, (ii) lack principled metrics that link LLM hallucination to downstream effectiveness, and (iii) leave the cost-quality trade-off across model scales largely unexplored. We address these gaps with the Hyper-Adaptive Dual-Stage Semantic Framework (HADSF), a two-stage approach that first induces a compact, corpus-level aspect vocabulary via adaptive selection and then performs vocabulary-guided, explicitly constrained extraction of structured aspect-opinion triples. To assess the fidelity of the resulting representations, we introduce Aspect Drift Rate (ADR) and Opinion Fidelity Rate (OFR) and empirically uncover a nonmonotonic relationship between hallucination severity and rating prediction error. Experiments on approximately 3 million reviews across LLMs spanning 1.5B-70B parameters show that, when integrated into standard rating predictors, HADSF yields consistent reductions in prediction error and enables smaller models to achieve competitive performance in representative deployment scenarios. We release code, data pipelines, and metric implementations to support reproducible research on hallucination-aware, LLM-enhanced explainable recommendation. Code is available at https://github.com/niez233/HADSF
[268] Gradient Descent as Loss Landscape Navigation: a Normative Framework for Deriving Learning Rules
John J. Vastola, Samuel J. Gershman, Kanaka Rajan
Main category: cs.LG
TL;DR: The paper proposes a theoretical framework that treats learning rules as policies for navigating loss landscapes, identifying optimal rules as solutions to optimal control problems.
Details
Motivation: To understand why some learning rules work better than others and under what assumptions they can be considered optimal, rather than just assuming them.Method: Casts learning rules as policies for navigating (partially observable) loss landscapes and identifies optimal rules as solutions to an associated optimal control problem.
Result: Shows that various well-known learning rules emerge naturally under different assumptions: gradient descent, momentum, natural gradients, non-gradient rules, and adaptive optimizers like Adam. Also explains continual learning strategies as optimal responses to task uncertainty.
Conclusion: The framework unifies diverse learning phenomena under a single objective, clarifies the computational structure of learning, and provides a principled foundation for designing adaptive algorithms.
Abstract: Learning rules – prescriptions for updating model parameters to improve performance – are typically assumed rather than derived. Why do some learning rules work better than others, and under what assumptions can a given rule be considered optimal? We propose a theoretical framework that casts learning rules as policies for navigating (partially observable) loss landscapes, and identifies optimal rules as solutions to an associated optimal control problem. A range of well-known rules emerge naturally within this framework under different assumptions: gradient descent from short-horizon optimization, momentum from longer-horizon planning, natural gradients from accounting for parameter space geometry, non-gradient rules from partial controllability, and adaptive optimizers like Adam from online Bayesian inference of loss landscape shape. We further show that continual learning strategies like weight resetting can be understood as optimal responses to task uncertainty. By unifying these phenomena under a single objective, our framework clarifies the computational structure of learning and offers a principled foundation for designing adaptive algorithms.
[269] A Framework for Fair Evaluation of Variance-Aware Bandit Algorithms
Elise Wolf
Main category: cs.LG
TL;DR: This paper presents a reproducible evaluation framework for comparing multi-armed bandit algorithms, showing that variance-aware methods excel in high-uncertainty environments with subtle reward differences, while classical algorithms perform better in separable scenarios or with extensive tuning.
Details
Motivation: Evaluating and comparing MAB algorithms is challenging due to lack of standardized conditions and replicability, especially for variance-aware extensions whose performance heavily depends on the underlying environment.Method: Developed a reproducible evaluation framework (Bandit Playground codebase) with clearly defined experimental setups, multiple performance metrics (reward, regret, reward distribution, value-at-risk, action optimality), and an interactive evaluation interface for systematic comparison of eight classical and variance-aware MAB algorithms.
Result: Variance-aware algorithms offer advantages in high-uncertainty settings with subtle differences between arm rewards, while classical algorithms perform equally well or better in more separable scenarios or with extensive fine-tuning.
Conclusion: The paper provides both a framework for systematic MAB algorithm evaluation and insights into when variance-aware approaches outperform classical counterparts, addressing the challenge of reliable performance comparison in multi-armed bandit research.
Abstract: Multi-armed bandit (MAB) problems serve as a fundamental building block for more complex reinforcement learning algorithms. However, evaluating and comparing MAB algorithms remains challenging due to the lack of standardized conditions and replicability. This is particularly problematic for variance-aware extensions of classical methods like UCB, whose performance can heavily depend on the underlying environment. In this study, we address how performance differences between bandit algorithms can be reliably observed, and under what conditions variance-aware algorithms outperform classical ones. We present a reproducible evaluation designed to systematically compare eight classical and variance-aware MAB algorithms. The evaluation framework, implemented in our Bandit Playground codebase, features clearly defined experimental setups, multiple performance metrics (reward, regret, reward distribution, value-at-risk, and action optimality), and an interactive evaluation interface that supports consistent and transparent analysis. We show that variance-aware algorithms can offer advantages in settings with high uncertainty where the difficulty arises from subtle differences between arm rewards. In contrast, classical algorithms often perform equally well or better in more separable scenarios or if fine-tuned extensively. Our contributions are twofold: (1) a framework for systematic evaluation of MAB algorithms, and (2) insights into the conditions under which variance-aware approaches outperform their classical counterparts.
[270] Jasmine: A Simple, Performant and Scalable JAX-based World Modeling Codebase
Mihir Mahajan, Alfred Nguyen, Franz Srambical, Stefan Bauer
Main category: cs.LG
TL;DR: Jasmine is a high-performance JAX-based world modeling codebase that enables scalable training from single hosts to hundreds of accelerators with minimal code changes, achieving 10x faster reproduction of CoinRun case study.
Details
Motivation: To address the nascent state of open training infrastructure for world modeling and overcome data scarcity in domains like robotics by providing performant and scalable training tools.Method: Developed a JAX-based world modeling codebase with performance optimizations across data loading, training and checkpointing, supporting diverse sharding configurations and guaranteeing fully reproducible training.
Result: Achieved order-of-magnitude faster reproduction of CoinRun case study compared to prior open implementations, and established infrastructure for rigorous benchmarking pipelines across model families and architectural ablations.
Conclusion: Jasmine provides a scalable and performant foundation for world modeling research, enabling efficient training and benchmarking through optimized infrastructure paired with curated large-scale datasets.
Abstract: While world models are increasingly positioned as a pathway to overcoming data scarcity in domains such as robotics, open training infrastructure for world modeling remains nascent. We introduce Jasmine, a performant JAX-based world modeling codebase that scales from single hosts to hundreds of accelerators with minimal code changes. Jasmine achieves an order-of-magnitude faster reproduction of the CoinRun case study compared to prior open implementations, enabled by performance optimizations across data loading, training and checkpointing. The codebase guarantees fully reproducible training and supports diverse sharding configurations. By pairing Jasmine with curated large-scale datasets, we establish infrastructure for rigorous benchmarking pipelines across model families and architectural ablations.
[271] Mixture-of-Transformers Learn Faster: A Theoretical Study on Classification Problems
Hongbo Li, Qinhang Wu, Sen Lin, Yingbin Liang, Ness B. Shroff
Main category: cs.LG
TL;DR: Mixture-of-Transformers (MoT) provides a theoretical framework for transformer-level expert specialization with continuous gating network training, achieving faster convergence than single transformers.
Details
Motivation: Mixture-of-Experts models improve transformer efficiency but lack unified theoretical explanation, especially when both feed-forward and attention layers can specialize.Method: Developed a three-stage training algorithm with continuous gating network training, allowing transformer blocks to act as experts that specialize in distinct task classes.
Result: MoT achieves near-zero expected prediction loss in O(log(ε⁻¹)) iterations, significantly faster than O(ε⁻¹) for single transformers, with experts specializing in distinct tasks and accurate routing.
Conclusion: Provides the first unified theoretical account of transformer-level specialization and learning dynamics, offering practical guidance for efficient large-scale model design.
Abstract: Mixture-of-Experts (MoE) models improve transformer efficiency but lack a unified theoretical explanation, especially when both feed-forward and attention layers are allowed to specialize. To this end, we study the Mixture-of-Transformers (MoT), a tractable theoretical framework in which each transformer block acts as an expert governed by a continuously trained gating network. This design allows us to isolate and study the core learning dynamics of expert specialization and attention alignment. In particular, we develop a three-stage training algorithm with continuous training of the gating network, and show that each transformer expert specializes in a distinct class of tasks and that the gating network accurately routes data samples to the correct expert. Our analysis shows how expert specialization reduces gradient conflicts and makes each subtask strongly convex. We prove that the training drives the expected prediction loss to near zero in $O(\log(\epsilon^{-1}))$ iteration steps, significantly improving over the $O(\epsilon^{-1})$ rate for a single transformer. We further validate our theoretical findings through extensive real-data experiments, demonstrating the practical effectiveness of MoT. Together, these results offer the first unified theoretical account of transformer-level specialization and learning dynamics, providing practical guidance for designing efficient large-scale models.
[272] Enhancing Sentiment Classification with Machine Learning and Combinatorial Fusion
Sean Patten, Pin-Yu Chen, Christina Schweikert, D. Frank Hsu
Main category: cs.LG
TL;DR: Novel sentiment classification using Combinatorial Fusion Analysis (CFA) to ensemble diverse ML models, achieving 97.072% accuracy on IMDB dataset by leveraging cognitive diversity through rank-score functions.
Details
Motivation: To improve sentiment classification accuracy while being computationally efficient, moving beyond simply scaling individual model sizes by strategically combining diverse models.Method: Uses Combinatorial Fusion Analysis (CFA) with rank-score characteristic functions to quantify model dissimilarity and strategically combine predictions from RoBERTa transformer with traditional ML models (Random Forest, SVM, XGBoost).
Result: Achieved state-of-the-art 97.072% accuracy on IMDB sentiment analysis dataset, outperforming traditional ensemble methods through effective computation and use of model diversity.
Conclusion: CFA provides an efficient and effective approach for sentiment classification by strategically leveraging cognitive diversity in model ensembles, offering superior performance compared to traditional ensemble methods.
Abstract: This paper presents a novel approach to sentiment classification using the application of Combinatorial Fusion Analysis (CFA) to integrate an ensemble of diverse machine learning models, achieving state-of-the-art accuracy on the IMDB sentiment analysis dataset of 97.072%. CFA leverages the concept of cognitive diversity, which utilizes rank-score characteristic functions to quantify the dissimilarity between models and strategically combine their predictions. This is in contrast to the common process of scaling the size of individual models, and thus is comparatively efficient in computing resource use. Experimental results also indicate that CFA outperforms traditional ensemble methods by effectively computing and employing model diversity. The approach in this paper implements the combination of a transformer-based model of the RoBERTa architecture with traditional machine learning models, including Random Forest, SVM, and XGBoost.
[273] Quantitative Bounds for Length Generalization in Transformers
Zachary Izzo, Eshaan Nichani, Jason D. Lee
Main category: cs.LG
TL;DR: This paper provides the first quantitative bounds on required training sequence length for transformers to achieve length generalization, analyzing different problem settings including error control types, attention precision, and transformer depth.
Details
Motivation: Prior work established transformers eventually achieve length generalization but didn't quantify the required training length threshold. This work aims to provide concrete bounds on how much training data is needed.Method: Analyzed length generalization in multiple settings: ℓ∞ vs average error control, infinite-precision softmax vs finite-precision attention, and one-layer vs two-layer transformers. Proved LG occurs when transformer behavior on longer sequences can be simulated by behavior on shorter training sequences.
Result: Provided quantitative bounds for training length requirements across different scenarios. Verified insights empirically. Showed that richer training data is needed for generalization on more complex tasks.
Conclusion: The results sharpen theoretical understanding of extrapolation mechanisms in transformers and formalize the intuition that more complex tasks require more extensive training data for effective length generalization.
Abstract: We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large it must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be “simulated” by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the length of training data required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.
[274] Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
Md Tanvirul Alam, Nidhi Rastogi
Main category: cs.LG
TL;DR: RLVR improves evaluation metrics on combinatorial problems but often reinforces superficial heuristics rather than genuine reasoning strategies, highlighting limits in generalization.
Details
Motivation: To investigate whether Reinforcement Learning with Verifiable Rewards (RLVR) actually fosters genuine mathematical reasoning in LLMs, beyond just improving evaluation metrics.Method: Tested RLVR on two combinatorial problems (Activity Scheduling and Longest Increasing Subsequence) with fully verifiable solutions, using curated datasets with unique optima and multiple reward designs.
Result: RLVR improves evaluation metrics but often achieves this by reinforcing superficial heuristics rather than acquiring new reasoning strategies, showing limited generalization.
Conclusion: Benchmarks are needed that can disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress in reasoning capabilities.
Abstract: Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: \emph{Activity Scheduling} and the \emph{Longest Increasing Subsequence}, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at https://github.com/xashru/rlvr-seq-generalization.
[275] Consistency Training Helps Stop Sycophancy and Jailbreaks
Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah
Main category: cs.LG
TL;DR: Consistency training teaches LLMs to be invariant to irrelevant prompt cues like leading questions or jailbreak text, reducing sycophancy and jailbreaking susceptibility.
Details
Motivation: LLMs' factuality and refusal training can be compromised by simple prompt changes, leading to sycophancy (adopting user beliefs) or jailbreaking (satisfying inappropriate requests wrapped in special text).Method: Two consistency training approaches: Bias-augmented Consistency Training (BCT) enforces invariance over model outputs, and Activation Consistency Training (ACT) enforces invariance over internal activations. Both use model’s own responses as training data.
Result: Both methods reduce Gemini 2.5 Flash’s susceptibility to irrelevant cues. BCT and ACT reduce sycophancy equally well, but BCT performs better at jailbreak reduction. Methods avoid stale training data issues.
Conclusion: Some alignment problems are better viewed as consistency issues rather than optimal response problems. BCT can simplify training pipelines by removing reliance on static datasets.
Abstract: An LLM’s factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore \emph{consistency training}, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model’s external outputs (\emph{Bias-augmented Consistency Training} (BCT) from Chua et al. [2025]) and over its internal activations (\emph{Activation Consistency Training} (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash’s susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
[276] Towards a Measure of Algorithm Similarity
Shairoz Sohail, Taher Ali
Main category: cs.LG
TL;DR: EMOC framework embeds algorithm implementations into feature space for similarity assessment, supporting clustering, classification, and diversity quantification.
Details
Motivation: Need for pragmatic similarity metrics in applications like clone detection and program synthesis, as general similarity determination is uncomputable.Method: EMOC (Evaluation-Memory-Operations-Complexity) framework that embeds algorithms into feature space, tested on PACD dataset of verified Python implementations.
Result: EMOC features support clustering/classification of algorithm types, near-duplicate detection, and diversity quantification in LLM-generated programs.
Conclusion: EMOC provides practical similarity metrics for algorithms, with released code/data to facilitate reproducibility and future work.
Abstract: Given two algorithms for the same problem, can we determine whether they are meaningfully different? In full generality, the question is uncomputable, and empirically it is muddied by competing notions of similarity. Yet, in many applications (such as clone detection or program synthesis) a pragmatic and consistent similarity metric is necessary. We review existing equivalence and similarity notions and introduce EMOC: An Evaluation-Memory-Operations-Complexity framework that embeds algorithm implementations into a feature space suitable for downstream tasks. We compile PACD, a curated dataset of verified Python implementations across three problems, and show that EMOC features support clustering and classification of algorithm types, detection of near-duplicates, and quantification of diversity in LLM-generated programs. Code, data, and utilities for computing EMOC embeddings are released to facilitate reproducibility and future work on algorithm similarity.
[277] MLPerf Automotive
Radoyeh Shojaei, Predrag Djurdjevic, Mostafa El-Khamy, James Goel, Kasper Mecklenburg, John Owens, Pınar Muyan-Özçelik, Tom St. John, Jinho Suh, Arjun Suresh
Main category: cs.LG
TL;DR: MLPerf Automotive is the first standardized public benchmark for evaluating ML systems in automotive applications, addressing unique constraints like safety and real-time processing through latency and accuracy metrics.
Details
Motivation: Existing benchmarks cannot be used for automotive ML systems due to unique constraints including safety requirements and real-time processing needs that distinguish automotive workloads from other domains.Method: Developed through collaboration between MLCommons and AVCC, the benchmark provides latency and accuracy metrics with evaluation protocols for consistent performance comparisons across hardware and software platforms. It includes automotive perception tasks: 2D object detection, 2D semantic segmentation, and 3D object detection.
Result: The benchmark framework enables consistent and reproducible performance comparisons across different hardware platforms and software implementations. The first iteration focuses on automotive perception tasks with available reference implementations.
Conclusion: MLPerf Automotive establishes the first standardized benchmark for automotive ML systems, addressing the gap in evaluation methodologies for safety-critical, real-time automotive workloads and providing a foundation for future development in this domain.
Abstract: We present MLPerf Automotive, the first standardized public benchmark for evaluating Machine Learning systems that are deployed for AI acceleration in automotive systems. Developed through a collaborative partnership between MLCommons and the Autonomous Vehicle Computing Consortium, this benchmark addresses the need for standardized performance evaluation methodologies in automotive machine learning systems. Existing benchmark suites cannot be utilized for these systems since automotive workloads have unique constraints including safety and real-time processing that distinguish them from the domains that previously introduced benchmarks target. Our benchmarking framework provides latency and accuracy metrics along with evaluation protocols that enable consistent and reproducible performance comparisons across different hardware platforms and software implementations. The first iteration of the benchmark consists of automotive perception tasks in 2D object detection, 2D semantic segmentation, and 3D object detection. We describe the methodology behind the benchmark design including the task selection, reference models, and submission rules. We also discuss the first round of benchmark submissions and the challenges involved in acquiring the datasets and the engineering efforts to develop the reference implementations. Our benchmark code is available at https://github.com/mlcommons/mlperf_automotive.
[278] Towards Understanding Self-play for LLM Reasoning
Justin Yang Chae, Md Tanvirul Alam, Nidhi Rastogi
Main category: cs.LG
TL;DR: Analysis of self-play training dynamics in LLM reasoning, comparing it with RLVR and SFT methods through parameter updates, entropy dynamics, and alternative reward functions.
Details
Motivation: To understand the mechanisms behind self-play improvements in LLM reasoning, which remain poorly understood despite showing strong in-domain and out-of-domain gains.Method: Analyzed training dynamics through parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions, using pass@k evaluations to connect dynamics to reasoning performance.
Result: Findings clarify how self-play differs from other post-training strategies and highlight its inherent limitations.
Conclusion: Points toward future directions for improving LLM math reasoning through self-play by better understanding its training dynamics.
Abstract: Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
[279] Functional embeddings enable Aggregation of multi-area SEEG recordings over subjects and sessions
Sina Javadzadeh, Rahil Soroushmojdehi, S. Alireza Seyyed Mousavi, Mehrnaz Asadi, Sumiko Abe, Terence D. Sanger
Main category: cs.LG
TL;DR: A framework for cross-subject aggregation of intracranial recordings using functional embeddings and transformers, enabling accurate region clustering and masked channel reconstruction without requiring uniform electrode placement.
Details
Motivation: Aggregating intracranial recordings across subjects is challenging due to varying electrode count, placement, and covered regions. Spatial normalization methods like MNI coordinates often fail to capture true functional similarity, as even at matched anatomical coordinates, neural dynamics can differ substantially between individuals.Method: A scalable representation-learning framework with: (i) Siamese encoder with contrastive objectives to learn subject-agnostic functional embeddings that are locality-sensitive to region-specific neural signatures, and (ii) transformer that tokenizes these embeddings to model inter-regional relationships with variable channel counts.
Result: The learned functional space supports accurate within-subject discrimination, forms clear region-consistent clusters, and transfers zero-shot to unseen channels. The transformer captures cross-region dependencies and enables reconstruction of masked channels without subject-specific supervision.
Conclusion: This approach provides a path toward large-scale, cross-subject aggregation and pretraining for intracranial neural data where strict task structure and uniform sensor placement are unavailable.
Abstract: Aggregating intracranial recordings across subjects is challenging since electrode count, placement, and covered regions vary widely. Spatial normalization methods like MNI coordinates offer a shared anatomical reference, but often fail to capture true functional similarity, particularly when localization is imprecise; even at matched anatomical coordinates, the targeted brain region and underlying neural dynamics can differ substantially between individuals. We propose a scalable representation-learning framework that (i) learns a subject-agnostic functional identity for each electrode from multi-region local field potentials using a Siamese encoder with contrastive objectives, inducing an embedding geometry that is locality-sensitive to region-specific neural signatures, and (ii) tokenizes these embeddings for a transformer that models inter-regional relationships with a variable number of channels. We evaluate this framework on a 20-subject dataset spanning basal ganglia-thalamic regions collected during flexible rest/movement recording sessions with heterogeneous electrode layouts. The learned functional space supports accurate within-subject discrimination and forms clear, region-consistent clusters; it transfers zero-shot to unseen channels. The transformer, operating on functional tokens without subject-specific heads or supervision, captures cross-region dependencies and enables reconstruction of masked channels, providing a subject-agnostic backbone for downstream decoding. Together, these results indicate a path toward large-scale, cross-subject aggregation and pretraining for intracranial neural data where strict task structure and uniform sensor placement are unavailable.
[280] QiNN-QJ: A Quantum-inspired Neural Network with Quantum Jump for Multimodal Sentiment Analysis
Yiwei Chen, Kehuan Yan, Yu Pan, Daoyi Dong
Main category: cs.LG
TL;DR: A quantum-inspired neural network with quantum jump (QiNN-QJ) for multimodal entanglement modeling that uses differentiable quantum jump operators to create controllable cross-modal entanglement with dissipative dynamics, achieving state-of-the-art performance on benchmark datasets.
Details
Motivation: Existing quantum-inspired fusion models rely solely on unitary transformations for entanglement generation, which suffer from training instability and limited generalizability. The authors aim to develop a more stable and controllable approach to multimodal entanglement modeling.Method: Each modality is encoded as a quantum pure state, then a differentiable module simulating quantum jump operators transforms separable product states into entangled representations. The model jointly learns Hamiltonian and Lindblad operators to generate controllable cross-modal entanglement with dissipative dynamics, using structured stochasticity and steady-state attractor properties for stability.
Result: QiNN-QJ achieves superior performance over state-of-the-art models on benchmark datasets including CMU-MOSI, CMU-MOSEI, and CH-SIMS. It also provides enhanced post-hoc interpretability through von-Neumann entanglement entropy analysis.
Conclusion: The work establishes a principled framework for entangled multimodal fusion and paves the way for quantum-inspired approaches in modeling complex cross-modal correlations, offering both improved performance and interpretability.
Abstract: Quantum theory provides non-classical principles, such as superposition and entanglement, that inspires promising paradigms in machine learning. However, most existing quantum-inspired fusion models rely solely on unitary or unitary-like transformations to generate quantum entanglement. While theoretically expressive, such approaches often suffer from training instability and limited generalizability. In this work, we propose a Quantum-inspired Neural Network with Quantum Jump (QiNN-QJ) for multimodal entanglement modelling. Each modality is firstly encoded as a quantum pure state, after which a differentiable module simulating the QJ operator transforms the separable product state into the entangled representation. By jointly learning Hamiltonian and Lindblad operators, QiNN-QJ generates controllable cross-modal entanglement among modalities with dissipative dynamics, where structured stochasticity and steady-state attractor properties serve to stabilize training and constrain entanglement shaping. The resulting entangled states are projected onto trainable measurement vectors to produce predictions. In addition to achieving superior performance over the state-of-the-art models on benchmark datasets, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, QiNN-QJ facilitates enhanced post-hoc interpretability through von-Neumann entanglement entropy. This work establishes a principled framework for entangled multimodal fusion and paves the way for quantum-inspired approaches in modelling complex cross-modal correlations.
[281] Hierarchical Bayesian Model for Gene Deconvolution and Functional Analysis in Human Endometrium Across the Menstrual Cycle
Crystal Su, Kuai Yu, Mingyuan Shao, Daniel Bauer
Main category: cs.LG
TL;DR: A probabilistic hierarchical Bayesian model for deconvolving bulk RNA-seq data into cell-type expression profiles and proportions using single-cell reference data, applied to human endometrial tissue across menstrual cycle phases.
Details
Motivation: Bulk tissue RNA sequencing obscures cell type-specific dynamics in heterogeneous samples, particularly in contexts like the menstrual cycle where dramatic hormone-driven cellular composition changes occur.Method: Developed a probabilistic hierarchical Bayesian model that leverages high-resolution single-cell reference data to infer cell type proportions and cell-specific gene expression changes across biological conditions.
Result: Revealed dynamic shifts in epithelial, stromal, and immune cell fractions between menstrual phases, and identified cell-type-specific differential gene expression (e.g., decidualization markers in stromal cells during secretory phase). The model showed robustness to reference mismatches and noise.
Conclusion: The Bayesian approach provides principled inference of cellular dynamics in heterogeneous tissues, with potential clinical implications for fertility and endometrial disorders, and future integration with spatial transcriptomics.
Abstract: Bulk tissue RNA sequencing of heterogeneous samples provides averaged gene expression profiles, obscuring cell type-specific dynamics. To address this, we present a probabilistic hierarchical Bayesian model that deconvolves bulk RNA-seq data into constituent cell-type expression profiles and proportions, leveraging a high-resolution single-cell reference. We apply our model to human endometrial tissue across the menstrual cycle, a context characterized by dramatic hormone-driven cellular composition changes. Our extended framework provides a principled inference of cell type proportions and cell-specific gene expression changes across cycle phases. We demonstrate the model’s structure, priors, and inference strategy in detail, and we validate its performance with simulations and comparisons to existing methods. The results reveal dynamic shifts in epithelial, stromal, and immune cell fractions between menstrual phases, and identify cell-type-specific differential gene expression associated with endometrial function (e.g., decidualization markers in stromal cells during the secretory phase). We further conduct robustness tests and show that our Bayesian approach is resilient to reference mismatches and noise. Finally, we discuss the biological significance of our findings, potential clinical implications for fertility and endometrial disorders, and future directions, including integration of spatial transcriptomics.
[282] Group-Sensitive Offline Contextual Bandits
Yihong Guo, Junjie Luo, Guodong Gao, Ritu Agarwal, Anqi Liu
Main category: cs.LG
TL;DR: Proposes a constrained offline policy optimization framework for contextual bandits that reduces group-wise reward disparities while maintaining competitive overall performance.
Details
Motivation: Offline policy optimization in contextual bandits can unintentionally amplify reward disparities across groups, raising fairness concerns when resources are limited.Method: Constrained offline policy optimization framework with group-wise reward disparity constraints using doubly robust estimator for improved estimation and convergence guarantee.
Result: Empirical results show effective reduction of reward disparities while maintaining competitive overall performance on synthetic and real-world datasets.
Conclusion: The proposed method successfully addresses group-sensitive fairness in offline contextual bandits by constraining reward disparities during policy optimization.
Abstract: Offline contextual bandits allow one to learn policies from historical/offline data without requiring online interaction. However, offline policy optimization that maximizes overall expected rewards can unintentionally amplify the reward disparities across groups. As a result, some groups might benefit more than others from the learned policy, raising concerns about fairness, especially when the resources are limited. In this paper, we study a group-sensitive fairness constraint in offline contextual bandits, reducing group-wise reward disparities that may arise during policy learning. We tackle the following common-parity requirements: the reward disparity is constrained within some user-defined threshold or the reward disparity should be minimized during policy optimization. We propose a constrained offline policy optimization framework by introducing group-wise reward disparity constraints into an off-policy gradient-based optimization procedure. To improve the estimation of the group-wise reward disparity during training, we employ a doubly robust estimator and further provide a convergence guarantee for policy optimization. Empirical results in synthetic and real-world datasets demonstrate that our method effectively reduces reward disparities while maintaining competitive overall performance.
[283] AI Agents in Drug Discovery
Srijit Seal, Dinh Long Huynh, Moudather Chelbi, Sara Khosravi, Ankur Kumar, Mattson Thieme, Isaac Wilks, Mark Davies, Jessica Mustali, Yannick Sun, Nick Edwards, Daniil Boiko, Andrei Tyrin, Douglas W. Selinger, Ayaan Parikh, Rahul Vijayan, Shoman Kasbekar, Dylan Reid, Andreas Bender, Ola Spjuth
Main category: cs.LG
TL;DR: Agentic AI systems using LLMs with specialized tools are transforming drug discovery by autonomously executing complex workflows, compressing months-long processes into hours while maintaining scientific rigor.
Details
Motivation: To demonstrate how AI agents can revolutionize drug discovery by integrating diverse biomedical data, executing experiments, and iteratively refining hypotheses in closed-loop systems.Method: Uses large language models (LLMs) coupled with perception, computation, action, and memory tools to create agentic AI architectures including ReAct, Reflection, Supervisor, and Swarm systems.
Result: Early implementations show substantial gains in speed, reproducibility, and scalability, compressing workflows from months to hours while maintaining scientific traceability. Real-world deployments demonstrate quantifiable impacts in operational drug discovery settings.
Conclusion: Agentic AI represents a transformative approach for drug discovery, though challenges remain in data heterogeneity, system reliability, privacy, and benchmarking. Future directions focus on supporting science and translation.
Abstract: Artificial intelligence (AI) agents are emerging as transformative tools in drug discovery, with the ability to autonomously reason, act, and learn through complicated research workflows. Building on large language models (LLMs) coupled with perception, computation, action, and memory tools, these agentic AI systems could integrate diverse biomedical data, execute tasks, carry out experiments via robotic platforms, and iteratively refine hypotheses in closed loops. We provide a conceptual and technical overview of agentic AI architectures, ranging from ReAct and Reflection to Supervisor and Swarm systems, and illustrate their applications across key stages of drug discovery, including literature synthesis, toxicity prediction, automated protocol generation, small-molecule synthesis, drug repurposing, and end-to-end decision-making. To our knowledge, this represents the first comprehensive work to present real-world implementations and quantifiable impacts of agentic AI systems deployed in operational drug discovery settings. Early implementations demonstrate substantial gains in speed, reproducibility, and scalability, compressing workflows that once took months into hours while maintaining scientific traceability. We discuss the current challenges related to data heterogeneity, system reliability, privacy, and benchmarking, and outline future directions towards technology in support of science and translation.
[284] Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring
Hong Jiao, Hanna Choi, Haowei Hua
Main category: cs.LG
TL;DR: This study compares essay-based vs rationale-based automated scoring using GPT models, finding essay-based scoring generally performs better but rationale-based scoring improves accuracy for underrepresented score 0. Ensemble modeling combining both approaches achieves the best results.
Details
Motivation: To explore the utilities of GPT-generated rationales in automated essay scoring and compare their effectiveness against traditional essay-based scoring methods.Method: Used GPT-4.1 and GPT-5 to generate rationales for Prompt 6 essays from 2012 Kaggle ASAP data, comparing essay-based scoring with rationale-based scoring, and implementing ensemble modeling approaches.
Result: Essay-based scoring generally performed better with higher QWK, but rationale-based scoring improved F1 scores for underrepresented score 0. Ensemble modeling combining essay-based and both rationale-based scorings achieved the best QWK of 0.870.
Conclusion: Rationale-based scoring complements essay-based scoring, particularly for underrepresented classes, and ensemble approaches combining both methods yield superior automated scoring performance.
Abstract: This study explored the utilities of rationales generated by GPT-4.1 and GPT-5 in automated scoring using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. The study found in general essay-based scoring performed better than rationale-based scoring with higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring led to higher scoring accuracy in terms of F1 scores for score 0 which had less representation due to class imbalance issues. The ensemble modeling of essay-based scoring models increased the scoring accuracy at both specific score levels and across all score levels. The ensemble modeling of essay-based scoring and each of the rationale-based scoring performed about the same. Further ensemble of essay-based scoring and both rationale-based scoring yielded the best scoring accuracy with QWK of 0.870 compared with 0.848 reported in literature.
[285] FairAD: Computationally Efficient Fair Graph Clustering via Algebraic Distance
Minh Phu Vuong, Young-Ju Lee, Iván Ojeda-Ruiz, Chul-Ho Lee
Main category: cs.LG
TL;DR: FairAD is an efficient fair graph clustering method that uses algebraic distance-based affinity matrices and graph coarsening to achieve fairness while being up to 40x faster than existing methods.
Details
Motivation: Growing concerns about machine learning model biases toward demographic groups motivate the study of fairness in graph clustering, but existing methods face computational challenges especially for large graphs.Method: Constructs affinity matrix using algebraic distance to impose fairness constraints, performs graph coarsening to find representative nodes, and solves constrained minimization for fair clustering solution.
Result: Experiments on modified stochastic block model and six public datasets show FairAD achieves fair clustering while being up to 40 times faster than state-of-the-art fair graph clustering algorithms.
Conclusion: FairAD provides a computationally efficient solution for fair graph clustering that maintains fairness constraints while significantly improving speed over existing methods.
Abstract: Due to the growing concern about unsavory behaviors of machine learning models toward certain demographic groups, the notion of ‘fairness’ has recently drawn much attention from the community, thereby motivating the study of fairness in graph clustering. Fair graph clustering aims to partition the set of nodes in a graph into $k$ disjoint clusters such that the proportion of each protected group within each cluster is consistent with the proportion of that group in the entire dataset. It is, however, computationally challenging to incorporate fairness constraints into existing graph clustering algorithms, particularly for large graphs. To address this problem, we propose FairAD, a computationally efficient fair graph clustering method. It first constructs a new affinity matrix based on the notion of algebraic distance such that fairness constraints are imposed. A graph coarsening process is then performed on this affinity matrix to find representative nodes that correspond to $k$ clusters. Finally, a constrained minimization problem is solved to obtain the solution of fair clustering. Experiment results on the modified stochastic block model and six public datasets show that FairAD can achieve fair clustering while being up to 40 times faster compared to state-of-the-art fair graph clustering algorithms.
[286] Relation-Aware Bayesian Optimization of DBMS Configurations Guided by Affinity Scores
Sein Kwon, Seulgi Baek, Hyunseo Yang, Youngwan Jo, Sanghyun Park
Main category: cs.LG
TL;DR: RelTune is a novel DBMS configuration tuning framework that models parameter dependencies using relational graphs and GNN embeddings, and introduces Hybrid-Score-Guided Bayesian Optimization for more efficient optimization.
Details
Motivation: Existing DBMS tuning approaches ignore parameter dependencies, assume independent parameters, and use simplified top-k parameter selection, while Bayesian Optimization suffers from unstable surrogate models and inefficient exploration.Method: RelTune represents parameter dependencies as a Relational Graph, learns GNN-based latent embeddings, and introduces Hybrid-Score-Guided Bayesian Optimization (HBO) that combines surrogate predictions with Affinity Score measuring proximity to high-performing configurations.
Result: Experimental results on multiple DBMSs and workloads show RelTune achieves faster convergence and higher optimization efficiency than conventional BO-based methods, achieving state-of-the-art performance across all evaluated scenarios.
Conclusion: RelTune effectively addresses limitations of existing DBMS tuning methods by modeling parameter dependencies and introducing hybrid optimization, demonstrating superior performance and efficiency.
Abstract: Database Management Systems (DBMSs) are fundamental for managing large-scale and heterogeneous data, and their performance is critically influenced by configuration parameters. Effective tuning of these parameters is essential for adapting to diverse workloads and maximizing throughput while minimizing latency. Recent research has focused on automated configuration optimization using machine learning; however, existing approaches still exhibit several key limitations. Most tuning frameworks disregard the dependencies among parameters, assuming that each operates independently. This simplification prevents optimizers from leveraging relational effects across parameters, limiting their capacity to capture performancesensitive interactions. Moreover, to reduce the complexity of the high-dimensional search space, prior work often selects only the top few parameters for optimization, overlooking others that contribute meaningfully to performance. Bayesian Optimization (BO), the most common method for automatic tuning, is also constrained by its reliance on surrogate models, which can lead to unstable predictions and inefficient exploration. To overcome these limitations, we propose RelTune, a novel framework that represents parameter dependencies as a Relational Graph and learns GNN-based latent embeddings that encode performancerelevant semantics. RelTune further introduces Hybrid-Score-Guided Bayesian Optimization (HBO), which combines surrogate predictions with an Affinity Score measuring proximity to previously high-performing configurations. Experimental results on multiple DBMSs and workloads demonstrate that RelTune achieves faster convergence and higher optimization efficiency than conventional BO-based methods, achieving state-of-the-art performance across all evaluated scenarios.
[287] Exploring Landscapes for Better Minima along Valleys
Tong Zhao, Jiacheng Li, Yuanchang Zhou, Guangming Tan, Weile Jia
Main category: cs.LG
TL;DR: Proposes an adaptor ‘E’ for gradient-based optimizers that continues exploring landscape valleys after reaching local minima to find lower and flatter minima for better generalization.
Details
Motivation: Most optimizers stop at local minima, but complex loss landscapes make it difficult to guarantee these are the best minima for generalization. Need to continue searching for potentially better minima.Method: Developed an adaptor ‘E’ that modifies gradient-based optimizers to continue exploring along landscape valleys (low-loss areas) even after reaching local minima. Provides convergence proofs for both convex and non-convex cases.
Result: Adapted Lamb optimizer (ALTO) improved test accuracy by average 2.5% across various large-batch training tasks compared to state-of-the-art optimizer.
Conclusion: The approach successfully finds lower and flatter minima leading to better generalization, opening new research directions in optimization algorithm design.
Abstract: Finding lower and better-generalizing minima is crucial for deep learning. However, most existing optimizers stop searching the parameter space once they reach a local minimum. Given the complex geometric properties of the loss landscape, it is difficult to guarantee that such a point is the lowest or provides the best generalization. To address this, we propose an adaptor “E” for gradient-based optimizers. The adapted optimizer tends to continue exploring along landscape valleys (areas with low and nearly identical losses) in order to search for potentially better local minima even after reaching a local minimum. This approach increases the likelihood of finding a lower and flatter local minimum, which is often associated with better generalization. We also provide a proof of convergence for the adapted optimizers in both convex and non-convex scenarios for completeness. Finally, we demonstrate their effectiveness in an important but notoriously difficult training scenario, large-batch training, where Lamb is the benchmark optimizer. Our testing results show that the adapted Lamb, ALTO, increases the test accuracy (generalization) of the current state-of-the-art optimizer by an average of 2.5% across a variety of large-batch training tasks. This work potentially opens a new research direction in the design of optimization algorithms.
[288] Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
Zixuan Hu, Li Shen, Zhenyi Wang, Yongxian Wei, Dacheng Tao
Main category: cs.LG
TL;DR: Bayesian Data Scheduler (BDS) is an adaptive defense strategy against harmful fine-tuning of LLMs that uses Bayesian inference to weight data safety without needing attack simulation, achieving state-of-the-art performance.
Details
Motivation: Existing defense strategies for harmful fine-tuning rely on attack simulation, which is limited by inability to anticipate unknown attacks and adapt to varying attack settings.Method: BDS formulates defense as Bayesian inference, learning posterior distribution of data safety attributes and constraining fine-tuning by weighting data based on sampled safety attributes. Uses neural scheduler for efficient transfer.
Result: Comprehensive results across diverse attack and defense settings demonstrate state-of-the-art performance.
Conclusion: BDS provides an effective adaptive defense against harmful fine-tuning without requiring attack simulation, leveraging Bayesian inference to tailor defense to specific datasets.
Abstract: Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point’s safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.
[289] A Polynomial-time Algorithm for Online Sparse Linear Regression with Improved Regret Bound under Weaker Conditions
Junfan Li, Shizhong Liao, Zenglin Xu, Liqiang Nie
Main category: cs.LG
TL;DR: A new polynomial-time algorithm for online sparse linear regression with k-attribute access constraints, improving regret bounds under weaker compatibility conditions using Dantzig Selector with novel techniques.
Details
Motivation: Online sparse linear regression with limited attribute access (k out of d) is NP-hard, and previous algorithms required strong assumptions like linear independence or restricted isometry property.Method: Uses Dantzig Selector with algorithm-dependent covariance sampling, adaptive parameter tuning, and batching online Newton steps with careful initialization. Also extends to additional observations setting.
Result: Significantly improves previous regret bounds under weaker compatibility condition, with tighter convergence rates for ℓ1-norm error. Also improves bounds for OSLR with additional observations.
Conclusion: The proposed algorithm achieves better performance with weaker assumptions through novel techniques in sampling, parameter tuning, and analysis methods for handling non-independent variables.
Abstract: In this paper, we study the problem of online sparse linear regression (OSLR) where the algorithms are restricted to accessing only $k$ out of $d$ attributes per instance for prediction, which was proved to be NP-hard. Previous work gave polynomial-time algorithms assuming the data matrix satisfies the linear independence of features, the compatibility condition, or the restricted isometry property. We introduce a new polynomial-time algorithm, which significantly improves previous regret bounds (Ito et al., 2017) under the compatibility condition that is weaker than the other two assumptions. The improvements benefit from a tighter convergence rate of the $\ell_1$-norm error of our estimators. Our algorithm leverages the well-studied Dantzig Selector, but importantly with several novel techniques, including an algorithm-dependent sampling scheme for estimating the covariance matrix, an adaptive parameter tuning scheme, and a batching online Newton step with careful initializations. We also give novel and non-trivial analyses, including an induction method for analyzing the $\ell_1$-norm error, careful analyses on the covariance of non-independent random variables, and a decomposition on the regret. We further extend our algorithm to OSLR with additional observations where the algorithms can observe additional $k_0$ attributes after each prediction, and improve previous regret bounds (Kale et al., 2017; Ito et al., 2017).
[290] SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference
Zongshun Zhang, Ibrahim Matta
Main category: cs.LG
TL;DR: SERFLOW is a system that dynamically offloads ML model partitions across FaaS and IaaS services, using stage-specific resource provisioning and adaptive load balancing to reduce cloud costs by over 23% while handling dynamic workloads.
Details
Motivation: Prior work on ML model partitioning overlooks real-world factors like VM cold starts and long-tail service time distributions, leading to inefficient resource utilization and high costs when requests exit at different stages of the model.Method: Models ML queries as traversing an acyclic sequence of stages, uses FaaS-based serverless functions with stage-specific resource provisioning that accounts for request exit rates at each stage, and integrates adaptive load balancing across VMs and serverless functions.
Result: Reduces cloud costs by over 23% while efficiently adapting to dynamic workloads, addressing the challenge of varying input-dependent exit rates that make single resource configurations inefficient.
Conclusion: SERFLOW effectively balances processing and transmission delays while minimizing costs for adaptive inference applications by leveraging hybrid FaaS/IaaS resource orchestration with intelligent stage-aware provisioning.
Abstract: Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing costs of adaptive inference applications. However, prior work often overlooks real-world factors, such as Virtual Machine (VM) cold starts, requests under long-tail service time distributions, etc. To tackle these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier where requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling to handle request bursts reaching deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over $23%$ while efficiently adapting to dynamic workloads.
[291] MDAS-GNN: Multi-Dimensional Spatiotemporal GNN with Spatial Diffusion for Urban Traffic Risk Forecasting
Ziyuan Gao
Main category: cs.LG
TL;DR: MDAS-GNN is a multi-dimensional attention-based graph neural network that integrates traffic safety, infrastructure, and environmental risk dimensions for superior accident prediction, reducing errors by up to 40% compared to traditional methods.
Details
Motivation: Traditional accident prediction models fail to capture complex spatial relationships and temporal dependencies in urban transportation networks, while traffic accidents claim over 1.35 million lives annually worldwide.Method: Develops MDAS-GNN with multi-dimensional attention-based spatial-diffusion GNN, integrating three risk dimensions (traffic safety, infrastructure, environmental) using feature-specific spatial diffusion mechanisms and multi-head temporal attention.
Result: Achieves superior performance on UK Department for Transport data across three cities, maintains consistently low prediction errors across all time horizons, with particular strength in long-term forecasting. Multi-dimensional features reduce prediction errors by up to 40%.
Conclusion: Provides civil engineers and urban planners with advanced predictive capabilities for transportation infrastructure design, enabling data-driven decisions for road network optimization and safety interventions.
Abstract: Traffic accidents represent a critical public health challenge, claiming over 1.35 million lives annually worldwide. Traditional accident prediction models treat road segments independently, failing to capture complex spatial relationships and temporal dependencies in urban transportation networks. This study develops MDAS-GNN, a Multi-Dimensional Attention-based Spatial-diffusion Graph Neural Network integrating three core risk dimensions: traffic safety, infrastructure, and environmental risk. The framework employs feature-specific spatial diffusion mechanisms and multi-head temporal attention to capture dependencies across different time horizons. Evaluated on UK Department for Transport accident data across Central London, South Manchester, and SE Birmingham, MDASGNN achieves superior performance compared to established baseline methods. The model maintains consistently low prediction errors across short, medium, and long-term periods, with particular strength in long-term forecasting. Ablation studies confirm that integrated multi-dimensional features outperform singlefeature approaches, reducing prediction errors by up to 40%. This framework provides civil engineers and urban planners with advanced predictive capabilities for transportation infrastructure design, enabling data-driven decisions for road network optimization, infrastructure resource improvements, and strategic safety interventions in urban development projects.
[292] Feature-Function Curvature Analysis: A Geometric Framework for Explaining Differentiable Models
Hamed Najafi, Dongsheng Luo, Jason Liu
Main category: cs.LG
TL;DR: FFCA is a novel XAI framework that analyzes model geometry through 4D feature signatures and tracks their evolution during training, providing insights into hierarchical learning and practical diagnostics for model capacity and overfitting.
Details
Motivation: Current XAI methods provide incomplete static explanations that collapse feature roles into single scores, failing to capture non-linearity, interactions, and the dynamic learning process.Method: Feature-Function Curvature Analysis (FFCA) quantifies features using 4-dimensional signatures (Impact, Volatility, Non-linearity, Interaction) and extends to Dynamic Archetype Analysis to track signature evolution throughout training.
Result: Provides first direct empirical evidence of hierarchical learning (simple linear effects before complex interactions) and novel diagnostics for identifying insufficient model capacity and predicting overfitting onset.
Conclusion: FFCA transforms model explanation from simple quantification to nuanced analysis of the entire learning process through geometric context in both static and dynamic components.
Abstract: Explainable AI (XAI) is critical for building trust in complex machine learning models, yet mainstream attribution methods often provide an incomplete, static picture of a model’s final state. By collapsing a feature’s role into a single score, they are confounded by non-linearity and interactions. To address this, we introduce Feature-Function Curvature Analysis (FFCA), a novel framework that analyzes the geometry of a model’s learned function. FFCA produces a 4-dimensional signature for each feature, quantifying its: (1) Impact, (2) Volatility, (3) Non-linearity, and (4) Interaction. Crucially, we extend this framework into Dynamic Archetype Analysis, which tracks the evolution of these signatures throughout the training process. This temporal view moves beyond explaining what a model learned to revealing how it learns. We provide the first direct, empirical evidence of hierarchical learning, showing that models consistently learn simple linear effects before complex interactions. Furthermore, this dynamic analysis provides novel, practical diagnostics for identifying insufficient model capacity and predicting the onset of overfitting. Our comprehensive experiments demonstrate that FFCA, through its static and dynamic components, provides the essential geometric context that transforms model explanation from simple quantification to a nuanced, trustworthy analysis of the entire learning process.
[293] Soft Task-Aware Routing of Experts for Equivariant Representation Learning
Jaebyeong Jeon, Hyeonseo Jang, Jy-yong Sohn, Kibok Lee
Main category: cs.LG
TL;DR: STAR introduces a soft task-aware routing strategy for projection heads that models them as experts to capture shared or task-specific information, reducing redundant feature learning in joint invariant and equivariant representation learning.
Details
Motivation: Current methods use separate projection heads for invariant and equivariant learning, which overlooks shared information and leads to redundant feature learning and inefficient model capacity usage.Method: Soft Task-Aware Routing (STAR) models projection heads as experts that specialize in capturing either shared or task-specific information, reducing redundancy between invariant and equivariant embeddings.
Result: Experimental results show lower canonical correlations between invariant and equivariant embeddings and consistent improvements across diverse transfer learning tasks.
Conclusion: STAR effectively reduces redundant feature learning in joint invariant and equivariant representation learning, leading to better performance on downstream tasks.
Abstract: Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce Soft Task-Aware Routing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at https://github.com/YonseiML/star.
[294] FedSM: Robust Semantics-Guided Feature Mixup for Bias Reduction in Federated Learning with Long-Tail Data
Jingrui Zhang, Yimeng Xu, Shujie Li, Feng Liang, Haihan Duan, Yanjie Dong, Victor C. M. Leung, Xiping Hu
Main category: cs.LG
TL;DR: FedSM is a client-centric federated learning framework that addresses model bias from non-IID and long-tail data distributions through semantics-guided feature mixup and lightweight classifier retraining.
Details
Motivation: Federated Learning suffers from biased global models due to non-IID and long-tail data distributions across decentralized clients.Method: Uses pretrained image-text-aligned model to compute semantic relevance, guides category selection for local feature mixup with global prototypes to generate class-consistent pseudo-features, and employs probabilistic category selection to handle domain shift.
Result: Extensive experiments on long-tail datasets show FedSM consistently outperforms state-of-the-art methods in accuracy, with high robustness to domain shift and computational efficiency.
Conclusion: FedSM effectively mitigates bias in federated learning through semantic-guided feature manipulation while maintaining privacy and computational efficiency.
Abstract: Federated Learning (FL) enables collaborative model training across decentralized clients without sharing private data. However, FL suffers from biased global models due to non-IID and long-tail data distributions. We propose \textbf{FedSM}, a novel client-centric framework that mitigates this bias through semantics-guided feature mixup and lightweight classifier retraining. FedSM uses a pretrained image-text-aligned model to compute category-level semantic relevance, guiding the category selection of local features to mix-up with global prototypes to generate class-consistent pseudo-features. These features correct classifier bias, especially when data are heavily skewed. To address the concern of potential domain shift between the pretrained model and the data, we propose probabilistic category selection, enhancing feature diversity to effectively mitigate biases. All computations are performed locally, requiring minimal server overhead. Extensive experiments on long-tail datasets with various imbalanced levels demonstrate that FedSM consistently outperforms state-of-the-art methods in accuracy, with high robustness to domain shift and computational efficiency.
[295] Not All Instances Are Equally Valuable: Towards Influence-Weighted Dataset Distillation
Qiyan Deng, Changqian Zheng, Lianpeng Qiao, Yuping Wang, Chengliang Chai, Lei Cao
Main category: cs.LG
TL;DR: IWD is a dataset distillation framework that uses influence functions to prioritize beneficial data instances and downweight harmful ones, improving distilled dataset quality and model performance by up to 7.8%.
Details
Motivation: Real-world datasets contain both informative and harmful instances, but existing dataset distillation methods treat all instances equally, which can degrade model performance when distilling without considering data quality.Method: Influence-Weighted Distillation (IWD) leverages influence functions to assign adaptive weights to each instance based on its estimated impact on the distillation objective, prioritizing beneficial data while downweighting less useful or harmful instances.
Result: Integrating IWD improves the quality of distilled datasets and enhances model performance with accuracy gains of up to 7.8%.
Conclusion: IWD provides a principled framework that explicitly accounts for data quality in dataset distillation and can be seamlessly integrated into diverse distillation frameworks to improve results.
Abstract: Dataset distillation condenses large datasets into synthetic subsets, achieving performance comparable to training on the full dataset while substantially reducing storage and computation costs. Most existing dataset distillation methods assume that all real instances contribute equally to the process. In practice, real-world datasets contain both informative and redundant or even harmful instances, and directly distilling the full dataset without considering data quality can degrade model performance. In this work, we present Influence-Weighted Distillation IWD, a principled framework that leverages influence functions to explicitly account for data quality in the distillation process. IWD assigns adaptive weights to each instance based on its estimated impact on the distillation objective, prioritizing beneficial data while downweighting less useful or harmful ones. Owing to its modular design, IWD can be seamlessly integrated into diverse dataset distillation frameworks. Our empirical results suggest that integrating IWD tends to improve the quality of distilled datasets and enhance model performance, with accuracy gains of up to 7.8%.
[296] ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models
Xin Tang, Youfang Han, Fangfei Gou, Wei Zhao, Xin Meng, Yang Yu, Jinguo Zhang, Yuanchun Shi, Yuntao Wang, Tengxiang Zhang
Main category: cs.LG
TL;DR: ECVL-ROUTER is a scenario-aware routing framework that dynamically selects between large cloud-based and small edge-based Vision-Language Models based on user requirements for fast response, high-quality output, or low energy consumption.
Details
Motivation: Current approaches either rely solely on large cloud models (high latency/energy) or small edge models (limited capability), but user needs vary across scenarios requiring different trade-offs between speed, quality, and efficiency.Method: Proposed ECVL-ROUTER framework with new routing strategy and evaluation metrics that dynamically selects appropriate model for each query. Also constructed multimodal response-quality dataset for router training.
Result: Successfully routes over 80% of queries to small model while maintaining less than 10% drop in problem solving probability compared to using only large models.
Conclusion: The framework effectively leverages strengths of both large and small models, achieving significant efficiency gains while maintaining acceptable quality levels for most queries.
Abstract: Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices are capable of handling simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80% of queries to the small model while incurring less than 10% drop in problem solving probability.
[297] Higher-order Linear Attention
Yifan Zhang, Zhen Qin, Quanquan Gu
Main category: cs.LG
TL;DR: HLA introduces a higher-order linear attention mechanism that enables efficient long-context modeling with constant-size state and linear time complexity, overcoming quadratic costs of standard attention.
Details
Motivation: To address the quadratic cost limitation of scaled dot-product attention in scaling autoregressive language models to long contexts, while maintaining expressivity beyond first-order approximations.Method: Develops Higher-order Linear Attention (HLA) using compact prefix sufficient statistics, closed-form streaming identities, causal masked variants, and chunk-parallel training via associative scans.
Result: HLA achieves linear-time computation without materializing n×n matrices, maintains constant-size state, and enables exact reproduction of serial recurrence activations in parallel training.
Conclusion: HLA provides a principled, scalable building block that combines attention-like data-dependent mixing with the efficiency of modern recurrent architectures for long-context modeling.
Abstract: The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
[298] ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
Han Yu, Kehan Li, Dongbai Li, Yue He, Xingxuan Zhang, Peng Cui
Main category: cs.LG
TL;DR: ODP-Bench is a comprehensive benchmark for Out-of-Distribution (OOD) performance prediction that standardizes evaluation protocols and provides trained models for fair algorithm comparisons.
Details
Motivation: To address inconsistent evaluation protocols and limited coverage of real-world OOD datasets in existing literature, enabling better deployment of trained models in risk-sensitive scenarios.Method: Proposed ODP-Bench benchmark that includes most commonly used OOD datasets and existing practical performance prediction algorithms, providing trained models as a testbench.
Result: Created a comprehensive benchmark that ensures consistent comparison and eliminates the need for repeating model training, with in-depth experimental analyses conducted.
Conclusion: ODP-Bench provides a standardized platform for fair evaluation of OOD performance prediction algorithms and helps understand their capability boundaries.
Abstract: Recently, there has been gradually more attention paid to Out-of-Distribution (OOD) performance prediction, whose goal is to predict the performance of trained models on unlabeled OOD test datasets, so that we could better leverage and deploy off-the-shelf trained models in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols in previous literature are inconsistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons for various algorithms, we propose Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes most commonly used OOD datasets and existing practical performance prediction algorithms. We provide our trained models as a testbench for future researchers, thus guaranteeing the consistency of comparison and avoiding the burden of repeating the model training process. Furthermore, we also conduct in-depth experimental analyses to better understand their capability boundary.
[299] HiF-DTA: Hierarchical Feature Learning Network for Drug-Target Affinity Prediction
Minghui Li, Yuanhang Wang, Peijin Guo, Wei Wan, Shengshan Hu, Shengqing Hu
Main category: cs.LG
TL;DR: HiF-DTA is a hierarchical network for drug-target affinity prediction that extracts both global sequence semantic and local topological features from drug and protein sequences, and models drugs at multiple scales (atomic, substructural, molecular) with multi-scale bilinear attention fusion.
Details
Motivation: Existing sequence-based deep learning methods for DTA prediction overlook simultaneous modeling of global sequence semantic features and local topological structural features, and represent drugs as flat sequences without multi-scale features, limiting prediction accuracy.Method: Dual-pathway strategy to extract global sequence semantic and local topological features from drug and protein sequences; multi-scale drug modeling with atomic, substructural, and molecular representations fused via multi-scale bilinear attention module.
Result: Outperforms state-of-the-art baselines on Davis, KIBA, and Metz datasets; ablation studies confirm importance of global-local feature extraction and multi-scale fusion.
Conclusion: HiF-DTA effectively improves DTA prediction by simultaneously capturing global semantic and local topological features, and modeling drugs at multiple scales with attention-based fusion.
Abstract: Accurate prediction of Drug-Target Affinity (DTA) is crucial for reducing experimental costs and accelerating early screening in computational drug discovery. While sequence-based deep learning methods avoid reliance on costly 3D structures, they still overlook simultaneous modeling of global sequence semantic features and local topological structural features within drugs and proteins, and represent drugs as flat sequences without atomic-level, substructural-level, and molecular-level multi-scale features. We propose HiF-DTA, a hierarchical network that adopts a dual-pathway strategy to extract both global sequence semantic and local topological features from drug and protein sequences, and models drugs multi-scale to learn atomic, substructural, and molecular representations fused via a multi-scale bilinear attention module. Experiments on Davis, KIBA, and Metz datasets show HiF-DTA outperforms state-of-the-art baselines, with ablations confirming the importance of global-local extraction and multi-scale fusion.
[300] Un-Attributability: Computing Novelty From Retrieval & Semantic Similarity
Philipp Davydov, Ameya Prabhu, Matthias Bethge, Elisa Nguyen, Seong Joon Oh
Main category: cs.LG
TL;DR: The paper introduces un-attributability as a measure of semantic novelty in language model outputs, using a retrieval-based approach to identify outputs that cannot be attributed to any pretraining examples.
Details
Motivation: To understand how language model outputs relate to pretraining data and identify semantically novel outputs that cannot be traced back to the training corpus.Method: A two-stage retrieval pipeline: first index the corpus with GIST embeddings to retrieve top candidates, then rerank with ColBERTv2. Outputs are considered novel if the nearest corpus item is less attributable than human-generated text references.
Result: Three key findings: (1) models use pretraining data across longer spans than previously thought; (2) some domains systematically affect novelty; (3) instruction tuning increases novelty beyond just style changes.
Conclusion: Reframing novelty assessment around un-attributability enables efficient large-scale analysis of language model behavior and their relationship to training data.
Abstract: Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
[301] Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments
Harsh Vishwakarma, Ankush Agarwal, Ojas Patil, Chaitanya Devaguptapu, Mahesh Chandran
Main category: cs.LG
TL;DR: EnterpriseBench is a comprehensive benchmark for evaluating LLM-based systems in enterprise environments, featuring 500 diverse tasks across multiple domains that simulate real enterprise challenges like data fragmentation and access controls.
Details
Motivation: Enterprise systems need LLM integration for intelligent automation and efficiency, but development is challenging due to complex enterprise environments with fragmented data and strict access controls.Method: Created EnterpriseBench with 500 tasks across software engineering, HR, finance, and administrative domains, plus a data generation pipeline that creates internally consistent tasks from organizational metadata.
Result: Experiments with state-of-the-art LLM agents show only 41.8% task completion rate, indicating significant room for improvement in enterprise AI systems.
Conclusion: EnterpriseBench provides a realistic testbed for enterprise AI systems, revealing current limitations and highlighting the need for better models that can handle enterprise complexity.
Abstract: Enterprise systems are crucial for enhancing productivity and decision-making among employees and customers. Integrating LLM based systems into enterprise systems enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls. We present EnterpriseBench, a comprehensive benchmark that simulates enterprise settings, featuring 500 diverse tasks across software engineering, HR, finance, and administrative domains. Our benchmark uniquely captures key enterprise characteristics including data source fragmentation, access control hierarchies, and cross-functional workflows. Additionally, we provide a novel data generation pipeline that creates internally consistent enterprise tasks from organizational metadata. Experiments with state-of-the-art LLM agents demonstrate that even the most capable models achieve only 41.8% task completion, highlighting significant opportunities for improvement in enterprise-focused AI systems.
[302] Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Austin Meek, Eitan Sprejer, Iván Arcuschin, Austin J. Brockmeier, Steven Basart
Main category: cs.LG
TL;DR: The paper introduces ‘verbosity’ as a key factor in chain-of-thought (CoT) monitoring, combining it with faithfulness to create a holistic monitorability score that measures how well CoT serves as external working memory for model safety monitoring.
Details
Motivation: Current CoT faithfulness measures are limited, focusing only on cases where models change answers after cues. This misses important information when models maintain answers and doesn't examine reasoning aspects unrelated to cues, creating gaps in safety monitoring capabilities.Method: The authors introduce ‘verbosity’ - whether CoT lists all factors needed to solve tasks - and combine it with faithfulness into a unified monitorability score. They evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU benchmarks using their Inspect library.
Result: Models can appear faithful yet remain hard to monitor when they omit key factors in their reasoning. Monitorability varies significantly across different model families, highlighting the importance of both faithfulness and verbosity for effective safety monitoring.
Conclusion: A holistic monitorability score combining faithfulness and verbosity provides better assessment of CoT’s utility as external working memory for safety monitoring. The released evaluation code supports reproducible future work in this area.
Abstract: Chain-of-thought (CoT) outputs let us read a model’s step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model’s external `working memory’, a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
[303] Temporal Cardiovascular Dynamics for Improved PPG-Based Heart Rate Estimation
Berken Utku Demirel, Christian Holz
Main category: cs.LG
TL;DR: A novel approach that uses mutual information to analyze heart rate’s chaotic behavior, improving deep learning-based heart rate estimation by up to 40% while reducing reliance on multiple sensors and post-processing.
Details
Motivation: Heart rate oscillations are complex and non-linear, presenting challenges for practical cardiovascular health monitoring in everyday life. Traditional methods struggle with this chaotic behavior.Method: Proposed approach uses mutual information to study non-linear chaotic behavior of heart rate, combining mathematical chaos theory with deep learning solutions to handle temporal complexity.
Result: Validated on four real-life datasets, the method shows up to 40% improvement in heart rate estimation compared to traditional methods and existing machine learning techniques, while reducing sensor requirements and eliminating post-processing.
Conclusion: The mutual information-based approach successfully handles heart rate’s non-linear complexity and significantly enhances estimation accuracy in real-world conditions, offering a more practical solution for cardiovascular monitoring.
Abstract: The oscillations of the human heart rate are inherently complex and non-linear – they are best described by mathematical chaos, and they present a challenge when applied to the practical domain of cardiovascular health monitoring in everyday life. In this work, we study the non-linear chaotic behavior of heart rate through mutual information and introduce a novel approach for enhancing heart rate estimation in real-life conditions. Our proposed approach not only explains and handles the non-linear temporal complexity from a mathematical perspective but also improves the deep learning solutions when combined with them. We validate our proposed method on four established datasets from real-life scenarios and compare its performance with existing algorithms thoroughly with extensive ablation experiments. Our results demonstrate a substantial improvement, up to 40%, of the proposed approach in estimating heart rate compared to traditional methods and existing machine-learning techniques while reducing the reliance on multiple sensing modalities and eliminating the need for post-processing steps.
[304] Atlas-Alignment: Making Interpretability Transferable Across Language Models
Bruno Puri, Jim Berend, Sebastian Lapuschkin, Wojciech Samek
Main category: cs.LG
TL;DR: Atlas-Alignment enables interpretability transfer across language models by aligning unknown latent spaces to a pre-labeled Concept Atlas using lightweight representational alignment, eliminating the need for costly model-specific sparse autoencoders and manual labeling.
Details
Motivation: Existing interpretability pipelines are costly and difficult to scale, requiring expensive training of model-specific sparse autoencoders, manual component labeling, and validation for each new model.Method: Transfer interpretability by aligning unknown latent spaces to a Concept Atlas using shared inputs and lightweight representational alignment techniques, enabling semantic feature search and steerable generation.
Result: Quantitative and qualitative evaluations show that simple representational alignment enables robust semantic retrieval and steerable generation without labeled concept data.
Conclusion: Atlas-Alignment amortizes explainable AI costs by allowing one high-quality Concept Atlas to make many new models transparent and controllable at minimal marginal cost.
Abstract: Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires costly training of model-specific sparse autoencoders, manual or semi-automated labeling of SAE components, and their subsequent validation. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
[305] Binary Anomaly Detection in Streaming IoT Traffic under Concept Drift
Rodrigo Matos Carnier, Laura Lahesoo, Kensuke Fukuda
Main category: cs.LG
TL;DR: This paper compares batch vs streaming learning for IoT anomaly detection, showing streaming methods like Adaptive Random Forest and Hoeffding Adaptive Tree handle concept drift better with lower computational costs.
Details
Motivation: Traditional batch ML models struggle with concept drift and high maintenance in IoT networks, while streaming learning offers online updates and better adaptability to changing anomaly patterns.Method: Simulated heterogeneous IoT network data streams by mixing existing datasets and streaming samples sequentially. Compared batch and streaming learning approaches, focusing on tree-based vs non-tree-based algorithms for binary classification of anomalies.
Result: Adaptive Random Forest achieved F1-score of 0.990 ± 0.006 at one-third the computational cost of batch methods. Hoeffding Adaptive Tree reached F1-score of 0.910 ± 0.007 with four times lower computational cost, making it suitable for online applications despite slight stability trade-offs.
Conclusion: Streaming learning outperforms batch learning for IoT anomaly detection by handling concept drift effectively, with tree-based algorithms being particularly competitive. However, current datasets still lack sufficient heterogeneity to fully expose model limitations.
Abstract: With the growing volume of Internet of Things (IoT) network traffic, machine learning (ML)-based anomaly detection is more relevant than ever. Traditional batch learning models face challenges such as high maintenance and poor adaptability to rapid anomaly changes, known as concept drift. In contrast, streaming learning integrates online and incremental learning, enabling seamless updates and concept drift detection to improve robustness. This study investigates anomaly detection in streaming IoT traffic as binary classification, comparing batch and streaming learning approaches while assessing the limitations of current IoT traffic datasets. We simulated heterogeneous network data streams by carefully mixing existing datasets and streaming the samples one by one. Our results highlight the failure of batch models to handle concept drift, but also reveal persisting limitations of current datasets to expose model limitations due to low traffic heterogeneity. We also investigated the competitiveness of tree-based ML algorithms, well-known in batch anomaly detection, and compared it to non-tree-based ones, confirming the advantages of the former. Adaptive Random Forest achieved F1-score of 0.990 $\pm$ 0.006 at one-third the computational cost of its batch counterpart. Hoeffding Adaptive Tree reached F1-score of 0.910 $\pm$ 0.007, reducing computational cost by four times, making it a viable choice for online applications despite a slight trade-off in stability.
[306] Thought Branches: Interpreting LLM Reasoning Requires Resampling
Uzay Macar, Paul C. Bogdan, Senthooran Rajamanoharan, Neel Nanda
Main category: cs.LG
TL;DR: This paper argues that studying single chain-of-thought (CoT) samples is inadequate for understanding causal influence in reasoning models. Instead, they propose using resampling methods to analyze distributions over CoTs, demonstrating four applications: measuring causal impact of stated reasons, comparing on-policy vs off-policy interventions, assessing resilience of reasoning steps, and analyzing subtle influences in unfaithful reasoning.
Details
Motivation: Most existing work on interpreting reasoning models only studies single chain-of-thought samples, but these models actually define distributions over many possible CoTs. Studying just one sample is inadequate for understanding causal influence and the underlying computational processes.Method: The authors use resampling methods to investigate model decisions through four case studies: 1) measuring causal impact of specific sentences in agentic misalignment scenarios, 2) comparing on-policy resampling vs off-policy artificial edits, 3) introducing resilience metrics to prevent repeated reasoning steps, and 4) adapting causal mediation analysis for unfaithful reasoning settings.
Result: Key findings include: self-preservation sentences have small causal impact on blackmail behavior; off-policy interventions yield small and unstable effects compared to resampling; critical planning statements resist removal but have large effects when eliminated; hints exert subtle cumulative influence on CoT even when removed.
Conclusion: Studying distributions of chain-of-thought via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions, providing a more comprehensive understanding of reasoning models than single-sample analysis.
Abstract: Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In “agentic misalignment” scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? These are common in literature, yet take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes “unfaithful”, can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that have a causal effect on the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
[307] MedM2T: A MultiModal Framework for Time-Aware Modeling with Electronic Health Record and Electrocardiogram Data
Yu-Chen Kuo, Yi-Ju Tseng
Main category: cs.LG
TL;DR: MedM2T is a time-aware multimodal framework for medical data that handles irregular time series, captures temporal patterns, and enables cross-modal interactions, achieving state-of-the-art performance on cardiovascular disease prediction, mortality prediction, and ICU length-of-stay regression.
Details
Motivation: Medical data presents challenges due to its multimodality and heterogeneous temporal structures, requiring specialized approaches to handle irregular time series and capture complex temporal patterns across different data types.Method: MedM2T integrates three key components: Sparse Time Series Encoder for irregular data, Hierarchical Time-Aware Fusion for capturing micro- and macro-temporal patterns from dense time series, and Bi-Modal Attention for cross-modal interactions. It uses modality-specific pre-trained encoders and aligns features in a shared encoder to address granularity gaps.
Result: MedM2T outperformed state-of-the-art methods on MIMIC-IV datasets, achieving AUROC of 0.947 and AUPRC of 0.706 for CVD prediction, AUROC of 0.901 and AUPRC of 0.558 for mortality prediction, and MAE of 2.31 for LOS regression.
Conclusion: MedM2T demonstrates robustness and broad applicability for clinical prediction tasks, positioning it as a promising tool for handling complex medical data with superior performance compared to existing multimodal and time series models.
Abstract: The inherent multimodality and heterogeneous temporal structures of medical data pose significant challenges for modeling. We propose MedM2T, a time-aware multimodal framework designed to address these complexities. MedM2T integrates: (i) Sparse Time Series Encoder to flexibly handle irregular and sparse time series, (ii) Hierarchical Time-Aware Fusion to capture both micro- and macro-temporal patterns from multiple dense time series, such as ECGs, and (iii) Bi-Modal Attention to extract cross-modal interactions, which can be extended to any number of modalities. To mitigate granularity gaps between modalities, MedM2T uses modality-specific pre-trained encoders and aligns resulting features within a shared encoder. We evaluated MedM2T on MIMIC-IV and MIMIC-IV-ECG datasets for three tasks that encompass chronic and acute disease dynamics: 90-day cardiovascular disease (CVD) prediction, in-hospital mortality prediction, and ICU length-of-stay (LOS) regression. MedM2T outperformed state-of-the-art multimodal learning frameworks and existing time series models, achieving an AUROC of 0.947 and an AUPRC of 0.706 for CVD prediction; an AUROC of 0.901 and an AUPRC of 0.558 for mortality prediction; and Mean Absolute Error (MAE) of 2.31 for LOS regression. These results highlight the robustness and broad applicability of MedM2T, positioning it as a promising tool in clinical prediction. We provide the implementation of MedM2T at https://github.com/DHLab-TSENG/MedM2T.
[308] Reasoning Models Sometimes Output Illegible Chains of Thought
Arun Jose
Main category: cs.LG
TL;DR: RL-trained language models using chain-of-thought reasoning often generate illegible reasoning while maintaining readable final answers, undermining monitoring effectiveness.
Details
Motivation: To understand if chain-of-thought reasoning can be used to monitor model intentions and detect malicious behavior, which requires legible and faithful reasoning processes.Method: Studied CoT legibility across 14 reasoning models, analyzed accuracy when forced to use only legible portions, and examined legibility correlation with performance through resampling.
Result: RL often causes reasoning to become illegible to both humans and AI monitors, with accuracy dropping 53% when using only legible portions. Legibility degrades on harder questions, but no correlation found between legibility and performance in resampling.
Conclusion: Without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.
Abstract: Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model’s CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs are legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling - suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.
[309] FedMuon: Accelerating Federated Learning with Matrix Orthogonalization
Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, Jin Liu
Main category: cs.LG
TL;DR: FedMuon is a novel federated learning optimizer that addresses client drift in non-IID settings by using momentum aggregation and local-global alignment, achieving linear speedup convergence and reducing communication rounds.
Details
Motivation: Existing FL methods using element-wise optimizers (Adam/SGD) neglect geometric structure of weight matrices, amplifying pathological directions and leading to slow convergence. The Muon optimizer shows promise in IID settings but causes client drift in non-IID settings due to independent matrix orthogonalization.Method: Proposes FedMuon with two key techniques: (1) momentum aggregation - clients use aggregated momentum for local initialization; (2) local-global alignment - local gradients are aligned with global update direction to reduce client drift.
Result: Theoretically proven linear speedup convergence rate without heterogeneity assumption. Empirically validated on language and vision models, showing significant reduction in communication rounds and improved test accuracy compared to baselines.
Conclusion: FedMuon effectively addresses client drift challenges in non-IID FL settings through momentum aggregation and local-global alignment, achieving superior performance in reducing communication rounds and improving model accuracy.
Abstract: The core bottleneck of Federated Learning (FL) lies in the communication rounds. That is, how to achieve more effective local updates is crucial for reducing communication rounds. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often leads to the amplification of pathological directions in the weights during local updates, leading deterioration in the condition number and slow convergence. Therefore, we introduce the Muon optimizer in local, which has matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in IID setting, Local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to Local SGD and Local AdamW. However, in non-IID setting, independent matrix orthogonalization based on the local distributions of each client induces strong client drift. Applying Muon in non-IID FL poses significant challenges: (1) client preconditioner leading to client drift; (2) moment reinitialization. To address these challenges, we propose a novel Federated Muon optimizer (FedMuon), which incorporates two key techniques: (1) momentum aggregation, where clients use the aggregated momentum for local initialization; (2) local-global alignment, where the local gradients are aligned with the global update direction to significantly reduce client drift. Theoretically, we prove that \texttt{FedMuon} achieves a linear speedup convergence rate without the heterogeneity assumption, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. Empirically, we validate the effectiveness of FedMuon on language and vision models. Compared to several baselines, FedMuon significantly reduces communication rounds and improves test accuracy.
[310] MVeLMA: Multimodal Vegetation Loss Modeling Architecture for Predicting Post-fire Vegetation Loss
Meenu Ravi, Shailik Sarkar, Yanshen Sun, Vaishnavi Singh, Chang-Tien Lu
Main category: cs.LG
TL;DR: MVeLMA is a novel multimodal ML pipeline that predicts county-wise vegetation loss after wildfires using ensemble methods and probabilistic modeling, outperforming SOTA models and providing confidence maps for targeted recovery efforts.
Details
Motivation: Current wildfire vegetation loss prediction methods lack comprehensive factor exploration, multimodal integration, and interpretability, limiting their real-world applicability for ecological recovery planning.Method: Proposes MVeLMA - a multimodal feature integration pipeline with stacked ensemble architecture that incorporates uncertainty estimation through probabilistic modeling to predict vegetation loss.
Result: MVeLMA outperforms state-of-the-art and baseline models in predicting post-wildfire vegetation loss and generates confidence maps to identify high-risk counties.
Conclusion: The approach enables more effective disaster relief planning, ecological policy development, and wildlife recovery management through improved predictive accuracy and interpretability.
Abstract: Understanding post-wildfire vegetation loss is critical for developing effective ecological recovery strategies and is often challenging due to the extended time and effort required to capture the evolving ecosystem features. Recent works in this area have not fully explored all the contributing factors, their modalities, and interactions with each other. Furthermore, most research in this domain is limited by a lack of interpretability in predictive modeling, making it less useful in real-world settings. In this work, we propose a novel end-to-end ML pipeline called MVeLMA (\textbf{M}ultimodal \textbf{Ve}getation \textbf{L}oss \textbf{M}odeling \textbf{A}rchitecture) to predict county-wise vegetation loss from fire events. MVeLMA uses a multimodal feature integration pipeline and a stacked ensemble-based architecture to capture different modalities while also incorporating uncertainty estimation through probabilistic modeling. Through comprehensive experiments, we show that our model outperforms several state-of-the-art (SOTA) and baseline models in predicting post-wildfire vegetation loss. Furthermore, we generate vegetation loss confidence maps to identify high-risk counties, thereby helping targeted recovery efforts. The findings of this work have the potential to inform future disaster relief planning, ecological policy development, and wildlife recovery management.
[311] Spectral Neural Graph Sparsification
Angelica Liguori, Ettore Ritacco, Pietro Sabatino, Annalisa Socievole
Main category: cs.LG
TL;DR: Spectral Preservation Network is a new graph representation learning framework that generates reduced graphs as faithful proxies of the original, enabling efficient downstream tasks while preserving spectral properties.
Details
Motivation: Overcome limitations of Graph Neural Networks including reliance on fixed structures and susceptibility to over-smoothing, while enabling efficient graph processing for tasks like community detection and information diffusion.Method: Introduces Joint Graph Evolution layer for adaptive transformation of graph topology and node features, and Spectral Concordance loss for enforcing consistency in spectral properties and node features across transformations.
Result: Superior performance in node-level sparsification compared to state-of-the-art methods, demonstrating clear advantages in experimental evaluations.
Conclusion: The proposed framework effectively generates reduced graphs that serve as faithful proxies while maintaining computational efficiency, addressing key limitations of existing graph learning approaches.
Abstract: Graphs are central to modeling complex systems in domains such as social networks, molecular chemistry, and neuroscience. While Graph Neural Networks, particularly Graph Convolutional Networks, have become standard tools for graph learning, they remain constrained by reliance on fixed structures and susceptibility to over-smoothing. We propose the Spectral Preservation Network, a new framework for graph representation learning that generates reduced graphs serving as faithful proxies of the original, enabling downstream tasks such as community detection, influence propagation, and information diffusion at a reduced computational cost. The Spectral Preservation Network introduces two key components: the Joint Graph Evolution layer and the Spectral Concordance loss. The former jointly transforms both the graph topology and the node feature matrix, allowing the structure and attributes to evolve adaptively across layers and overcoming the rigidity of static neighborhood aggregation. The latter regularizes these transformations by enforcing consistency in both the spectral properties of the graph and the feature vectors of the nodes. We evaluate the effectiveness of Spectral Preservation Network on node-level sparsification by analyzing well-established metrics and benchmarking against state-of-the-art methods. The experimental results demonstrate the superior performance and clear advantages of our approach.
[312] Simplex-to-Euclidean Bijections for Categorical Flow Matching
Bernardo Williams, Victor M. Yeom-Song, Marcelo Hartmann, Arto Klami
Main category: cs.LG
TL;DR: A method for learning and sampling from simplex distributions using Aitchison geometry-based bijections to Euclidean space, with Dirichlet interpolation for categorical data modeling.
Details
Motivation: To enable density modeling in Euclidean space while respecting the simplex structure and supporting exact recovery of discrete distributions, overcoming limitations of Riemannian geometry approaches.Method: Maps open simplex to Euclidean space via smooth bijections based on Aitchison geometry, uses Dirichlet interpolation to dequantize discrete observations into continuous ones for categorical data modeling.
Result: Achieves competitive performance on both synthetic and real-world datasets while operating in Euclidean space with Aitchison geometry preservation.
Conclusion: The proposed approach effectively enables simplex distribution modeling in Euclidean space with exact discrete distribution recovery, outperforming previous Riemannian geometry methods.
Abstract: We propose a method for learning and sampling from probability distributions supported on the simplex. Our approach maps the open simplex to Euclidean space via smooth bijections, leveraging the Aitchison geometry to define the mappings, and supports modeling categorical data by a Dirichlet interpolation that dequantizes discrete observations into continuous ones. This enables density modeling in Euclidean space through the bijection while still allowing exact recovery of the original discrete distribution. Compared to previous methods that operate on the simplex using Riemannian geometry or custom noise processes, our approach works in Euclidean space while respecting the Aitchison geometry, and achieves competitive performance on both synthetic and real-world data sets.
[313] FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models
Junkang Liu, Fanhua Shang, Kewen Zhu, Hongying Liu, Yuanyuan Liu, Jin Liu
Main category: cs.LG
TL;DR: FedAdamW is a federated learning algorithm that adapts AdamW optimizer to address challenges in federated settings, including data heterogeneity, local overfitting, and slow convergence from moment reinitialization.
Details
Motivation: AdamW is effective for large-scale models but faces challenges in federated learning: high variance in second-moment estimates due to data heterogeneity, local overfitting causing client drift, and slow convergence from reinitializing moment estimates each round.Method: FedAdamW uses local correction mechanism and decoupled weight decay to align local updates with global updates, aggregates mean of second-moment estimates to reduce variance, and avoids reinitializing moment estimates.
Result: Theoretically proven linear speedup convergence rate without heterogeneity assumption. Empirically validated on language and vision Transformer models, showing significant reduction in communication rounds and improved test accuracy compared to baselines.
Conclusion: FedAdamW effectively adapts AdamW for federated learning, addressing key challenges and demonstrating superior performance in both theoretical analysis and empirical evaluation.
Abstract: AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) Reinitializing moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \underline{Fed}erated \underline{AdamW} algorithm, called \texttt{FedAdamW}, for training and fine-tuning various large models. \texttt{FedAdamW} aligns local updates with the global update using both a \textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \texttt{FedAdamW} efficiently aggregates the \texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \texttt{FedAdamW} achieves a linear speedup convergence rate of $\mathcal{O}(\sqrt{(L \Delta \sigma_l^2)/(S K R \epsilon^2)}+(L \Delta)/R)$ without \textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \texttt{FedAdamW} on language and vision Transformer models. Compared to several baselines, \texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available in https://github.com/junkangLiu0/FedAdamW.
[314] InertialAR: Autoregressive 3D Molecule Generation with Inertial Frames
Haorui Li, Weitao Du, Yuqiang Li, Hongyu Guo, Shengchao Liu
Main category: cs.LG
TL;DR: InertialAR is a Transformer-based autoregressive model for 3D molecule generation that achieves state-of-the-art performance through canonical tokenization and geometric-aware attention mechanisms.
Details
Motivation: Transformer-based models have succeeded in text and images but face challenges in 3D molecule generation due to SE(3) transformation and permutation invariance requirements, and the need to handle hybrid discrete-continuous atom representations.Method: Uses canonical tokenization aligned to inertial frames with atom reordering for SE(3) and permutation invariance. Implements geometric rotary positional encoding (GeoRoPE) for geometric awareness and hierarchical autoregressive paradigm predicting atom type first then 3D coordinates via Diffusion loss.
Result: Achieves state-of-the-art performance on 7 of 10 metrics for unconditional generation across QM9, GEOM-Drugs, and B3BYLP datasets. Significantly outperforms baselines in controllable generation for targeted chemical functionality, attaining SOTA across all 5 metrics.
Conclusion: InertialAR successfully extends Transformer-based autoregressive models to 3D molecule generation by addressing fundamental challenges of canonical tokenization and geometric awareness, demonstrating superior performance in both unconditional and controllable generation tasks.
Abstract: Transformer-based autoregressive models have emerged as a unifying paradigm across modalities such as text and images, but their extension to 3D molecule generation remains underexplored. The gap stems from two fundamental challenges: (1) tokenizing molecules into a canonical 1D sequence of tokens that is invariant to both SE(3) transformations and atom index permutations, and (2) designing an architecture capable of modeling hybrid atom-based tokens that couple discrete atom types with continuous 3D coordinates. To address these challenges, we introduce InertialAR. InertialAR devises a canonical tokenization that aligns molecules to their inertial frames and reorders atoms to ensure SE(3) and permutation invariance. Moreover, InertialAR equips the attention mechanism with geometric awareness via geometric rotary positional encoding (GeoRoPE). In addition, it utilizes a hierarchical autoregressive paradigm to predict the next atom-based token, predicting the atom type first and then its 3D coordinates via Diffusion loss. Experimentally, InertialAR achieves state-of-the-art performance on 7 of the 10 evaluation metrics for unconditional molecule generation across QM9, GEOM-Drugs, and B3LYP. Moreover, it significantly outperforms strong baselines in controllable generation for targeted chemical functionality, attaining state-of-the-art results across all 5 metrics.
[315] DP-FedPGN: Finding Global Flat Minima for Differentially Private Federated Learning via Penalizing Gradient Norm
Junkang Liu, Yuxuan Tian, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Junchao Zhou, Daorui Ding
Main category: cs.LG
TL;DR: DP-FedPGN is a new Client-level Differentially Private Federated Learning algorithm that introduces global gradient norm penalty to find global flat minima, improving model generalization while providing strong privacy guarantees.
Details
Motivation: Current CL-DPFL methods cause sharper loss landscapes and reduced model generalization due to differential privacy protection. Local flatness found by SAM doesn't reflect global flatness in federated settings.Method: Proposed DP-FedPGN algorithm with global gradient norm penalty added to local loss to find global flat minimum. Uses Rényi DP for privacy guarantees and reduces locally updated norm to minimize gradient clipping error.
Result: Achieved significant improvements in six visual and NLP tasks on ResNet and Transformer models compared to state-of-the-art algorithms. Eliminates impact of data heterogeneity and achieves fast convergence.
Conclusion: DP-FedPGN effectively mitigates performance degradation from DP, finds flatter global minima, reduces gradient clipping error, and provides strong privacy guarantees while maintaining model performance across diverse tasks.
Abstract: To prevent inference attacks in Federated Learning (FL) and reduce the leakage of sensitive information, Client-level Differentially Private Federated Learning (CL-DPFL) is widely used. However, current CL-DPFL methods usually result in sharper loss landscapes, which leads to a decrease in model generalization after differential privacy protection. By using Sharpness Aware Minimization (SAM), the current popular federated learning methods are to find a local flat minimum value to alleviate this problem. However, the local flatness may not reflect the global flatness in CL-DPFL. Therefore, to address this issue and seek global flat minima of models, we propose a new CL-DPFL algorithm, DP-FedPGN, in which we introduce a global gradient norm penalty to the local loss to find the global flat minimum. Moreover, by using our global gradient norm penalty, we not only find a flatter global minimum but also reduce the locally updated norm, which means that we further reduce the error of gradient clipping. From a theoretical perspective, we analyze how DP-FedPGN mitigates the performance degradation caused by DP. Meanwhile, the proposed DP-FedPGN algorithm eliminates the impact of data heterogeneity and achieves fast convergence. We also use R'enyi DP to provide strict privacy guarantees and provide sensitivity analysis for local updates. Finally, we conduct effectiveness tests on both ResNet and Transformer models, and achieve significant improvements in six visual and natural language processing tasks compared to existing state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/DP-FedPGN
[316] Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs
Zherui Yang, Zhehao Li, Kangbo Lyu, Yixuan Li, Tao Du, Ligang Liu
Main category: cs.LG
TL;DR: A learning-based method using GNNs to construct Sparse Approximate Inverse (SPAI) preconditioners for conjugate gradient solvers, avoiding triangular solves and enabling GPU parallelization while improving convergence rates.
Details
Motivation: Traditional preconditioners have limited ability to exploit data optimization, while existing learning-based methods using GNNs face challenges with triangular solves that hinder GPU parallelization and create long-range dependencies that are hard for GNNs to model.Method: Use Graph Neural Networks to construct Sparse Approximate Inverse (SPAI) preconditioners that avoid triangular solves, require only two matrix-vector products per CG step, and employ a statistics-based scale-invariant loss function that matches CG’s convergence properties.
Result: Outperforms standard preconditioners (Diagonal, IC, traditional SPAI) and previous learning-based methods on GPUs, reducing solution time by 40%-53% (68%-113% faster) with better condition numbers and superior generalization across three PDE-derived datasets and one synthetic dataset.
Conclusion: The proposed learning-based SPAI preconditioner construction using GNNs effectively addresses GPU parallelization limitations while improving convergence performance, demonstrating significant speedups and better generalization compared to traditional and existing learning-based approaches.
Abstract: The conjugate gradient solver (CG) is a prevalent method for solving symmetric and positive definite linear systems Ax=b, where effective preconditioners are crucial for fast convergence. Traditional preconditioners rely on prescribed algorithms to offer rigorous theoretical guarantees, while limiting their ability to exploit optimization from data. Existing learning-based methods often utilize Graph Neural Networks (GNNs) to improve the performance and speed up the construction. However, their reliance on incomplete factorization leads to significant challenges: the associated triangular solve hinders GPU parallelization in practice, and introduces long-range dependencies which are difficult for GNNs to model. To address these issues, we propose a learning-based method to generate GPU-friendly preconditioners, particularly using GNNs to construct Sparse Approximate Inverse (SPAI) preconditioners, which avoids triangular solves and requires only two matrix-vector products at each CG step. The locality of matrix-vector product is compatible with the local propagation mechanism of GNNs. The flexibility of GNNs also allows our approach to be applied in a wide range of scenarios. Furthermore, we introduce a statistics-based scale-invariant loss function. Its design matches CG’s property that the convergence rate depends on the condition number, rather than the absolute scale of A, leading to improved performance of the learned preconditioner. Evaluations on three PDE-derived datasets and one synthetic dataset demonstrate that our method outperforms standard preconditioners (Diagonal, IC, and traditional SPAI) and previous learning-based preconditioners on GPUs. We reduce solution time on GPUs by 40%-53% (68%-113% faster), along with better condition numbers and superior generalization performance. Source code available at https://github.com/Adversarr/LearningSparsePreconditioner4GPU
[317] Leveraging Generic Time Series Foundation Models for EEG Classification
Théo Gnassounou, Yessin Moakher, Shifeng Xie, Vasilii Feofanov, Ievgen Redko
Main category: cs.LG
TL;DR: Generalist time series foundation models pretrained on heterogeneous real-world data or synthetic signals can effectively transfer to EEG tasks, outperforming EEG-specific models.
Details
Motivation: To explore whether general-purpose time series foundation models can be effectively applied to domain-specific biomedical signals like EEG, which remains largely unexplored.Method: Tested two pretraining approaches: (1) pretraining on heterogeneous real-world time series from multiple domains, and (2) pretraining on purely synthetic data. Evaluated on EEG tasks including motor imagery classification and sleep stage prediction.
Result: Both pretraining variants yielded strong performance, consistently outperforming EEGNet (convolutional baseline) and CBraMod (EEG-specific foundation model).
Conclusion: Generalist time series foundation models can effectively transfer to EEG analysis, even when pretrained on non-neural or synthetic data, suggesting EEG can benefit from broader time series advances.
Abstract: Foundation models for time series are emerging as powerful general-purpose backbones, yet their potential for domain-specific biomedical signals such as electroencephalography (EEG) remains rather unexplored. In this work, we investigate the applicability a recently proposed time series classification foundation model, to a different EEG tasks such as motor imagery classification and sleep stage prediction. We test two pretraining regimes: (a) pretraining on heterogeneous real-world time series from multiple domains, and (b) pretraining on purely synthetic data. We find that both variants yield strong performance, consistently outperforming EEGNet, a widely used convolutional baseline, and CBraMod, the most recent EEG-specific foundation model. These results suggest that generalist time series foundation models, even when pretrained on data of non-neural origin or on synthetic signals, can transfer effectively to EEG. Our findings highlight the promise of leveraging cross-domain pretrained models for brain signal analysis, suggesting that EEG may benefit from advances in the broader time series literature.
[318] Active transfer learning for structural health monitoring
J. Poole, N. Dervilis, K. Worden, P. Gardner, V. Giglioni, R. S. Mills, A. J. Hughes
Main category: cs.LG
TL;DR: A Bayesian framework combining domain adaptation and active learning for structural health monitoring that improves data efficiency by reducing required labeled data through intelligent sampling.
Details
Motivation: Structural health monitoring data is expensive to obtain, especially labeled data. Population-based approaches face distribution differences between structures, and existing methods don't incorporate online learning with newly obtained labels.Method: Proposes Bayesian domain adaptation framework that improves unsupervised mappings using limited labeled target data, integrated with active sampling to select most informative observations for labeling.
Result: The method was evaluated on experimental bridges with multiple damage states and environmental conditions, showing improved data efficiency for classification models in label-scarce scenarios.
Conclusion: Combining transfer learning and active learning can reduce required inspections and operational costs for structural health monitoring while maintaining effective classification performance.
Abstract: Data for training structural health monitoring (SHM) systems are often expensive and/or impractical to obtain, particularly for labelled data. Population-based SHM (PBSHM) aims to address this limitation by leveraging data from multiple structures. However, data from different structures will follow distinct distributions, potentially leading to large generalisation errors for models learnt via conventional machine learning methods. To address this issue, transfer learning – in the form of domain adaptation (DA) – can be used to align the data distributions. Most previous approaches have only considered \emph{unsupervised} DA, where no labelled target data are available; they do not consider how to incorporate these technologies in an online framework – updating as labels are obtained throughout the monitoring campaign. This paper proposes a Bayesian framework for DA in PBSHM, that can improve unsupervised DA mappings using a limited quantity of labelled target data. In addition, this model is integrated into an active sampling strategy to guide inspections to select the most informative observations to label – leading to further reductions in the required labelled data to learn a target classifier. The effectiveness of this methodology is evaluated on a population of experimental bridges. Specifically, this population includes data corresponding to several damage states, as well as, a comprehensive set of environmental conditions. It is found that combining transfer learning and active learning can improve data efficiency when learning classification models in label-scarce scenarios. This result has implications for data-informed operation and maintenance of structures, suggesting a reduction in inspections over the operational lifetime of a structure – and therefore a reduction in operational costs – can be achieved.
[319] TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
Yuxiang Chen, Xiaoming Xu, Pengle Zhang, Michael Beyer, Martin Rapp, Jun Zhu, Jianfei Chen
Main category: cs.LG
TL;DR: TetraJet-v2 is a 4-bit fully-quantized training method using NVFP4 format that addresses weight oscillation and outlier issues, achieving near-lossless LLM training with 51.3% reduced performance gap compared to full-precision training.
Details
Motivation: LLM training is extremely expensive, driving need for low-precision fully-quantized training methods. While 4-bit formats like NVFP4 offer efficiency gains, achieving near-lossless training at such low precision remains challenging due to issues like weight oscillation and outliers.Method: End-to-end 4-bit FQT using NVFP4 for activations, weights, and gradients in all linear layers. Includes: 1) unbiased double-block quantization for NVFP4 linear layers, 2) OsciReset algorithm to suppress weight oscillation, and 3) OutControl algorithm to retain outlier accuracy.
Result: Consistently outperforms prior FP4 training methods on pre-training LLMs across model sizes up to 370M and data sizes up to 200B tokens. Reduces performance gap to full-precision training by an average of 51.3%.
Conclusion: TetraJet-v2 successfully enables efficient 4-bit fully-quantized training for LLMs by addressing key challenges of weight oscillation and outliers, achieving substantial improvements over existing methods while maintaining near-lossless performance.
Abstract: Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers, 2) OsciReset, an algorithm to suppress weight oscillation, and 3) OutControl, an algorithm to retain outlier accuracy. TetraJet-v2 consistently outperforms prior FP4 training methods on pre-training LLMs across varying model sizes up to 370M and data sizes up to 200B tokens, reducing the performance gap to full-precision training by an average of 51.3%.
[320] AstuteRAG-FQA: Task-Aware Retrieval-Augmented Generation Framework for Proprietary Data Challenges in Financial Question Answering
Mohammad Zahangir Alam, Khandoker Ashik Uz Zaman, Mahdi H. Miraz
Main category: cs.LG
TL;DR: AstuteRAG-FQA is an adaptive RAG framework for Financial Question Answering that addresses domain-specific challenges through hybrid retrieval, dynamic prompts, and multi-layered security while maintaining regulatory compliance.
Details
Motivation: RAG shows promise for knowledge-intensive tasks but faces critical challenges in finance: restricted data access, limited retrieval accuracy, regulatory constraints, and sensitive data interpretation.Method: Uses hybrid retrieval strategy with open-source and proprietary data, dynamic prompt framework adapting to query complexity, four-tier task classification system, and multi-layered security mechanisms including differential privacy and compliance monitoring.
Result: The framework systematically addresses diverse financial queries through optimized retrieval and generation processes, with evaluation of three data integration techniques for efficiency across financial environments.
Conclusion: AstuteRAG-FQA provides a comprehensive solution for financial question answering that balances performance with security and regulatory compliance requirements.
Abstract: Retrieval-Augmented Generation (RAG) shows significant promise in knowledge-intensive tasks by improving domain specificity, enhancing temporal relevance, and reducing hallucinations. However, applying RAG to finance encounters critical challenges: restricted access to proprietary datasets, limited retrieval accuracy, regulatory constraints, and sensitive data interpretation. We introduce AstuteRAG-FQA, an adaptive RAG framework tailored for Financial Question Answering (FQA), leveraging task-aware prompt engineering to address these challenges. The framework uses a hybrid retrieval strategy integrating both open-source and proprietary financial data while maintaining strict security protocols and regulatory compliance. A dynamic prompt framework adapts in real time to query complexity, improving precision and contextual relevance. To systematically address diverse financial queries, we propose a four-tier task classification: explicit factual, implicit factual, interpretable rationale, and hidden rationale involving implicit causal reasoning. For each category, we identify key challenges, datasets, and optimization techniques within the retrieval and generation process. The framework incorporates multi-layered security mechanisms including differential privacy, data anonymization, and role-based access controls to protect sensitive financial information. Additionally, AstuteRAG-FQA implements real-time compliance monitoring through automated regulatory validation systems that verify responses against industry standards and legal obligations. We evaluate three data integration techniques - contextual embedding, small model augmentation, and targeted fine-tuning - analyzing their efficiency and feasibility across varied financial environments.
[321] ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling
Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Yizhou Han, Yufeng Lin, Zhihang Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun, Tian Ding
Main category: cs.LG
TL;DR: ORGEval is a graph-theoretic framework for evaluating LLMs’ optimization problem formulation capabilities, using graph isomorphism testing to assess model equivalence more efficiently than solver-based methods.
Details
Motivation: Existing solver-based evaluation methods for LLM-generated optimization problems face issues with inconsistency, infeasibility, and high computational costs, creating a need for more robust evaluation metrics.Method: ORGEval represents optimization models as graphs and reduces equivalence detection to graph isomorphism testing. It uses a tailored Weisfeiler-Lehman test combined with symmetric decomposable detection to evaluate structural equivalence.
Result: ORGEval achieves 100% consistency across random parameter configurations and significantly outperforms solver-based methods in runtime, especially on difficult problems. Using ORGEval, DeepSeek-V3 and Claude-Opus-4 were found to perform best in optimization modeling.
Conclusion: ORGEval provides an efficient and robust framework for evaluating LLMs’ optimization modeling capabilities, revealing that while optimization modeling remains challenging for all LLMs, some models like DeepSeek-V3 and Claude-Opus-4 show superior performance.
Abstract: Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult due to the absence of robust metrics. Existing solver-based approaches often face inconsistency, infeasibility issues, and high computational costs. To address these issues, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs’ capabilities in formulating linear and mixed-integer linear programs. ORGEval represents optimization models as graphs, reducing equivalence detection to graph isomorphism testing. We identify and prove a sufficient condition, when the tested graphs are symmetric decomposable (SD), under which the Weisfeiler-Lehman (WL) test is guaranteed to correctly detect isomorphism. Building on this, ORGEval integrates a tailored variant of the WL-test with an SD detection algorithm to evaluate model equivalence. By focusing on structural equivalence rather than instance-level configurations, ORGEval is robust to numerical variations. Experimental results show that our method can successfully detect model equivalence and produce 100% consistent results across random parameter configurations, while significantly outperforming solver-based methods in runtime, especially on difficult problems. Leveraging ORGEval, we construct the Bench4Opt dataset and benchmark state-of-the-art LLMs on optimization modeling. Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting, outperforming even leading reasoning models.
[322] Panprediction: Optimal Predictions for Any Downstream Task and Loss
Sivaraman Balakrishnan, Nika Haghtalab, Daniel Hsu, Brian Lee, Eric Zhao
Main category: cs.LG
TL;DR: The paper introduces panprediction, a framework that generalizes omniprediction and multi-group learning by enabling models to minimize many losses on many downstream tasks simultaneously, with efficient sample complexity guarantees.
Details
Motivation: To move beyond classical supervised learning where models are trained for fixed loss functions and distributions, and instead develop models that can extract sufficient information from data to perform well on multiple downstream tasks and loss functions.Method: Developed mathematical framework for panprediction, designed algorithms for deterministic and randomized panpredictors with sample complexities of O~(1/ε³) and O~(1/ε²) respectively, using a key technical reduction to step calibration.
Result: Showed that under mild assumptions, simultaneously minimizing infinitely many losses on infinitely many tasks can be statistically as easy as minimizing one loss on one task. Improved deterministic omniprediction sample complexity by factor 1/ε and matched other known guarantees.
Conclusion: Panprediction provides a unified framework that generalizes existing paradigms and demonstrates efficient sample complexity for learning models that perform well across multiple tasks and loss functions simultaneously.
Abstract: Supervised learning is classically formulated as training a model to minimize a fixed loss function over a fixed distribution, or task. However, an emerging paradigm instead views model training as extracting enough information from data so that the model can be used to minimize many losses on many downstream tasks. We formalize a mathematical framework for this paradigm, which we call panprediction, and study its statistical complexity. Formally, panprediction generalizes omniprediction and sits upstream from multi-group learning, which respectively focus on predictions that generalize to many downstream losses or many downstream tasks, but not both. Concretely, we design algorithms that learn deterministic and randomized panpredictors with $\tilde{O}(1/\varepsilon^3)$ and $\tilde{O}(1/\varepsilon^2)$ samples, respectively. Our results demonstrate that under mild assumptions, simultaneously minimizing infinitely many losses on infinitely many tasks can be as statistically easy as minimizing one loss on one task. Along the way, we improve the best known sample complexity guarantee of deterministic omniprediction by a factor of $1/\varepsilon$, and match all other known sample complexity guarantees of omniprediction and multi-group learning. Our key technical ingredient is a nearly lossless reduction from panprediction to a statistically efficient notion of calibration, called step calibration.
[323] Imbalanced Classification through the Lens of Spurious Correlations
Jakob Hackstein, Sidney Bender
Main category: cs.LG
TL;DR: The paper addresses class imbalance in machine learning by viewing it as a condition that amplifies Clever Hans effects, and proposes using counterfactual explanations to identify and eliminate these effects.
Details
Motivation: Class imbalance leads to unreliable classification performance, and existing methods focus on data or loss reweighting without addressing how imbalance amplifies Clever Hans effects through underspecification of minority classes.Method: A counterfactual explanations-based approach using Explainable AI to jointly identify and eliminate Clever Hans effects that emerge under class imbalance conditions.
Result: The method achieves competitive classification performance on three datasets and demonstrates how Clever Hans effects emerge under imbalance.
Conclusion: The paper provides a new perspective on class imbalance by showing how it amplifies Clever Hans effects, which has been largely overlooked by existing approaches.
Abstract: Class imbalance poses a fundamental challenge in machine learning, frequently leading to unreliable classification performance. While prior methods focus on data- or loss-reweighting schemes, we view imbalance as a data condition that amplifies Clever Hans (CH) effects by underspecification of minority classes. In a counterfactual explanations-based approach, we propose to leverage Explainable AI to jointly identify and eliminate CH effects emerging under imbalance. Our method achieves competitive classification performance on three datasets and demonstrates how CH effects emerge under imbalance, a perspective largely overlooked by existing approaches.
[324] Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition
Shuyan Lyu, Zhanzimo Wu, Junliang Du
Main category: cs.LG
TL;DR: The paper proposes a novel layer-wise training approach using deterministic information bottleneck and Rényi’s entropy to train deep CNNs without backpropagation, achieving performance comparable to SGD.
Details
Motivation: Traditional DNN training with global cross-entropy loss and backpropagation is biologically implausible and suffers from memory issues and gradient problems. Layer-wise training offers an alternative but has been limited to small datasets and simple architectures.Method: Systematically analyze CNN training dynamics through information theory, then propose layer-wise training using deterministic information bottleneck and matrix-based Rényi’s entropy. Each layer is trained with an auxiliary classifier connected directly to the output layer.
Result: The approach outperforms existing layer-wise training baselines and achieves performance comparable to SGD on CIFAR-10, CIFAR-100, and traffic sign recognition tasks.
Conclusion: The proposed layer-wise training method based on information bottleneck principles provides a biologically plausible alternative to backpropagation that maintains competitive performance while reducing memory usage and gradient issues.
Abstract: Modern deep neural networks (DNNs) are typically trained with a global cross-entropy loss in a supervised end-to-end manner: neurons need to store their outgoing weights; training alternates between a forward pass (computation) and a top-down backward pass (learning) which is biologically implausible. Alternatively, greedy layer-wise training eliminates the need for cross-entropy loss and backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer-by-layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based R'enyi’s $\alpha$-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs and further demonstrate its applicability to a practical task involving traffic sign recognition. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
[325] Representative Social Choice: From Learning Theory to AI Alignment
Tianyi Qiu
Main category: cs.LG
TL;DR: The paper introduces a representative social choice framework for modeling democratic representation in large-scale collective decisions, connecting social choice theory with statistical learning and proving generalization properties and impossibility theorems.
Details
Motivation: To address scenarios where the number of issues and individuals is too large for mechanisms to consider all preferences directly, which occurs in jury trials, legislation, corporate governance, and language model alignment.Method: Proposes representative social choice where population is represented by finite samples of individual-issue pairs, formulates questions as statistical learning problems, and proves generalization properties using machine learning theory.
Result: Establishes generalization properties of social choice mechanisms using learning theory, formulates axioms for representative social choice, and proves Arrow-like impossibility theorems with new combinatorial analysis tools.
Conclusion: The framework introduces a representative approach to social choice, opening research directions at the intersection of social choice, learning theory, and AI alignment.
Abstract: Social choice theory is the study of preference aggregation across a population, used both in mechanism design for human agents and in the democratic alignment of language models. In this study, we propose the representative social choice framework for the modeling of democratic representation in collective decisions, where the number of issues and individuals are too large for mechanisms to consider all preferences directly. These scenarios are widespread in real-world decision-making processes, such as jury trials, legislation, corporate governance, and, more recently, language model alignment. In representative social choice, the population is represented by a finite sample of individual-issue pairs based on which social choice decisions are made. We show that many of the deepest questions in representative social choice can be formulated as statistical learning problems, and prove the generalization properties of social choice mechanisms using the theory of machine learning. We further formulate axioms for representative social choice, and prove Arrow-like impossibility theorems with new combinatorial tools of analysis. Our framework introduces the representative approach to social choice, opening up research directions at the intersection of social choice, learning theory, and AI alignment.
[326] Training a Generally Curious Agent
Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Sadia Rahman, J Zico Kolter, Jeff Schneider, Ruslan Salakhutdinov
Main category: cs.LG
TL;DR: Paprika is a fine-tuning approach that enables language models to develop general decision-making capabilities for efficient exploration in new environments without additional training.
Details
Motivation: Existing language models often fall short in scenarios requiring strategic information gathering and efficient exploration, which is essential for intelligent systems interacting with their environment.Method: Fine-tuning language models on synthetic interaction data from diverse tasks, using a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential.
Result: Models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training, adapting based on environment feedback in-context.
Conclusion: Paprika provides a promising path towards AI systems that can autonomously solve novel sequential decision-making problems requiring interactions with the external world, with the primary bottleneck being data sampling rather than model updates.
Abstract: Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach’s primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.
[327] Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates
Tien Huu Do, Antoine Masquelier, Nae Eoun Lee, Jonathan Crowther
Main category: cs.LG
TL;DR: A deep learning method using pre-trained language models and attention mechanisms to predict clinical trial patient enrollment with probabilistic uncertainty estimation.
Details
Motivation: Clinical trials require significant investment and planning, making accurate patient enrollment prediction crucial for trial success and efficient resource allocation.Method: Neural network model combining PLM-based document representations with tabular features via attention, enhanced with Gamma distribution probabilistic layer for uncertainty estimation, modeling site-level enrollment as Poisson-Gamma process.
Result: The proposed method effectively predicts patient enrollment across multiple sites and outperforms established baseline models on real-world clinical trial data.
Conclusion: The deep learning approach with probabilistic modeling successfully addresses the critical challenge of clinical trial enrollment prediction, providing both accurate estimates and uncertainty ranges.
Abstract: Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.
[328] Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
Wenhao Deng, Long Wei, Chenglei Yu, Tailin Wu
Main category: cs.LG
TL;DR: RAPO algorithm addresses RLVR’s exploration limitation by replacing reverse KL with forward KL for out-of-distribution exploration and reweighting reference policy for adaptive in-distribution exploration, enabling models to surpass base model performance ceilings.
Details
Motivation: RLVR-trained models show diminishing advantages over base models with increased sampling budget due to reverse KL divergence's mode-seeking behavior that traps policies within base model's support region, limiting exploration.Method: RAPO algorithm uses forward KL penalty instead of reverse KL for out-of-distribution exploration and reweights reference policy for adaptive in-distribution exploration, trained on SimpleRL-Zero dataset without supervised fine-tuning.
Result: RAPO-trained Qwen2.5-3B and 7B models consistently improve problem-solving performance on AIME2024 and AIME2025, surpassing base model performance ceilings and solving previously intractable problems.
Conclusion: RAPO advances RLVR for challenging reasoning tasks by enabling broader yet focused exploration, overcoming the exploration limitations of traditional RLVR approaches.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model’s restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model’s support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model’s performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
[329] A Systematic Literature Review of Spatio-Temporal Graph Neural Network Models for Time Series Forecasting and Classification
Flavio Corradini, Flavio Gerosa, Marco Gori, Carlo Lucheroni, Marco Piangerelli, Martina Zannotti
Main category: cs.LG
TL;DR: This paper presents a systematic literature review of spatio-temporal graph neural networks (GNNs) for time series classification and forecasting, analyzing 366 papers to provide comprehensive overview of models, applications, datasets, and results.
Details
Motivation: To provide a comprehensive overview of spatio-temporal GNNs for time series analysis, capturing dependencies among variables and across time points, and to assist researchers with current state-of-the-art information.Method: Conducted a systematic literature review with database search, selected 366 papers for detailed examination, compared results from current spatio-temporal GNN models across different domains.
Result: Created the broadest systematic literature review in this field, providing detailed comparison of results, available datasets, benchmark models, and links to source code. Also identified current limitations and challenges.
Conclusion: The review offers comprehensive insights into spatio-temporal GNN applications and is complemented by a GitHub repository with interactive tools for further exploration of findings.
Abstract: In recent years, spatio-temporal graph neural networks (GNNs) have attracted considerable interest in the field of time series analysis, due to their ability to capture, at once, dependencies among variables and across time points. The objective of this systematic literature review is hence to provide a comprehensive overview of the various modeling approaches and application domains of GNNs for time series classification and forecasting. A database search was conducted, and 366 papers were selected for a detailed examination of the current state-of-the-art in the field. This examination is intended to offer to the reader a comprehensive review of proposed models, links to related source code, available datasets, benchmark models, and fitting results. All this information is hoped to assist researchers in their studies. To the best of our knowledge, this is the first and broadest systematic literature review presenting a detailed comparison of results from current spatio-temporal GNN models applied to different domains. In its final part, this review discusses current limitations and challenges in the application of spatio-temporal GNNs, such as comparability, reproducibility, explainability, poor information capacity, and scalability. This paper is complemented by a GitHub repository at https://github.com/FlaGer99/SLR-Spatio-Temporal-GNN.git providing additional interactive tools to further explore the presented findings.
[330] Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization
Cheng Tang, Zhishuai Liu, Pan Xu
Main category: cs.LG
TL;DR: Proposes d-RRMDP framework with structured regularization for robust offline RL, develops R2PVI algorithm with linear function approximation, and provides theoretical guarantees and empirical validation.
Details
Motivation: Existing RRMDP methods use unstructured regularization, which can lead to overly conservative policies under unrealistic transitions. Need to introduce latent structures for more realistic robustness.Method: Developed d-rectangular linear RRMDP framework with structured regularization on transition kernels. Proposed R2PVI algorithm using linear function approximation and f-divergence based regularization for offline robust policy learning.
Result: Provided instance-dependent upper bounds on suboptimality gap showing dependence on dataset coverage. Established information-theoretic lower bounds proving near-optimality. Numerical experiments showed R2PVI learns robust policies with superior computational efficiency.
Conclusion: The d-RRMDP framework with structured regularization enables more realistic robust policy learning in offline RL settings, with theoretical guarantees and practical efficiency advantages over baseline methods.
Abstract: The Robust Regularized Markov Decision Process (RRMDP) is proposed to learn policies robust to dynamics shifts by adding regularization to the transition dynamics in the value function. Existing methods mostly use unstructured regularization, potentially leading to conservative policies under unrealistic transitions. To address this limitation, we propose a novel framework, the $d$-rectangular linear RRMDP ($d$-RRMDP), which introduces latent structures into both transition kernels and regularization. We focus on offline reinforcement learning, where an agent learns policies from a precollected dataset in the nominal environment. We develop the Robust Regularized Pessimistic Value Iteration (R2PVI) algorithm that employs linear function approximation for robust policy learning in $d$-RRMDPs with $f$-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, demonstrating that these bounds are influenced by how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. We establish information-theoretic lower bounds to verify that our algorithm is near-optimal. Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods.
[331] Hybrid Decentralized Optimization: Leveraging Both First- and Zeroth-Order Optimizers for Faster Convergence
Matin Ansaripour, Shayan Talaei, Giorgi Nadiradze, Dan Alistarh
Main category: cs.LG
TL;DR: This paper introduces hybrid decentralized optimization where nodes with zeroth-order and first-order optimization capabilities coexist and jointly solve optimization tasks, showing that such systems can benefit from integrating zeroth-order agents rather than ignoring them.
Details
Motivation: There are settings where computationally-bounded nodes cannot implement first-order gradient-based optimization but could still contribute to joint optimization tasks, creating a need for hybrid optimization approaches.Method: The authors develop a new analysis of distributed optimization with noisy and possibly-biased gradient estimators, studying systems where zeroth-order and first-order optimization nodes co-exist and collaborate.
Result: The system can withstand noisier zeroth-order agents and even benefit from integrating them into the optimization process, with results holding for both convex and non-convex objectives.
Conclusion: Hybrid first-zeroth order optimization is practical and beneficial, even for training deep neural networks, as demonstrated by experimental results on standard optimization tasks.
Abstract: Distributed optimization is the standard way of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods. Yet, there are settings where some computationally-bounded nodes may not be able to implement first-order, gradient-based optimization, while they could still contribute to joint optimization tasks. In this paper, we initiate the study of hybrid decentralized optimization, studying settings where nodes with zeroth-order and first-order optimization capabilities co-exist in a distributed system, and attempt to jointly solve an optimization task over some data distribution. We essentially show that, under reasonable parameter settings, such a system can not only withstand noisier zeroth-order agents but can even benefit from integrating such agents into the optimization process, rather than ignoring their information. At the core of our approach is a new analysis of distributed optimization with noisy and possibly-biased gradient estimators, which may be of independent interest. Our results hold for both convex and non-convex objectives. Experimental results on standard optimization tasks confirm our analysis, showing that hybrid first-zeroth order optimization can be practical, even when training deep neural networks.
[332] Accelerated Rates between Stochastic and Adversarial Online Convex Optimization
Sarah Sachs, Hedi Hadiji, Tim van Erven, Cristobal Guzman
Main category: cs.LG
TL;DR: This paper establishes novel regret bounds for online convex optimization that interpolate between stochastic i.i.d. and fully adversarial settings, exploiting smoothness to replace maximum gradient dependence with gradient variance.
Details
Motivation: To bridge the theoretical gap between purely stochastic and fully adversarial online learning settings, as many real-world optimization tasks fall between these extremes.Method: By exploiting smoothness of expected losses, the method replaces dependence on maximum gradient length with gradient variance, allowing for adversarially poisoned rounds while weakening the i.i.d. assumption.
Result: The paper achieves tight regret bounds that match stochastic acceleration rates in i.i.d. cases, gracefully deteriorate to minimax regret in adversarial cases, and are optimal for all intermediate regimes.
Conclusion: The work provides a unified framework for online convex optimization that handles interpolations between stochastic and adversarial settings with tight bounds, recovering optimal rates in extreme cases and proving tightness for intermediate regimes.
Abstract: Stochastic and adversarial data are two widely studied settings in online learning. But many optimization tasks are neither i.i.d. nor fully adversarial, which makes it of fundamental interest to get a better theoretical understanding of the world between these extremes. In this work we establish novel regret bounds for online convex optimization in a setting that interpolates between stochastic i.i.d. and fully adversarial losses. By exploiting smoothness of the expected losses, these bounds replace a dependence on the maximum gradient length by the variance of the gradients, which was previously known only for linear losses. In addition, they weaken the i.i.d. assumption by allowing, for example, adversarially poisoned rounds, which were previously considered in the related expert and bandit settings. In the fully i.i.d. case, our regret bounds match the rates one would expect from results in stochastic acceleration, and we also recover the optimal stochastically accelerated rates via online-to-batch conversion. In the fully adversarial case our bounds gracefully deteriorate to match the minimax regret. We further provide lower bounds showing that our regret upper bounds are tight for all intermediate regimes in terms of the stochastic variance and the adversarial variation of the loss gradients.
[333] Decoding Virtual Healthcare Success through Knowledge-Aware and Multimodal Predictive Modeling
Shuang Geng, Wenli Zhang, Jiaheng Xie, Gemin Liang, Ben Niu, Sudha Ram
Main category: cs.LG
TL;DR: This paper develops a multimodal predictive model that integrates diverse data sources and dynamically constructed knowledge networks to predict the success of online healthcare consultations, addressing challenges of data sparsity and fragmented patient journeys.
Details
Motivation: Online healthcare consultations face challenges in predicting success due to fragmented patient care journeys, lack of integration between virtual and traditional systems, and sparse/incomplete data from online platforms.Method: A predictive modeling approach that fuses multimodal data (textual conversations, interaction sequences, behavioral traces) and dynamically constructed knowledge networks to capture latent relationships among patients, physicians, and consultation contexts.
Result: The model enhances accuracy and interpretability of consultation success prediction by integrating heterogeneous information sources and uncovering the evolving structure of digital interactions.
Conclusion: The findings offer implications for designing hybrid healthcare ecosystems that combine online and offline services through data-driven intelligence.
Abstract: Online healthcare consultations have transformed how patients seek medical advice, offering convenience while introducing new challenges for ensuring consultation success. Predicting whether an online consultation will be successful is critical for improving patient experiences and sustaining platform competitiveness. Yet, such prediction is inherently difficult due to the fragmented nature of patients’ care journeys and the lack of integration between virtual and traditional healthcare systems. Furthermore, the data collected from online platforms, including textual conversations, interaction sequences, and behavioral traces, are often sparse and incomplete. This study develops a predictive modeling approach that fuses multimodal data and dynamically constructed knowledge networks to capture latent relationships among patients, physicians, and consultation contexts. By integrating heterogeneous information sources and uncovering the evolving structure of digital interactions, the model enhances the accuracy and interpretability of consultation success prediction. The findings offer implications for designing hybrid healthcare ecosystems that combine online and offline services through data-driven intelligence.
[334] Scaling Tractable Probabilistic Circuits: A Systems Perspective
Anji Liu, Kareem Ahmed, Guy Van den Broeck
Main category: cs.LG
TL;DR: PyJuice is a GPU implementation design for Probabilistic Circuits that achieves 1-2 orders of magnitude faster training and 2-5x less GPU memory usage compared to existing systems.
Details
Motivation: Existing PC implementations suffer from time and memory inefficiency that hinders scaling up, despite recent advancements in modeling and training.Method: PyJuice uses a compilation process that converts PCs into compact representations amenable to efficient block-based parallelization, reducing IO and leveraging modern GPU Tensor Cores.
Result: Empirical evaluation shows PyJuice improves state-of-the-art PCs on image and language datasets, and enables training much larger models with more epochs.
Conclusion: PyJuice establishes new baselines for PC performance and enables future scaling of probabilistic circuits for complex real-world tasks.
Abstract: Probabilistic Circuits (PCs) are a general framework for tractable deep generative models, which support exact and efficient probabilistic inference on their learned distributions. Recent modeling and training advancements have enabled their application to complex real-world tasks. However, the time and memory inefficiency of existing PC implementations hinders further scaling up. This paper proposes PyJuice, a general GPU implementation design for PCs that improves prior art in several regards. Specifically, PyJuice is 1-2 orders of magnitude faster than existing systems (including very recent ones) at training large-scale PCs. Moreover, PyJuice consumes 2-5x less GPU memory, which enables us to train larger models. At the core of our system is a compilation process that converts a PC into a compact representation amenable to efficient block-based parallelization, which significantly reduces IO and makes it possible to leverage Tensor Cores available in modern GPUs. Empirically, PyJuice can be used to improve state-of-the-art PCs trained on image (e.g., ImageNet32) and language (e.g., WikiText, CommonGen) datasets. We further establish a new set of baselines on natural image and language datasets by benchmarking existing PC structures but with much larger sizes and more training epochs, with the hope of incentivizing future research. Code is available at https://github.com/Tractables/pyjuice.
[335] Fast Adversarial Training against Sparse Attacks Requires Loss Smoothing
Xuyang Zhong, Yixiao Huang, Chen Liu
Main category: cs.LG
TL;DR: This paper addresses catastrophic overfitting in fast adversarial training against sparse l₀-bounded perturbations by proposing Fast-LS-l₀, which uses soft labels and a trade-off loss function to smooth the adversarial loss landscape.
Details
Motivation: The study is motivated by the challenges of using 1-step attacks for l₀-bounded adversarial training, including degraded performance and catastrophic overfitting caused by sub-optimal perturbation locations.Method: The authors propose Fast-LS-l₀, which incorporates soft labels and a trade-off loss function to smooth the adversarial loss landscape and address the craggy nature of l₀ adversarial training.
Result: Extensive experiments show that Fast-LS-l₀ overcomes catastrophic overfitting, achieves state-of-the-art performance, and narrows the performance gap between 1-step and multi-step adversarial training against sparse attacks.
Conclusion: The proposed method successfully addresses the challenges of l₀ adversarial training by smoothing the loss landscape, preventing catastrophic overfitting, and improving performance compared to existing approaches.
Abstract: This paper studies fast adversarial training against sparse adversarial perturbations bounded by $l_0$ norm. We demonstrate the challenges of employing $1$-step attacks on $l_0$ bounded perturbations for fast adversarial training, including degraded performance and the occurrence of catastrophic overfitting (CO). We highlight that CO in $l_0$ adversarial training is caused by sub-optimal perturbation locations of $1$-step attack. Theoretical and empirical analyses reveal that the loss landscape of $l_0$ adversarial training is more craggy compared to its $l_\infty$, $l_2$ and $l_1$ counterparts. Moreover, we corroborate that the craggy loss landscape can aggravate CO. To address these issues, we propose Fast-LS-$l_0$ that incorporates soft labels and the trade-off loss function to smooth the adversarial loss landscape. Extensive experiments demonstrate our method can overcome the challenge of catastrophic overfitting, achieve state-of-the-art performance, and narrow down the performance gap between $1$-step and multi-step adversarial training against sparse attacks.
[336] Reevaluating Theoretical Analysis Methods for Optimization in Deep Learning
Hoang Tran, Qinzi Zhang, Ashok Cutkosky
Main category: cs.LG
TL;DR: This paper develops empirical metrics to evaluate how well standard optimization analyses explain modern deep learning algorithms, finding that smoothness-based analyses often fail while convex-optimization identities frequently hold despite non-convexity.
Details
Motivation: There's a significant gap between theoretical optimization guarantees and practical deep learning performance, with theoretical assumptions often chosen for analytical convenience rather than empirical accuracy.Method: Developed new empirical metrics to compare real optimization behavior with analytically predicted behavior, verifying both high-level assumptions and low-level identities used in optimization analyses.
Result: Smoothness-based analyses fail in practice under most scenarios, but key identities from convex-optimization analyses often hold in practice despite the objective’s global non-convexity.
Conclusion: The paper provides empirical evidence that convex-optimization identities are more robust than smoothness assumptions for explaining modern optimization behavior in deep learning.
Abstract: There is a significant gap between our theoretical understanding of optimization algorithms used in deep learning and their practical performance. Theoretical development usually focuses on proving convergence guarantees under a variety of different assumptions, which are themselves often chosen based on a rough combination of intuitive match to practice and analytical convenience. In this paper, we carefully measure the degree to which the standard optimization analyses are capable of explaining modern algorithms. To do this, we develop new empirical metrics that compare real optimization behavior with analytically predicted behavior. Our investigation is notable for its tight integration with modern optimization analysis: rather than simply checking high-level assumptions made in the analysis (e.g. smoothness), we also verify key low-level identities used by the analysis to explain optimization behavior that might hold even if the high-level motivating assumptions do not. Notably, we find that smoothness-based analyses fail in practice under most scenarios, but the key identities commonly used in convex-optimization analyses often hold in practice despite the objective’s global non-convexity.
[337] Convergence of continuous-time stochastic gradient descent with applications to deep neural networks
Gabor Lugosi, Eulalia Nualart
Main category: cs.LG
TL;DR: Continuous-time approximation of SGD for population loss minimization, with convergence conditions extending previous gradient descent results, applied to overparametrized neural networks.
Details
Motivation: To analyze the convergence properties of stochastic gradient descent (SGD) in continuous-time approximation for population expected loss minimization, building on previous work for non-stochastic gradient descent.Method: Develops a continuous-time approximation framework for SGD process, establishes general sufficient conditions for convergence, and applies the theoretical results to overparametrized neural network training scenarios.
Result: Establishes convergence conditions for SGD in continuous-time approximation that extend Chatterjee’s (2022) results for non-stochastic gradient descent, demonstrating applicability to overparametrized neural networks.
Conclusion: The continuous-time approximation provides a theoretical framework for understanding SGD convergence, with results that generalize previous gradient descent analysis and have practical implications for neural network training.
Abstract: We study a continuous-time approximation of the stochastic gradient descent process for minimizing the population expected loss in learning problems. The main results establish general sufficient conditions for the convergence, extending the results of Chatterjee (2022) established for (nonstochastic) gradient descent. We show how the main result can be applied to the case of overparametrized neural network training.
[338] Swing-by Dynamics in Concept Learning and Compositional Generalization
Yongyi Yang, Core Francisco Park, Ekdeep Singh Lubana, Maya Okawa, Wei Hu, Hidenori Tanaka
Main category: cs.LG
TL;DR: The paper proposes a structured identity mapping (SIM) task as a theoretical abstraction to analyze compositional generalization in diffusion models, showing it captures key empirical observations and reveals new insights like non-monotonic learning dynamics.
Details
Motivation: To provide theoretical characterization of empirical findings from prior work on compositional generalization in text-conditioned diffusion models, which showed sequential learning respecting compositional hierarchy and concept-centric learning dynamics.Method: Introduces a structured identity mapping (SIM) task using Gaussian mixture with organized centroids, mathematically analyzes neural network learning dynamics on this simplified task, and validates predictions with text-conditioned diffusion models.
Result: The SIM task successfully captures and explains key empirical observations from prior work, reveals novel mechanisms like non-monotonic test loss dynamics in early training, and bridges simplified theoretical framework with complex generative models.
Conclusion: The SIM task establishes a meaningful theoretical abstraction for understanding concept learning dynamics in modern generative models, providing both explanatory power for existing observations and new theoretical insights.
Abstract: Prior work has shown that text-conditioned diffusion models can learn to identify and manipulate primitive concepts underlying a compositional data-generating process, enabling generalization to entirely novel, out-of-distribution compositions. Beyond performance evaluations, these studies develop a rich empirical phenomenology of learning dynamics, showing that models generalize sequentially, respecting the compositional hierarchy of the data-generating process. Moreover, concept-centric structures within the data significantly influence a model’s speed of learning the ability to manipulate a concept. In this paper, we aim to better characterize these empirical results from a theoretical standpoint. Specifically, we propose an abstraction of prior work’s compositional generalization problem by introducing a structured identity mapping (SIM) task, where a model is trained to learn the identity mapping on a Gaussian mixture with structurally organized centroids. We mathematically analyze the learning dynamics of neural networks trained on this SIM task and show that, despite its simplicity, SIM’s learning dynamics capture and help explain key empirical observations on compositional generalization with diffusion models identified in prior work. Our theory also offers several new insights – e.g., we find a novel mechanism for non-monotonic learning dynamics of test loss in early phases of training. We validate our new predictions by training a text-conditioned diffusion model, bridging our simplified framework and complex generative models. Overall, this work establishes the SIM task as a meaningful theoretical abstraction of concept learning dynamics in modern generative models.
[339] RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm
Yongyi Yang, Jianyang Gao, Wei Hu
Main category: cs.LG
TL;DR: RaanA is a unified post-training quantization framework for LLMs that uses RaBitQ-H for efficient quantization and AllocateBits for optimal bit-width allocation across layers, achieving competitive performance with minimal calibration data.
Details
Motivation: Existing PTQ methods have limitations including heavy calibration data requirements and inflexible bit allocation choices, which RaanA aims to overcome.Method: RaanA combines two novel components: RaBitQ-H (a randomized vector quantization variant for fast and accurate quantization) and AllocateBits (an algorithm for optimal bit-width allocation across layers based on quantization sensitivity).
Result: RaanA achieves competitive performance with state-of-the-art quantization methods while being extremely fast, requiring minimal calibration data, and enabling flexible bit allocation.
Conclusion: RaanA effectively balances efficiency and accuracy in LLM quantization, providing a practical solution for inference optimization with minimal calibration overhead.
Abstract: Post-training Quantization (PTQ) has become a widely used technique for improving inference efficiency of large language models (LLMs). However, existing PTQ methods generally suffer from crucial limitations such as heavy calibration data requirements and inflexible choice of target number of bits. In this paper, we propose RaanA, a unified PTQ framework that overcomes these challenges by introducing two novel components: 1) RaBitQ-H, a variant of a randomized vector quantization method RaBitQ, designed for fast, accurate, and highly efficient quantization; and 2) AllocateBits, an algorithm that optimally allocates bit-widths across layers based on their quantization sensitivity. RaanA achieves competitive performance with state-of-the-art quantization methods while being extremely fast, requiring minimal calibration data, and enabling flexible bit allocation. Extensive experiments demonstrate RaanA’s efficacy in balancing efficiency and accuracy. The code is publicly available at https://github.com/FFTYYY/RaanA .
[340] DeepOSets: Non-Autoregressive In-Context Learning with Permutation-Invariance Inductive Bias
Shao-Ting Chiu, Junyuan Hong, Ulisses Braga-Neto
Main category: cs.LG
TL;DR: DeepOSets architecture demonstrates in-context learning capabilities without attention mechanisms, combining DeepSets and DeepONets for permutation-invariant regression tasks with fewer parameters than transformers.
Details
Motivation: To challenge the assumption that in-context learning (ICL) is exclusive to attention-based transformers and show that ICL can emerge in non-autoregressive architectures with permutation-invariance inductive bias.Method: Proposed DeepOSets architecture combining DeepSets’ set learning with DeepONets’ operator learning, proved universal approximation theorem, and tested on linear, polynomial, and neural network regression tasks with varying noise, dimensions, and sample sizes.
Result: DeepOSets achieved accurate and fast results with an order of magnitude fewer parameters than comparable transformer alternatives, with enhanced accuracy in high-dimensional settings using Set Transformers.
Conclusion: ICL is not exclusive to attention-based architectures and can emerge in permutation-invariant neural architectures like DeepOSets, offering efficient alternatives to transformers for in-context learning tasks.
Abstract: In-context learning (ICL) is the remarkable ability displayed by some machine learning models to learn from examples provided in a user prompt without any model parameter updates. ICL was first observed in the domain of large language models, and it has been widely assumed that it is a product of the attention mechanism in autoregressive transformers. In this paper, using stylized regression learning tasks, we demonstrate that ICL can emerge in a non-autoregressive neural architecture with a hard-coded permutation-invariance inductive bias. This novel architecture, called DeepOSets, combines the set learning properties of the DeepSets architecture with the operator learning capabilities of Deep Operator Networks (DeepONets). We provide a representation theorem for permutation-invariant regression learning operators and prove that DeepOSets are universal approximators of this class of operators. We performed comprehensive numerical experiments to evaluate the capabilities of DeepOSets in learning linear, polynomial, and shallow neural network regression, under varying noise levels, dimensionalities, and sample sizes. In the high-dimensional regime, accuracy was enhanced by replacing the DeepSets layer with a Set Transformer. Our results show that DeepOSets deliver accurate and fast results with an order of magnitude fewer parameters than a comparable transformer-based alternative.
[341] DualOptim: Enhancing Efficacy and Stability in Machine Unlearning with Dual Optimizers
Xuyang Zhong, Haochen Luo, Chen Liu
Main category: cs.LG
TL;DR: DualOptim is a new machine unlearning method that addresses hyperparameter sensitivity in existing approaches by incorporating adaptive learning rate and decoupled momentum factors, improving stability and performance across various tasks.
Details
Motivation: Existing machine unlearning methods are highly sensitive to hyperparameters, requiring extensive tuning that limits practical deployment, and exhibit instability and suboptimal performance in different scenarios.Method: Proposed Dual Optimizer (DualOptim) with adaptive learning rate and decoupled momentum factors to achieve stable and effective unlearning.
Result: Empirical and theoretical evidence shows DualOptim contributes to effective and stable unlearning, significantly boosting MU efficacy and stability across diverse tasks including image classification, image generation, and large language models.
Conclusion: DualOptim is a versatile approach that can empower existing machine unlearning algorithms by addressing their hyperparameter sensitivity and stability issues.
Abstract: Existing machine unlearning (MU) approaches exhibit significant sensitivity to hyperparameters, requiring meticulous tuning that limits practical deployment. In this work, we first empirically demonstrate the instability and suboptimal performance of existing popular MU methods when deployed in different scenarios. To address this issue, we propose Dual Optimizer (DualOptim), which incorporates adaptive learning rate and decoupled momentum factors. Empirical and theoretical evidence demonstrates that DualOptim contributes to effective and stable unlearning. Through extensive experiments, we show that DualOptim can significantly boost MU efficacy and stability across diverse tasks, including image classification, image generation, and large language models, making it a versatile approach to empower existing MU algorithms.
[342] AERO: Entropy-Guided Framework for Private LLM Inference
Nandan Kumar Jha, Brandon Reagen
Main category: cs.LG
TL;DR: AERO framework strategically removes nonlinear operations from transformers to reduce privacy-preserving computation overheads while maintaining performance.
Details
Motivation: Privacy-preserving computation for language models suffers from high latency and communication costs due to nonlinear functions, but removing nonlinearities causes entropy collapse in deep layers and entropic overload in early layers.Method: AERO uses an entropy-guided framework with adaptive recalibration through head-wise entropy regularizer with learnable per-head strengths, penalizing extreme entropies and fostering functional diversity.
Result: AERO achieves 3.4× communication savings and 1.4× latency reduction without performance penalty.
Conclusion: AERO successfully addresses the challenges of nonlinearity removal in transformers for privacy-preserving computation while maintaining model performance.
Abstract: Privacy-preserving computation enables language model inference directly on encrypted data yet suffers from prohibitive latency and communication overheads, primarily due to nonlinear functions. Removing nonlinearities, however, can trigger one of two failure modes restricting the potential for nonlinearity removal: entropy collapse in deeper layers, which destabilizes training, and entropic overload in early layers, causing under-utilization of attention heads. To address these challenges, we introduce AERO, an entropy-guided framework to strategically eliminates costly nonlinear operations from transformer architectures, which employs an adaptive recalibration through a head-wise entropy regularizer with learnable per-head strengths, enabling each head to adjust its entropy level while penalizing extreme entropies and fostering functional diversity through a tolerance margin. Experiments show AERO can save 3.4$\times$ communication and 1.4$\times$ latency, without any performance penalty.
[343] Transformers as Implicit State Estimators: In-Context Learning in Dynamical Systems
Usman Akram, Haris Vikalo
Main category: cs.LG
TL;DR: Transformers with in-context learning can predict outputs of dynamical systems without explicit system knowledge or test-time updates, matching Kalman filter performance for linear systems and approaching EKF/PF for nonlinear systems.
Details
Motivation: Traditional methods like Kalman filters are optimal for linear systems but require explicit system models, while nonlinear systems rely on suboptimal heuristics. There's a need for flexible, model-free approaches for dynamical system prediction.Method: Use frozen transformers in an in-context learning setting, providing short context of past input-output pairs and optionally system parameters, to predict current outputs without gradient updates or explicit system knowledge.
Result: In linear-Gaussian regimes, predictions match Kalman filter performance; in nonlinear regimes, performance approaches EKF and particle filtering. The method shows graceful degradation when parameters are withheld, demonstrating robustness and implicit parameter inference.
Conclusion: Transformer in-context learning provides a flexible, non-parametric alternative for dynamical system output prediction through implicit latent-state estimation, without requiring explicit system models or test-time optimization.
Abstract: Predicting the behavior of a dynamical system from noisy observations of its past outputs is a classical problem encountered across engineering and science. For linear systems with Gaussian inputs, the Kalman filter – the best linear minimum mean-square error estimator of the state trajectory – is optimal in the Bayesian sense. For nonlinear systems, Bayesian filtering is typically approached using suboptimal heuristics such as the Extended Kalman Filter (EKF), or numerical methods such as particle filtering (PF). In this work, we show that transformers, employed in an in-context learning (ICL) setting, can implicitly infer hidden states in order to predict the outputs of a wide family of dynamical systems, without test-time gradient updates or explicit knowledge of the system model. Specifically, when provided with a short context of past input-output pairs and, optionally, system parameters, a frozen transformer accurately predicts the current output. In linear-Gaussian regimes, its predictions closely match those of the Kalman filter; in nonlinear regimes, its performance approaches that of EKF and PF. Moreover, prediction accuracy degrades gracefully when key parameters, such as the state-transition matrix, are withheld from the context, demonstrating robustness and implicit parameter inference. These findings suggest that transformer in-context learning provides a flexible, non-parametric alternative for output prediction in dynamical systems, grounded in implicit latent-state estimation.
[344] Fair Play for Individuals, Foul Play for Groups? Auditing Anonymization’s Impact on ML Fairness
Héber H. Arcolezi, Mina Alishahi, Adda-Akram Bendoukha, Nesrine Kaaniche
Main category: cs.LG
TL;DR: Anonymization techniques like k-anonymity, l-diversity, and t-closeness can degrade group fairness in ML models by up to 4x while improving individual fairness due to increased input homogeneity.
Details
Motivation: ML algorithms often use sensitive data, raising privacy concerns. While anonymization addresses privacy, its specific effects on ML fairness remain largely unexplored, particularly how different anonymization techniques impact both individual and group fairness.Method: Systematically audit the impact of anonymization techniques on ML fairness by evaluating both individual and group fairness metrics across varying levels of anonymization, diverse privacy settings, and data distributions.
Result: Anonymization degrades group fairness metrics by up to fourfold, while similarity-based individual fairness metrics tend to improve under stronger anonymization due to increased input homogeneity.
Conclusion: The study reveals critical trade-offs between privacy, fairness, and utility in ML systems, providing actionable guidelines for responsible AI development when using anonymization techniques.
Abstract: Machine learning (ML) algorithms are heavily based on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to fourfold. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
[345] Model Inversion Attacks: A Survey of Approaches and Countermeasures
Zhanke Zhou, Jianing Zhu, Fengfei Yu, Xuan Li, Xiong Peng, Tongliang Liu, Bo Han
Main category: cs.LG
TL;DR: This paper provides a comprehensive survey of model inversion attacks (MIAs) across different domains, summarizing attack methods, defenses, and future research directions.
Details
Motivation: There are growing concerns about privacy leakage in neural networks, particularly through model inversion attacks that extract sensitive training data features. Despite the significance of this threat, there's a lack of systematic studies covering MIAs comprehensively across different domains.Method: The authors conduct a systematic survey that summarizes up-to-date MIA methods in both attacks and defenses, analyzing their contributions, limitations, underlying modeling principles, and optimization challenges.
Result: The survey bridges the literature gap by providing a comprehensive overview of MIAs across images, texts, and graphs, highlighting the vulnerability of neural networks to privacy attacks.
Conclusion: This survey facilitates future research in model inversion attacks and defenses, with the authors maintaining a repository to track relevant research in this critical area of privacy protection.
Abstract: The success of deep neural networks has driven numerous research studies and applications from Euclidean to non-Euclidean data. However, there are increasing concerns about privacy leakage, as these networks rely on processing private data. Recently, a new type of privacy attack, the model inversion attacks (MIAs), aims to extract sensitive features of private data for training by abusing access to a well-trained model. The effectiveness of MIAs has been demonstrated in various domains, including images, texts, and graphs. These attacks highlight the vulnerability of neural networks and raise awareness about the risk of privacy leakage within the research community. Despite the significance, there is a lack of systematic studies that provide a comprehensive overview and deeper insights into MIAs across different domains. This survey aims to summarize up-to-date MIA methods in both attacks and defenses, highlighting their contributions and limitations, underlying modeling principles, optimization challenges, and future directions. We hope this survey bridges the gap in the literature and facilitates future research in this critical area. Besides, we are maintaining a repository to keep track of relevant research at https://github.com/AndrewZhou924/Awesome-model-inversion-attack.
[346] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen
Main category: cs.LG
TL;DR: This paper introduces two key contributions to enhance attention efficiency: FP4 attention for inference acceleration using Blackwell GPU Tensor Cores (5x speedup over FlashAttention), and 8-bit attention for training tasks that achieves lossless fine-tuning performance but slower pretraining convergence.
Details
Motivation: The quadratic time complexity of attention mechanisms limits efficiency, and while existing low-bit attention methods focus only on inference, training large models also requires efficiency improvements.Method: Two approaches: 1) Leverage FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation for inference, 2) Design accurate 8-bit attention for both forward and backward propagation in training tasks.
Result: FP4 attention achieves 1038 TOPS on RTX5090 (5x speedup over FlashAttention) and accelerates various models plug-and-play. 8-bit attention achieves lossless performance in fine-tuning but slower convergence in pretraining.
Conclusion: Low-bit attention can effectively accelerate both inference and training tasks, with FP4 excelling in inference and 8-bit working well for fine-tuning, though pretraining convergence needs improvement.
Abstract: The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
[347] Resource-Adaptive Successive Doubling for Hyperparameter Optimization with Large Datasets on High-Performance Computing Systems
Marcel Aach, Rakesh Sarma, Helmut Neukirchen, Morris Riedel, Andreas Lintermann
Main category: cs.LG
TL;DR: RASDA is a novel hyperparameter optimization algorithm that combines resource-adaptive successive doubling with ASHA, achieving up to 1.9x speedup while maintaining or improving model quality on large-scale HPC systems.
Details
Motivation: To speed up hyperparameter optimization on HPC systems by efficiently allocating computational resources (GPUs) to promising configurations through data-parallel training, enabling systematic HPO on terabyte-scale scientific datasets where full training runs are usually infeasible.Method: Combines a resource-adaptive successive doubling scheme with Asynchronous Successive Halving Algorithm (ASHA), using number of workers as a resource to allocate more GPUs to promising configurations via data-parallel training.
Result: RASDA outperforms ASHA by up to 1.9x in runtime while maintaining or surpassing ASHA’s solution quality. Successfully applied to terabyte-scale scientific datasets from CV, CFD, and AM domains using up to 1,024 GPUs.
Conclusion: RASDA enables efficient hyperparameter optimization of complex models on massive scientific data, making systematic HPO feasible for terabyte-scale datasets for the first time.
Abstract: On High-Performance Computing (HPC) systems, several hyperparameter configurations can be evaluated in parallel to speed up the Hyperparameter Optimization (HPO) process. State-of-the-art HPO methods follow a bandit-based approach and build on top of successive halving, where the final performance of a combination is estimated based on a lower than fully trained fidelity performance metric and more promising combinations are assigned more resources over time. Frequently, the number of epochs is treated as a resource, letting more promising combinations train longer. Another option is to use the number of workers as a resource and directly allocate more workers to more promising configurations via data-parallel training. This article proposes a novel Resource-Adaptive Successive Doubling Algorithm (RASDA), which combines a resource-adaptive successive doubling scheme with the plain Asynchronous Successive Halving Algorithm (ASHA). Scalability of this approach is shown on up to 1,024 Graphics Processing Units (GPUs) on modern HPC systems. It is applied to different types of Neural Networks (NNs) and trained on large datasets from the Computer Vision (CV), Computational Fluid Dynamics (CFD), and Additive Manufacturing (AM) domains, where performing more than one full training run is usually infeasible. Empirical results show that RASDA outperforms ASHA by a factor of up to 1.9 with respect to the runtime. At the same time, the solution quality of final ASHA models is maintained or even surpassed by the implicit batch size scheduling of RASDA. With RASDA, systematic HPO is applied to a terabyte-scale scientific dataset for the first time in the literature, enabling efficient optimization of complex models on massive scientific data. The implementation of RASDA is available on https://github.com/olympiquemarcel/rasda
[348] Scaling Diffusion Transformers Efficiently via $μ$P
Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li
Main category: cs.LG
TL;DR: This paper extends Maximal Update Parametrization (μP) from vanilla Transformers to diffusion Transformers, enabling stable hyperparameter transfer from small to large models and significantly reducing tuning costs.
Details
Motivation: Diffusion Transformers are foundational for vision generative models but face scalability limitations due to expensive hyperparameter tuning at large scales. While μP works well for vanilla Transformers, it was unclear if it extends to diffusion Transformers which have different architectures and objectives.Method: The authors generalize standard μP to diffusion Transformers (U-ViT, DiT, PixArt-α, MMDiT) and prove that their μP aligns with vanilla Transformers. They systematically validate HP transferability through large-scale experiments.
Result: DiT-XL-2-μP with transferred learning rate achieves 2.9× faster convergence than original DiT-XL-2. Scaling PixArt-α (0.04B→0.61B) and MMDiT (0.18B→18B) under μP outperforms baselines while requiring only 5.5% and 3% of tuning costs respectively.
Conclusion: μP provides a principled and efficient framework for scaling diffusion Transformers, enabling stable hyperparameter transfer and dramatically reducing tuning costs while maintaining or improving performance.
Abstract: Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.
[349] Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased
Nathan Phelps, Daniel J. Lizotte, Douglas G. Woolford
Main category: cs.LG
TL;DR: Analytical calibration of random forests for imbalanced classification causes problematic prevalence estimates that depend on sampling rates and tree parameters, and reveals decision trees can be biased toward minority classes.
Details
Motivation: Address the bias introduced by subsampling majority classes in imbalanced binary classification, where standard analytical calibration methods for random forests produce unreliable results.Method: Analyze the effects of analytical calibration on random forests trained on subsampled data, examining how sampling rates and number of predictors per split affect prevalence estimates.
Result: Analytical calibration creates prevalence estimates that depend on both sampling rate and number of predictors considered at each split. Surprisingly, decision trees can exhibit bias toward minority classes rather than majority classes.
Conclusion: Standard analytical calibration approaches for random forests in imbalanced classification are problematic and the common assumption about decision tree bias toward majority classes may be incorrect.
Abstract: When using machine learning for imbalanced binary classification problems, it is common to subsample the majority class to create a (more) balanced training dataset. This biases the model’s predictions because the model learns from data whose data generating process differs from new data. One way of accounting for this bias is analytically mapping the resulting predictions to new values based on the sampling rate for the majority class. We show that calibrating a random forest this way has negative consequences, including prevalence estimates that depend on both the number of predictors considered at each split in the random forest and the sampling rate used. We explain the former using known properties of random forests and analytical calibration. Through investigating the latter issue, we made a surprising discovery - contrary to the widespread belief that decision trees are biased towards the majority class, they actually can be biased towards the minority class.
[350] GAIA: A Foundation Model for Operational Atmospheric Dynamics
Ata Akbari Asanjan, Olivia Alexander, Tom Berg, Stephen Peng, Jad Makki, Clara Zhang, Matt Yang, Disha Shidham, Srija Chakraborty, William Bender, Cara Crawford, Arun Ravindran, Olivier Raiman, David Potere, David Bell
Main category: cs.LG
TL;DR: GAIA is a hybrid self-supervised geospatial foundation model that combines Masked Autoencoders (MAE) with DINO to create semantically rich representations from global satellite imagery, demonstrating superior performance on atmospheric tasks.
Details
Motivation: To develop more transferable representations for atmospheric modeling by combining complementary self-supervised objectives rather than relying on single approaches like MAE alone.Method: Hybrid self-supervised learning fusing Masked Autoencoders (MAE) with self-distillation with no labels (DINO), pre-trained on 15 years of global geostationary satellite infrared data (2001-2015).
Result: GAIA outperforms MAE-only baseline: atmospheric river segmentation (F1: 0.58 vs 0.52), tropical cyclone detection (storm-level recall: 81% vs 75%, early detection: 29% vs 17%), robust gap-filling across 30-95% masking, and learns disentangled representations capturing atmospheric dynamics.
Conclusion: Combining complementary self-supervised objectives yields more transferable representations for diverse atmospheric modeling tasks, with GAIA demonstrating spatially coherent, object-centric features distributed across multiple principal components.
Abstract: We introduce GAIA (Geospatial Artificial Intelligence for Atmospheres), a hybrid self-supervised geospatial foundation model that fuses Masked Autoencoders (MAE) with self-distillation with no labels (DINO) to generate semantically rich representations from global geostationary satellite imagery. Pre-trained on 15 years of globally-merged infrared observations (2001-2015), GAIA learns disentangled representations that capture atmospheric dynamics rather than trivial diurnal patterns, as evidenced by distributed principal component structure and temporal coherence analysis. We demonstrate robust reconstruction capabilities across varying data availability (30-95% masking), achieving superior gap-filling performance on real missing data patterns. When transferred to downstream tasks, GAIA consistently outperforms an MAE-only baseline: improving atmospheric river segmentation (F1: 0.58 vs 0.52), enhancing tropical cyclone detection (storm-level recall: 81% vs 75%, early detection: 29% vs 17%), and maintaining competitive precipitation estimation performance. Analysis reveals that GAIA’s hybrid objectives encourage learning of spatially coherent, object-centric features distributed across multiple principal components rather than concentrated representations focused on reconstruction. This work demonstrates that combining complementary self-supervised objectives yields more transferable representations for diverse atmospheric modeling tasks. Model weights and code are available at: https://huggingface.co/bcg-usra-nasa-gaia/GAIA-v1.
[351] Byzantine Resilient Federated Multi-Task Representation Learning
Tuan Le, Shana Moothedath
Main category: cs.LG
TL;DR: BR-MTRL is a Byzantine-resilient multi-task representation learning framework that handles malicious agents in federated learning through shared neural network layers and client-specific final layers, using robust aggregation methods like Geometric Median and Krum.
Details
Motivation: To enable personalized learning in heterogeneous federated settings while defending against Byzantine (faulty or malicious) agents that could disrupt the learning process.Method: Uses representation learning with shared neural network layers and client-specific final layers, employing alternating gradient descent where clients optimize local models and send representation estimates to a central server aggregated using robust methods (Geometric Median and Krum).
Result: Implemented on AWS platform and tested on real-world datasets (CIFAR-10, FEMNIST), showing effectiveness, robustness against Byzantine adversaries, and transferability to new clients with limited data.
Conclusion: The proposed BR-MTRL framework successfully enables personalized learning while maintaining Byzantine resilience in distributed settings, demonstrating practical applicability through real-world experiments.
Abstract: In this paper, we propose BR-MTRL, a Byzantine-resilient multi-task representation learning framework that handles faulty or malicious agents. Our approach leverages representation learning through a shared neural network model, where all clients share fixed layers, except for a client-specific final layer. This structure captures shared features among clients while enabling individual adaptation, making it a promising approach for leveraging client data and computational power in heterogeneous federated settings to learn personalized models. To learn the model, we employ an alternating gradient descent strategy: each client optimizes its local model, updates its final layer, and sends estimates of the shared representation to a central server for aggregation. To defend against Byzantine agents, we employ two robust aggregation methods for client-server communication, Geometric Median and Krum. Our method enables personalized learning while maintaining resilience in distributed settings. We implemented the proposed algorithm in a federated testbed built using Amazon Web Services (AWS) platform and compared its performance with various benchmark algorithms and their variations. Through experiments using real-world datasets, including CIFAR-10 and FEMNIST, we demonstrated the effectiveness and robustness of our approach and its transferability to new unseen clients with limited data, even in the presence of Byzantine adversaries.
[352] Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
Zhanke Zhou, Zhaocheng Zhu, Xuan Li, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, Bo Han
Main category: cs.LG
TL;DR: Landscape of Thoughts (LoT) is a visualization tool that inspects LLM reasoning trajectories by converting textual states to numerical features and visualizing them with t-SNE, enabling analysis of reasoning patterns and model performance.
Details
Motivation: LLM reasoning behavior remains poorly understood despite its importance for applications, posing challenges for research, development, and safety.Method: Convert textual reasoning states to numerical features measuring distance to answer choices, then visualize in 2D using t-SNE. Can be adapted to build lightweight verifiers for trajectory correctness.
Result: Effectively distinguishes strong/weak models, correct/incorrect answers, and different reasoning tasks. Uncovers undesirable patterns like low consistency and high uncertainty. Verifier adaptation boosts reasoning accuracy and test-time scaling.
Conclusion: LoT provides valuable insights into LLM reasoning behavior and can be adapted to improve reasoning performance through verification, addressing gaps in understanding LLM reasoning processes.
Abstract: Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts (LoT), the first landscape visualization tool to inspect the reasoning trajectories with certain reasoning methods on any multi-choice dataset. We represent the textual states in a trajectory as numerical features that quantify the states’ distances to the answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt LoT to a model that predicts the property they observe. We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories. Empirically, this verifier boosts the reasoning accuracy and the test-time scaling effect. The code is publicly available at: https://github.com/tmlr-group/landscape-of-thoughts.
[353] SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA
Minrui Luo, Fuhang Kuang, Yu Wang, Zirui Liu, Tianxing He
Main category: cs.LG
TL;DR: SC-LoRA is a novel LoRA initialization framework that balances efficient fine-tuning and knowledge preservation by constraining LoRA adapters in a low-rank subspace.
Details
Motivation: Vanilla LoRA suffers from slow convergence and knowledge forgetting, and existing methods cannot address both efficient fine-tuning and knowledge preservation simultaneously.Method: Constrains trainable LoRA adapters in a low-rank subspace that preserves fine-tuning data context while minimizing preserved knowledge context loss.
Result: SC-LoRA achieves superior fine-tuning performance while significantly reducing knowledge forgetting, outperforming contemporary LoRA initialization methods.
Conclusion: SC-LoRA successfully navigates the trade-off between efficient fine-tuning and knowledge preservation, providing a balanced solution for PEFT methods.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), are indispensable for efficiently customizing Large Language Models (LLMs). However, vanilla LoRA suffers from slow convergence speed and knowledge forgetting problems. Recent studies have leveraged the power of designed LoRA initialization, to enhance the fine-tuning efficiency, or to preserve knowledge in the pre-trained LLM. However, none of these works can address the two cases at the same time. To this end, we introduce Subspace-Constrained LoRA (SC-LoRA), a novel LoRA initialization framework engineered to navigate the trade-off between efficient fine-tuning and knowledge preservation. We achieve this by constraining the output of trainable LoRA adapters in a low-rank subspace, where the context information of fine-tuning data is most preserved while the context information of preserved knowledge is least retained, in a balanced way. Such constraint enables the trainable weights to primarily focus on the main features of fine-tuning data while avoiding damaging the preserved knowledge features. We provide theoretical analysis on our method, and conduct extensive experiments including safety preservation and world knowledge preservation, on various downstream tasks. In our experiments, SC-LoRA succeeds in delivering superior fine-tuning performance while markedly diminishing knowledge forgetting, surpassing contemporary LoRA initialization methods.
[354] Benchmarking Ultra-Low-Power $μ$NPUs
Josh Millar, Yushan Huang, Sarab Sethi, Hamed Haddadi, Anil Madhavapeddy
Main category: cs.LG
TL;DR: First comparative evaluation of microcontroller-scale neural processing units (μNPUs) with independent benchmarks and open-source compilation pipeline for fair comparison.
Details
Motivation: On-device NN inference offers predictable latency, improved privacy, reliability, and lower costs than cloud-based inference, driving development of μNPUs for ultra-low-power applications.Method: Developed and open-sourced a model compilation pipeline supporting consistent benchmarking of quantized models across diverse microcontroller hardware.
Result: Uncovered expected performance trends and surprising disparities between hardware specifications and actual performance, including unexpected scaling behaviors with model complexity.
Conclusion: Provides foundation for ongoing evaluation of μNPU platforms and offers practical insights for hardware and software developers in this rapidly evolving space.
Abstract: Efficient on-device neural network (NN) inference offers predictable latency, improved privacy and reliability, and lower operating costs for vendors than cloud-based inference. This has sparked recent development of microcontroller-scale NN accelerators, also known as neural processing units ($\mu$NPUs), designed specifically for ultra-low-power applications. We present the first comparative evaluation of a number of commercially-available $\mu$NPUs, including the first independent benchmarks for multiple platforms. To ensure fairness, we develop and open-source a model compilation pipeline supporting consistent benchmarking of quantized models across diverse microcontroller hardware. Our resulting analysis uncovers both expected performance trends as well as surprising disparities between hardware specifications and actual performance, including certain $\mu$NPUs exhibiting unexpected scaling behaviors with model complexity. This work provides a foundation for ongoing evaluation of $\mu$NPU platforms, alongside offering practical insights for both hardware and software developers in this rapidly evolving space.
[355] Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting
ChengAo Shen, Wenchao Yu, Ziming Zhao, Dongjin Song, Wei Cheng, Haifeng Chen, Jingchao Ni
Main category: cs.LG
TL;DR: DMMV is a decomposition-based multi-modal view framework that addresses inductive bias in LVM-based time series forecasting by using trend-seasonal decomposition and adaptive decomposition to integrate multiple views (images and texts) of time series data.
Details
Motivation: Time series can be represented as images and texts (multi-modal views) to reveal complementary patterns and leverage pre-trained large vision models for forecasting. However, current LVM-based forecasters have inductive bias towards forecasting periods that needs to be addressed.Method: Proposes DMMV framework using trend-seasonal decomposition and a novel backcast-residual based adaptive decomposition to integrate multi-modal views (images and texts) for long-term time series forecasting.
Result: Outperforms 14 state-of-the-art models across diverse datasets, achieving best mean squared error on 6 out of 8 benchmark datasets.
Conclusion: DMMV effectively harnesses the inductive bias in LVM-based forecasting and demonstrates superior performance through decomposition-based multi-modal view integration.
Abstract: Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identified in this work, the state-of-the-art (SOTA) LVM-based forecaster poses an inductive bias towards “forecasting periods”. To harness this bias, we propose DMMV, a novel decomposition-based multi-modal view framework that leverages trend-seasonal decomposition and a novel backcast-residual based adaptive decomposition to integrate MMVs for LTSF. Comparative evaluations against 14 SOTA models across diverse datasets show that DMMV outperforms single-view and existing multi-modal baselines, achieving the best mean squared error (MSE) on 6 out of 8 benchmark datasets. The code for this paper is available at: https://github.com/D2I-Group/dmmv.
[356] A Robust and Non-Iterative Tensor Decomposition Method with Automatic Thresholding
Hiroki Hasegawa, Yukihiko Okada
Main category: cs.LG
TL;DR: A novel tensor low-rank approximation method that eliminates the need for predefined ranks and iterative optimization, using statistical singular value hard thresholding for automatic component extraction.
Details
Motivation: Existing tensor decomposition methods require predefined ranks and iterative optimization, leading to high computational costs and dependence on analyst expertise.Method: Applies statistical singular value hard thresholding to each mode-wise unfolded matrix to automatically extract statistically significant components, with optimal thresholds derived from Marcenko-Pastur distribution theory.
Result: Outperforms conventional approaches (HOSVD, HOOI, and Tucker-L2E) in both estimation accuracy and computational efficiency in simulation experiments.
Conclusion: Provides a theoretically grounded, fully automatic, and non-iterative framework for tensor decomposition that reduces noise while preserving intrinsic structure.
Abstract: Recent advances in IoT and biometric sensing technologies have led to the generation of massive and high-dimensional tensor data, yet achieving accurate and efficient low-rank approximation remains a major challenge. Most existing tensor decomposition methods require predefined ranks and iterative optimization, resulting in high computational costs and dependence on analyst expertise. This study proposes a novel tensor low-rank approximation method that eliminates both prior rank specification and iterative optimization. The method applies statistical singular value hard thresholding to each mode-wise unfolded matrix to automatically extract statistically significant components, effectively reducing noise while preserving the intrinsic structure. Theoretically, the optimal thresholds for each mode are derived from the asymptotic properties of the Marcenko-Pastur distribution. Simulation experiments demonstrate that the proposed method outperforms conventional approaches (HOSVD, HOOI, and Tucker-L2E) in both estimation accuracy and computational efficiency. These results indicate that the proposed approach provides a theoretically grounded, fully automatic, and non-iterative framework for tensor decomposition.
[357] Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers
Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff
Main category: cs.LG
TL;DR: The paper proposes a pre-scoring mechanism that enhances HyperAttention by prioritizing significant keys before applying attention, achieving better perplexity and efficiency than standard HyperAttention while maintaining competitive performance with LevAttention.
Details
Motivation: HyperAttention fails to find all significant keys, which increases perplexity. The authors aim to improve this by adding a pre-scoring mechanism to better identify important keys.Method: Introduces three pre-scoring methods: k-means and kernel k-means clustering, k-median clustering, and leverage score-based ranking. Replaces HyperAttention’s uniform residual sampling with the pre-scoring mechanism.
Result: Reduces perplexity from 12 to 8.3 on ChatGLM2 (131k token context), outperforms standard HyperAttention. On Vision-Transformer, achieves similar accuracy to LevAttention and surpasses it with specific parameters. Combined with HyperAttention, achieves up to 20x faster than FlashAttention.
Conclusion: Integrating pre-scoring into hierarchical attention mechanisms significantly improves transformer efficiency, providing a balanced trade-off between speed and modeling accuracy.
Abstract: Recent advances in transformer architectures deeply enhanced long-context language modeling. Among them, HyperAttention achieves competitive efficiency by combining a single-level LSH-based clustering with uniform residual sampling. However, HyperAttention fails to find all significant keys, which in turn raises the overall perplexity. We propose a pre-scoring mechanism that prioritizes significant keys before applying HyperAttention. We introduce three scoring methods: $k$-means and kernel $k$-means clustering, $k$-median clustering, and leverage score-based ranking (inspired by LevAttention) to filter keys effectively. We further replace HyperAttention’s original uniform residual sampling, relying exclusively on our pre-scoring mechanism. Experiments on ChatGLM2 (131k token context) reduce perplexity from 12 to 8.3, which outperforms standard HyperAttention. Moreover, when running on the Vision-Transformer (ViT), our method shows that it can guarantee similar accuracy compared with LevAttention, and will surpass LevAttention given specific parameters. Although this method introduces some computational overhead, its combination with HyperAttention achieves up to 20 times faster than FlashAttention, providing a balanced trade-off between speed and modeling accuracy. Our results highlight the effectiveness of integrating pre-scoring into hierarchical attention mechanisms, significantly improving transformer efficiency.
[358] PoLAR: Polar-Decomposed Low-Rank Adapter Representation
Kai Lion, Liang Zhang, Bingcong Li, Niao He
Main category: cs.LG
TL;DR: PoLAR addresses the underutilization of subspace in low-rank adaptation by using polar decomposition to factorize updates into direction matrices on Stiefel manifolds and a scale matrix, achieving faster convergence and improved performance across various benchmarks.
Details
Motivation: Low-rank adaptation of large models suffers from low stable rank that degrades fine-tuning performance due to underutilization of allocated subspace.Method: Proposes PoLAR parameterization using polar decomposition to factorize low-rank updates into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix, combined with Riemannian optimization.
Result: Theoretical analysis shows exponentially faster convergence rate. Experimental results demonstrate consistent gains on language understanding, commonsense reasoning, and mathematical problem solving benchmarks with models from 350M to 27B parameters.
Conclusion: PoLAR effectively mitigates subspace underutilization in low-rank adaptation through polar decomposition and Riemannian optimization, leading to improved fine-tuning performance across diverse tasks and model sizes.
Abstract: We show that low-rank adaptation of large-scale models suffers from a low stable rank that is well below the linear algebraic rank of the subspace, degrading fine-tuning performance. To mitigate the underutilization of the allocated subspace, we propose PoLAR, a parameterization inspired by the polar decomposition that factorizes the low-rank update into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix. Our theory shows that PoLAR yields an exponentially faster convergence rate on a canonical low-rank adaptation problem. Pairing the parameterization with Riemannian optimization leads to consistent gains on three different benchmarks testing general language understanding, commonsense reasoning, and mathematical problem solving with base model sizes ranging from 350M to 27B.
[359] Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models
Yuchen Liang, Renxiang Huang, Lifeng Lai, Ness Shroff, Yingbin Liang
Main category: cs.LG
TL;DR: This paper provides the first theoretical analysis of discrete diffusion models with absorbing rate matrices, establishing finite-time error bounds and convergence rates that improve upon uniform rate matrices.
Details
Motivation: Discrete diffusion models perform better with absorbing rate matrices than uniform ones, but existing theoretical work only covers uniform rate matrices, leaving convergence guarantees for absorbing rate matrices unproven.Method: The authors derive KL divergence bounds for the forward process using a surrogate initialization distribution, and establish convergence guarantees for τ-leaping and uniformization samplers under absorbing rate matrices with novel technical tools.
Result: The paper demonstrates improved convergence rates for absorbing rate matrices over uniform rate matrices, and provides convergence guarantees without early stopping under suitable assumptions.
Conclusion: This work fills the theoretical gap for absorbing rate matrices in discrete diffusion models, providing the first convergence analysis and showing their theoretical superiority over uniform rate matrices.
Abstract: Discrete state space diffusion models have shown significant advantages in applications involving discrete data, such as text and image generation. It has also been observed that their performance is highly sensitive to the choice of rate matrices, particularly between uniform and absorbing rate matrices. While empirical results suggest that absorbing rate matrices often yield better generation quality compared to uniform rate matrices, existing theoretical works have largely focused on the uniform rate matrices case. Notably, convergence guarantees and error analyses for absorbing diffusion models are still missing. In this work, we provide the first finite-time error bounds and convergence rate analysis for discrete diffusion models using absorbing rate matrices. We begin by deriving an upper bound on the KL divergence of the forward process, introducing a surrogate initialization distribution to address the challenge posed by the absorbing stationary distribution, which is a singleton and causes the KL divergence to be ill-defined. We then establish the first convergence guarantees for both the $\tau$-leaping and uniformization samplers under absorbing rate matrices, demonstrating improved rates over their counterparts using uniform rate matrices. Furthermore, under suitable assumptions, we provide convergence guarantees without early stopping. Our analysis introduces several new technical tools to address challenges unique to absorbing rate matrices. These include a Jensen-type argument for bounding forward process convergence, novel techniques for bounding absorbing score functions, and a non-divergent upper bound on the score near initialization that removes the need of early-stopping.
[360] UdonCare: Hierarchy Pruning for Unseen Domain Discovery in Predictive Healthcare
Pengfei Hu, Xiaoxue Han, Fei Wang, Yue Ning
Main category: cs.LG
TL;DR: UdonCare is a hierarchy-guided method for clinical domain generalization that iteratively divides patients into latent domains using medical ontologies and decomposes domain-invariant information from patient data.
Details
Motivation: Healthcare providers use patient cohorts for personalized care, but clinical prediction models struggle to capture both global and cohort-specific patterns while generalizing to unseen domains, especially without explicit domain labels and with medical knowledge gaps.Method: Proposes UdonCare which identifies patient domains by pruning medical ontologies (e.g., ICD-9-CM hierarchy) and iteratively decomposes domain-invariant label information from patient data.
Result: UdonCare outperforms eight baselines across four clinical prediction tasks on MIMIC-III and MIMIC-IV datasets, demonstrating substantial improvements in domain generalization.
Conclusion: The method highlights the untapped potential of medical knowledge in guiding clinical domain generalization problems, showing superiority in handling domain gaps in clinical settings.
Abstract: Healthcare providers often divide patient populations into cohorts based on shared clinical factors, such as medical history, to deliver personalized healthcare services. This idea has also been adopted in clinical prediction models, where it presents a vital challenge: capturing both global and cohort-specific patterns while enabling model generalization to unseen domains. Addressing this challenge falls under the scope of domain generalization (DG). However, conventional DG approaches often struggle in clinical settings due to the absence of explicit domain labels and the inherent gap in medical knowledge. To address this, we propose UdonCare, a hierarchy-guided method that iteratively divides patients into latent domains and decomposes domain-invariant (label) information from patient data. Our method identifies patient domains by pruning medical ontologies (e.g. ICD-9-CM hierarchy). On two public datasets, MIMIC-III and MIMIC-IV, UdonCare shows superiority over eight baselines across four clinical prediction tasks with substantial domain gaps, highlighting the untapped potential of medical knowledge in guiding clinical domain generalization problems.
[361] Kernel conditional tests from learning-theoretic bounds
Pierre-François Massiani, Christian Fiedler, Lukas Haverbeck, Friedrich Solowjow, Sebastian Trimpe
Main category: cs.LG
TL;DR: A framework for hypothesis testing on conditional probability distributions using kernel ridge regression with confidence bounds, enabling tests of functionals like conditional moments and two-sample comparisons.
Details
Motivation: To develop statistical tests that can identify where functionals of conditional distributions differ with high probability, addressing gaps in existing methods for conditional testing.Method: Transform confidence bounds from kernel ridge regression into conditional expectation tests, use kernel mean embeddings for distribution testing, generalize confidence bounds for infinite-dimensional outputs, and introduce practical bootstrapping schemes.
Result: Developed tests that work with non-trace-class kernels and infinite-dimensional outputs, allow online sampling without independent data requirements, and provide practical implementation through parametric bootstrap tuning.
Conclusion: Establishes a comprehensive foundation for conditional testing on functionals with theoretical guarantees and practical implementation, advancing confidence bounds for vector-valued least squares estimation.
Abstract: We propose a framework for hypothesis testing on conditional probability distributions, which we then use to construct statistical tests of functionals of conditional distributions. These tests identify the inputs where the functionals differ with high probability, and include tests of conditional moments or two-sample tests. Our key idea is to transform confidence bounds of a learning method into a test of conditional expectations. We instantiate this principle for kernel ridge regression (KRR) with subgaussian noise. An intermediate data embedding then enables more general tests – including conditional two-sample tests – via kernel mean embeddings of distributions. To have guarantees in this setting, we generalize existing pointwise-in-time or time-uniform confidence bounds for KRR to previously-inaccessible yet essential cases such as infinite-dimensional outputs with non-trace-class kernels. These bounds also circumvent the need for independent data, allowing for instance online sampling. To make our tests readily applicable in practice, we introduce bootstrapping schemes leveraging the parametric form of testing thresholds identified in theory to avoid tuning inaccessible parameters. We illustrate the tests on examples, including one in process monitoring and comparison of dynamical systems. Overall, our results establish a comprehensive foundation for conditional testing on functionals, from theoretical guarantees to an algorithmic implementation, and advance the state of the art on confidence bounds for vector-valued least squares estimation.
[362] PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design
Zhenqiao Song, Tiaoxiao Li, Lei Li, Martin Renqiang Min
Main category: cs.LG
TL;DR: PPDiff is a diffusion model that jointly designs protein binder sequences and structures for arbitrary targets using a novel neural network architecture, achieving high success rates in protein-protein complex design tasks.
Details
Motivation: Current methods for designing protein-binding proteins require extensive wet-lab testing and struggle to create high-affinity binders for arbitrary protein targets on demand.Method: PPDiff uses a diffusion model with SSINC architecture that integrates interleaved self-attention, kNN equivariant graph layers, and causal attention layers. It’s pretrained on PPBench dataset (706,360 complexes) and finetuned for specific applications.
Result: PPDiff achieved success rates of 50.00% for pretraining, 23.16% for target-protein mini-binder complex design, and 16.89% for antigen-antibody complex design, consistently outperforming baseline methods.
Conclusion: PPDiff provides an effective computational approach for designing high-affinity protein binders for arbitrary targets without extensive wet-lab testing, with promising results across different applications.
Abstract: Designing protein-binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advancements targeting specific proteins, the ability to create high-affinity binders for arbitrary protein targets on demand, without extensive rounds of wet-lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model to jointly design the sequence and structure of binders for arbitrary protein targets in a non-autoregressive manner. PPDiffbuilds upon our developed Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self-attention layers to capture global amino acid correlations, k-nearest neighbor (kNN) equivariant graph layers to model local interactions in three-dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBenchand finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. PPDiffconsistently surpasses baseline methods, achieving success rates of 50.00%, 23.16%, and 16.89% for the pretraining task and the two downstream applications, respectively. The code, data and models are available at https://github.com/JocelynSong/PPDiff.
[363] Graph Diffusion that can Insert and Delete
Matteo Ninniri, Marco Podda, Davide Bacciu
Main category: cs.LG
TL;DR: GrIDDD is a graph diffusion model that enables dynamic node insertion and deletion during molecular generation, overcoming limitations of fixed-size graphs in existing diffusion models.
Details
Motivation: Existing graph diffusion models cannot adapt graph size during generation, which limits their effectiveness in conditional generation scenarios where molecular properties often correlate with size.Method: Reformulated noising and denoising processes to support monotonic insertion and deletion of nodes, creating a size-adaptive graph diffusion model called GrIDDD.
Result: GrIDDD matches or exceeds existing graph diffusion models on molecular property targeting and shows competitive performance in molecular optimization compared to specialized models.
Conclusion: This work enables size-adaptive molecular generation with graph diffusion, paving the way for more effective property-driven molecular design.
Abstract: Generative models of graphs based on discrete Denoising Diffusion Probabilistic Models (DDPMs) offer a principled approach to molecular generation by systematically removing structural noise through iterative atom and bond adjustments. However, existing formulations are fundamentally limited by their inability to adapt the graph size (that is, the number of atoms) during the diffusion process, severely restricting their effectiveness in conditional generation scenarios such as property-driven molecular design, where the targeted property often correlates with the molecular size. In this paper, we reformulate the noising and denoising processes to support monotonic insertion and deletion of nodes. The resulting model, which we call GrIDDD, dynamically grows or shrinks the chemical graph during generation. GrIDDD matches or exceeds the performance of existing graph diffusion models on molecular property targeting despite being trained on a more difficult problem. Furthermore, when applied to molecular optimization, GrIDDD exhibits competitive performance compared to specialized optimization models. This work paves the way for size-adaptive molecular generation with graph diffusion.
[364] Geometry-Aware Edge Pooling for Graph Neural Networks
Katharina Limbeck, Lydia Mezrag, Guy Wolf, Bastian Rieck
Main category: cs.LG
TL;DR: The paper proposes novel graph pooling layers that use edge collapses to preserve graph structures while reducing graph size, achieving better performance and interpretability compared to existing methods.
Details
Motivation: Existing GNN pooling layers often discard fundamental graph structures to optimize for learning tasks, reducing interpretability and causing unreliable performance across different datasets, tasks, and pooling ratios.Method: The proposed methods leverage diffusion geometry and iteratively reduce graph size via edge collapses while preserving metric structure and structural diversity. They use magnitude (an isometry-invariant diversity measure) and spread of metric space to guide pooling and ensure computational efficiency.
Result: Empirical results show the methods (i) achieve top performance across diverse graph classification tasks, (ii) preserve key spectral properties of input graphs, and (iii) retain high accuracy across varying pooling ratios.
Conclusion: The proposed structure-aware pooling via edge collapses provides an effective approach that maintains graph interpretability while achieving competitive performance across various conditions.
Abstract: Graph Neural Networks (GNNs) have shown significant success for graph-based tasks. Motivated by the prevalence of large datasets in real-world applications, pooling layers are crucial components of GNNs. By reducing the size of input graphs, pooling enables faster training and potentially better generalisation. However, existing pooling operations often optimise for the learning task at the expense of discarding fundamental graph structures, thus reducing interpretability. This leads to unreliable performance across dataset types, downstream tasks and pooling ratios. Addressing these concerns, we propose novel graph pooling layers for structure-aware pooling via edge collapses. Our methods leverage diffusion geometry and iteratively reduce a graph’s size while preserving both its metric structure and its structural diversity. We guide pooling using magnitude, an isometry-invariant diversity measure, which permits us to control the fidelity of the pooling process. Further, we use the spread of a metric space as a faster and more stable alternative ensuring computational efficiency. Empirical results demonstrate that our methods (i) achieve top performance compared to alternative pooling layers across a range of diverse graph classification tasks, (ii) preserve key spectral properties of the input graphs, and (iii) retain high accuracy across varying pooling ratios.
[365] Graph Semi-Supervised Learning for Point Classification on Data Manifolds
Caio F. Deberaldini Netto, Zhiyang Wang, Luana Ruiz
Main category: cs.LG
TL;DR: A graph semi-supervised learning framework that models data as points on a low-dimensional manifold, uses VAE for manifold approximation, constructs geometric graphs, and applies GNNs for classification with theoretical generalization guarantees.
Details
Motivation: To leverage the manifold hypothesis for semi-supervised classification by modeling data as points sampled from low-dimensional manifolds and developing theoretical guarantees for generalization.Method: 1) Approximate data manifold using VAE encoder to get embeddings; 2) Construct geometric graph with Gaussian-weighted edges based on embedding distances; 3) Apply GNN for semi-supervised node classification; 4) Use resampling strategy during training.
Result: Theoretical analysis shows generalization gap diminishes with increasing graph size and can vanish asymptotically with resampling. Numerical experiments on image benchmarks validate empirical effectiveness.
Conclusion: The proposed data-to-manifold-to-graph pipeline provides strong theoretical generalization guarantees for semi-supervised learning, with practical effectiveness demonstrated on real datasets.
Abstract: We propose a graph semi-supervised learning framework for classification tasks on data manifolds. Motivated by the manifold hypothesis, we model data as points sampled from a low-dimensional manifold $\mathcal{M} \subset \mathbb{R}^F$. The manifold is approximated in an unsupervised manner using a variational autoencoder (VAE), where the trained encoder maps data to embeddings that represent their coordinates in $\mathbb{R}^F$. A geometric graph is constructed with Gaussian-weighted edges inversely proportional to distances in the embedding space, transforming the point classification problem into a semi-supervised node classification task on the graph. This task is solved using a graph neural network (GNN). Our main contribution is a theoretical analysis of the statistical generalization properties of this data-to-manifold-to-graph pipeline. We show that, under uniform sampling from $\mathcal{M}$, the generalization gap of the semi-supervised task diminishes with increasing graph size, up to the GNN training error. Leveraging a training procedure which resamples a slightly larger graph at regular intervals during training, we then show that the generalization gap can be reduced even further, vanishing asymptotically. Finally, we validate our findings with numerical experiments on image classification benchmarks, demonstrating the empirical effectiveness of our approach.
[366] RTNinja: a generalized machine learning framework for analyzing random telegraph noise signals in nanoelectronic devices
Anirudh Varanasi, Robin Degraeve, Philippe Roussel, Clement Merckling
Main category: cs.LG
TL;DR: RTNinja is an automated machine learning framework for unsupervised analysis of random telegraph noise signals that identifies hidden sources without prior knowledge, outperforming conventional methods.
Details
Motivation: Random telegraph noise critically impacts nanoelectronic device reliability, but conventional analysis techniques have restrictive assumptions and require manual interventions, limiting applicability to complex datasets.Method: RTNinja has two modular components: LevelsExtractor (uses Bayesian inference and model selection for denoising/discretization) and SourcesMapper (uses probabilistic clustering and optimization to infer source configurations). Performance was evaluated using a Monte Carlo simulator generating 7000 labeled datasets.
Result: Across 7000 datasets spanning various signal-to-noise ratios and source complexities, RTNinja consistently demonstrated high-fidelity signal reconstruction and accurate extraction of source amplitudes and activity patterns.
Conclusion: RTNinja provides a robust, scalable, device-agnostic tool for random telegraph noise characterization, enabling large-scale statistical benchmarking, reliability qualification, predictive failure modeling, and device physics exploration.
Abstract: Random telegraph noise is a prevalent variability phenomenon in nanoelectronic devices, arising from stochastic carrier exchange at defect sites and critically impacting device reliability and performance. Conventional analysis techniques often rely on restrictive assumptions or manual interventions, limiting their applicability to complex, noisy datasets. Here, we introduce RTNinja, a generalized, fully automated machine learning framework for the unsupervised analysis of random telegraph noise signals. RTNinja deconvolves complex signals to identify the number and characteristics of hidden individual sources, without requiring prior knowledge of the system. The framework comprises two modular components: LevelsExtractor, which uses Bayesian inference and model selection to denoise and discretize the signal; and SourcesMapper, which infers source configurations through probabilistic clustering and optimization. To evaluate performance, we developed a Monte Carlo simulator that generates labeled datasets spanning broad signal-to-noise ratios and source complexities; across 7000 such datasets, RTNinja consistently demonstrated high-fidelity signal reconstruction and accurate extraction of source amplitudes and activity patterns. Our results demonstrate that RTNinja offers a robust, scalable, and device-agnostic tool for random telegraph noise characterization, enabling large-scale statistical benchmarking, reliability-centric technology qualification, predictive failure modeling, and device physics exploration in next-generation nanoelectronics.
[367] Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control
Thomas T. Zhang, Daniel Pfrommer, Chaoyi Pan, Nikolai Matni, Max Simchowitz
Main category: cs.LG
TL;DR: Theoretical analysis shows action-chunking and exploratory augmentation in imitation learning circumvent exponential compounding errors through control-theoretic stability mechanisms.
Details
Motivation: To understand why action-chunking and exploratory data collection are effective interventions in imitation learning, despite known issues with exponential compounding errors in continuous control settings.Method: Theoretical analysis using control-theoretic stability framework, validated through experiments on robot learning benchmarks, comparing with previous information-theoretic approaches.
Result: Action-chunking and exploratory augmentation avoid exponential compounding errors in different regimes, with control-theoretic stability identified as the key mechanism. Provides tighter statistical guarantees than information-theoretic methods alone.
Conclusion: Control-theoretic perspective offers superior insights into imitation learning error compounding and enables more effective interventions than purely information-theoretic approaches.
Abstract: This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.
[368] Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
Yujie Zhu, Charles A. Hepburn, Matthew Thorpe, Giovanni Montana
Main category: cs.LG
TL;DR: SPReD is a reinforcement learning framework that uses ensemble methods to determine when to imitate demonstrations vs follow own policy, applying continuous uncertainty-proportional regularization instead of binary decisions.
Details
Motivation: In sparse reward RL, demonstrations accelerate learning but determining when to imitate them is challenging. Current methods make binary imitation decisions which can be suboptimal.Method: Uses ensemble methods to model Q-value distributions for both demonstration and policy actions. Develops two uncertainty-aware approaches: probabilistic estimation of demonstration superiority likelihood, and advantage-based scaling by statistical significance.
Result: Achieves up to 14x performance gains in complex robotics tasks compared to existing methods, while maintaining robustness to demonstration quality and quantity.
Conclusion: SPReD’s continuous uncertainty-proportional regularization approach outperforms binary imitation decision methods and provides more stable training with reduced gradient variance.
Abstract: In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at https://github.com/YujieZhu7/SPReD.
[369] Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies
Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang
Main category: cs.LG
TL;DR: MLES uses multimodal LLMs and evolutionary search to generate transparent, programmatic control policies that match PPO performance while providing interpretable logic and traceable design processes.
Details
Motivation: Deep RL policies are opaque neural networks that are difficult to understand, verify, and debug, undermining trust and hindering real-world deployment.Method: Multimodal Large Language Model-assisted Evolutionary Search (MLES) uses LLMs as policy generators combined with evolutionary search, integrating visual feedback-driven behavior analysis to identify failure patterns and guide improvements.
Result: MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks while providing transparent control logic and traceable design processes.
Conclusion: MLES overcomes limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.
Abstract: Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. This work addresses this challenge by introducing a novel approach for programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and guide targeted improvements, thereby enhancing policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks while providing transparent control logic and traceable design processes. This approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.
[370] Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases
Marc Boullé, Nicolas Voisine, Bruno Guerraz, Carine Hue, Felipe Olmos, Vladimir Popescu, Stéphane Gouache, Stéphane Bouget, Alexis Bondu, Luc Aurelien Gauthier, Yassine Nair Benrekia, Fabrice Clérot, Vincent Lemaire
Main category: cs.LG
TL;DR: Khiops is an open-source ML tool for mining large multi-table databases using Bayesian methods, featuring variable selection, classification, decision trees, and co-clustering capabilities.
Details
Motivation: To provide an efficient machine learning solution for analyzing large-scale multi-table databases with millions of records and thousands of variables, addressing the challenges of big data mining.Method: Uses a Bayesian approach with naive Bayesian classifier incorporating variable selection and weight learning. For numerical data, employs discretisation models; for categorical data, uses value clustering. Automatically constructs aggregates for multi-table databases.
Result: Successfully handles databases with millions of individuals, tens of thousands of variables, and hundreds of millions of records in secondary tables. Has generated academic interest with over 20 publications.
Conclusion: Khiops provides a robust, scalable machine learning tool for large multi-table database analysis, available as both Python library and user interface, demonstrating practical applicability in big data scenarios.
Abstract: Khiops is an open source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available on many environments, both from a Python library and via a user interface.
[371] Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
Kangjia Yan, Chenxi Liu, Hao Miao, Xinle Wu, Yan Zhao, Chenjuan Guo, Bin Yang
Main category: cs.LG
TL;DR: TimePD is a source-free domain adaptation framework for time series forecasting that adapts pretrained models to target domains without accessing source data, using LLMs with proxy denoising.
Details
Motivation: Addresses the challenge of adapting time series forecasting models to new domains while complying with data protection regulations that restrict access to source data.Method: Uses dual-branch invariant disentangled feature learning with season-trend decomposition, lightweight proxy denoising to calibrate LLM biases, and bidirectional knowledge distillation.
Result: Outperforms state-of-the-art baselines by 9.3% on average across real-world datasets.
Conclusion: TimePD effectively enables source-free domain adaptation for time series forecasting while maintaining data privacy and leveraging LLM generalization capabilities.
Abstract: The proliferation of mobile devices generates a massive volume of time series across various domains, where effective time series forecasting enables a variety of real-world applications. This study focuses on a new problem of source-free domain adaptation for time series forecasting. It aims to adapt a pretrained model from sufficient source time series to the sparse target time series domain without access to the source data, embracing data protection regulations. To achieve this, we propose TimePD, the first source-free time series forecasting framework with proxy denoising, where large language models (LLMs) are employed to benefit from their generalization capabilities. Specifically, TimePD consists of three key components: (1) dual-branch invariant disentangled feature learning that enforces representation- and gradient-wise invariance by means of season-trend decomposition; (2) lightweight, parameter-free proxy denoising that dynamically calibrates systematic biases of LLMs; and (3) knowledge distillation that bidirectionally aligns the denoised prediction and the original target prediction. Extensive experiments on real-world datasets offer insight into the effectiveness of the proposed TimePD, outperforming SOTA baselines by 9.3% on average.
[372] Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees
Yuchen Liang, Yingbin Liang, Lifeng Lai, Ness Shroff
Main category: cs.LG
TL;DR: Improved convergence analysis for discrete diffusion samplers with linear vocabulary dependence, removing restrictive assumptions and extending to multiple sampler types.
Details
Motivation: Existing theoretical analyses of τ-leaping samplers rely on restrictive assumptions and have quadratic dependence on vocabulary size, limiting their practical applicability.Method: Introduces a new analytical approach using differential inequalities instead of traditional Girsanov change-of-measure methods, providing more flexible analysis.
Result: Achieves convergence guarantees in KL divergence with linear scaling on vocabulary size for τ-leaping, and provides first convergence guarantees for Euler method and Tweedie τ-leaping.
Conclusion: The new differential inequality technique offers improved theoretical foundations for discrete diffusion samplers and may be useful for analyzing other stochastic processes.
Abstract: Discrete diffusion models have recently gained significant prominence in applications involving natural language and graph data. A key factor influencing their effectiveness is the efficiency of discretized samplers. Among these, $\tau$-leaping samplers have become particularly popular due to their theoretical and empirical success. However, existing theoretical analyses of $\tau$-leaping often rely on somewhat restrictive and difficult-to-verify regularity assumptions, and their convergence bounds contain quadratic dependence on the vocabulary size. In this work, we introduce a new analytical approach for discrete diffusion models that removes the need for such assumptions. For the standard $\tau$-leaping method, we establish convergence guarantees in KL divergence that scale linearly with vocabulary size, improving upon prior results with quadratic dependence. Our approach is also more broadly applicable: it provides the first convergence guarantees for other widely used samplers, including the Euler method and Tweedie $\tau$-leaping. Central to our approach is a novel technique based on differential inequalities, offering a more flexible alternative to the traditional Girsanov change-of-measure methods. This technique may also be of independent interest for the analysis of other stochastic processes.
[373] Eliciting Secret Knowledge from Language Models
Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks
Main category: cs.LG
TL;DR: Study of secret elicitation - discovering hidden knowledge in LLMs that they possess but don’t explicitly state. Researchers trained LLMs with specific knowledge they apply but deny, then tested various black-box and white-box techniques to extract this hidden knowledge.
Details
Motivation: To understand and develop methods for discovering knowledge that AI systems possess but deliberately conceal or don't explicitly verbalize, addressing potential safety and alignment concerns.Method: Trained three families of LLMs with specific secret knowledge they apply downstream but deny when asked directly. Developed and tested black-box (prefill attacks) and white-box (logit lens, sparse autoencoders) secret elicitation techniques, evaluating their effectiveness in helping auditors guess the hidden knowledge.
Result: Prefill attacks (black-box) were most effective across all settings, where LLMs revealed secret knowledge when generating completions from predefined prefixes. White-box techniques (logit lens, SAEs) also consistently improved auditor success rates but were less effective than black-box methods.
Conclusion: Secret elicitation is possible through various techniques, with prefill attacks being most effective. The study establishes a public benchmark for evaluating secret elicitation methods and highlights the importance of detecting hidden knowledge in AI systems.
Abstract: We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in all settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. Our white-box techniques based on logit lens and sparse autoencoders (SAEs) also consistently increase the success rate of the LLM auditor, but are less effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.
[374] SVTime: Small Time Series Forecasting Models Informed by “Physics” of Large Vision Model Forecasters
ChengAo Shen, Ziming Zhao, Hanghang Tong, Dongjin Song, Dongsheng Luo, Qingsong Wen, Jingchao Ni
Main category: cs.LG
TL;DR: SVTime is a lightweight time series forecasting model that achieves large-model-like performance with 1000x fewer parameters, addressing sustainability concerns of large AI models while maintaining competitive accuracy.
Details
Motivation: Large pre-trained models have high carbon footprint and resource demands, making them impractical for resource-constrained users. There's a need for cost-effective lightweight models that can match large model performance on core tasks like forecasting.Method: Identified key inductive biases from large Vision model (LVM) forecasters and designed small models that encode these biases through carefully crafted linear layers and constraint functions.
Result: SVTime outperforms SOTA lightweight models and rivals large models across 21 baselines on 8 benchmark datasets, while using 10^3 fewer parameters than LVMs and enabling efficient training/inference in low-resource settings.
Conclusion: It’s possible to build cost-effective lightweight models with large-model-like performance for time series forecasting, offering a sustainable alternative to energy-intensive large models.
Abstract: Time series AI is crucial for analyzing dynamic web content, driving a surge of pre-trained large models known for their strong knowledge encoding and transfer capabilities across diverse tasks. However, given their energy-intensive training, inference, and hardware demands, using large models as a one-fits-all solution raises serious concerns about carbon footprint and sustainability. For a specific task, a compact yet specialized, high-performing model may be more practical and affordable, especially for resource-constrained users such as small businesses. This motivates the question: Can we build cost-effective lightweight models with large-model-like performance on core tasks such as forecasting? This paper addresses this question by introducing SVTime, a novel Small model inspired by large Vision model (LVM) forecasters for long-term Time series forecasting (LTSF). Recently, LVMs have been shown as powerful tools for LTSF. We identify a set of key inductive biases of LVM forecasters – analogous to the “physics” governing their behaviors in LTSF – and design small models that encode these biases through meticulously crafted linear layers and constraint functions. Across 21 baselines spanning lightweight, complex, and pre-trained large models on 8 benchmark datasets, SVTime outperforms state-of-the-art (SOTA) lightweight models and rivals large models with 10^3 fewer parameters than LVMs, while enabling efficient training and inference in low-resource settings.
[375] PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
Wanjia Zhao, Qinwei Ma, Jingzhe Shi, Shirley Wu, Jiaqi Han, Yijia Xiao, Si-Yuan Chen, Xiao Luo, Ludwig Schmidt, James Zou
Main category: cs.LG
TL;DR: PRISM-Physics is a process-level evaluation framework for physics reasoning that uses directed acyclic graphs (DAGs) to represent solution steps with causal dependencies, enabling fine-grained and interpretable scoring without relying on heuristic LLM judgments.
Details
Motivation: Existing physics benchmarks only evaluate final answers, failing to capture reasoning processes, while recent stepwise methods use unreliable heuristic scoring or restrictive linear assumptions, limiting diagnostic validity.Method: Solutions are represented as DAGs of formulas encoding causal dependencies, combined with a rule-based method for symbolic formula equivalence matching to ensure consistent validation without heuristic judgments.
Result: The evaluation framework is more aligned with human expert scoring, and experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics while providing diagnostic insight.
Conclusion: PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities through structural rigor, theoretical guarantees, and symbolic validation.
Abstract: Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively explored. Most existing physics benchmarks evaluate only final answers, which fail to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combining with a fully rule-based method for symbolic formula equivalence matching that we developed, we ensure consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more aligned with human experts’ scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
[376] Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion
Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji
Main category: cs.LG
TL;DR: The paper analyzes MaskGIT samplers for masked diffusion models, reveals their implicit temperature sampling mechanism, proposes a more tractable “moment sampler” alternative, and introduces efficiency improvements through partial caching and hybrid adaptive unmasking.
Details
Motivation: Masked diffusion models show promising performance but their sampling process acceleration remains underexplored, motivating theoretical analysis and practical improvements for more efficient masked diffusion samplers.Method: Theoretical analysis of MaskGIT sampler revealing implicit temperature sampling; introduction of “moment sampler” as asymptotically equivalent alternative; partial caching for transformers to approximate longer trajectories; hybrid approach for exploration-exploitation trade-off in adaptive unmasking.
Result: Experiments in image and text domains demonstrate the theoretical findings and show efficiency improvements of the proposed methods over existing approaches.
Conclusion: The work advances both theoretical understanding and practical implementation of masked diffusion samplers, providing more efficient and interpretable alternatives to existing methods.
Abstract: Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we introduce the “moment sampler,” an asymptotically equivalent but more tractable and interpretable alternative to MaskGIT, which employs a “choose-then-sample” approach by selecting unmasking positions before sampling tokens. In addition, we improve the efficiency of choose-then-sample algorithms through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains demonstrate our theory as well as the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.
[377] DrivAerStar: An Industrial-Grade CFD Dataset for Vehicle Aerodynamic Optimization
Jiyan Qiu, Lyulin Kuang, Guan Wang, Yichen Xu, Leiyao Cui, Shaotong Fu, Yixin Zhu, Ruihua Zhang
Main category: cs.LG
TL;DR: DrivAerStar is a dataset of 12,000 industrial-grade automotive CFD simulations that bridges academic ML research and industrial CFD practice, achieving wind tunnel validation accuracy below 1.04% - a five-fold improvement over existing datasets.
Details
Motivation: Vehicle aerodynamics optimization is critical for electric vehicle range and efficiency, but traditional approaches face trade-offs between computational expense (weeks per CFD simulation) and accuracy, while existing ML datasets have inadequate resolution and validation errors exceeding 5%.Method: Generated 12,000 industrial-grade CFD simulations using STAR-CCM+ software, systematically exploring three vehicle configurations through 20 CAD parameters via Free Form Deformation algorithms, including complete engine compartments and cooling systems with realistic internal airflow, using refined mesh strategies with strict wall y+ control.
Result: Achieved wind tunnel validation accuracy below 1.04% - a five-fold improvement over existing datasets. Benchmarks show models trained on this data achieve production-ready accuracy while reducing computational costs from weeks to minutes.
Conclusion: DrivAerStar establishes a new standard for data-driven aerodynamic optimization in automotive development and demonstrates a paradigm for integrating high-fidelity physics simulations with AI across engineering disciplines where computational constraints limit innovation.
Abstract: Vehicle aerodynamics optimization has become critical for automotive electrification, where drag reduction directly determines electric vehicle range and energy efficiency. Traditional approaches face an intractable trade-off: computationally expensive Computational Fluid Dynamics (CFD) simulations requiring weeks per design iteration, or simplified models that sacrifice production-grade accuracy. While machine learning offers transformative potential, existing datasets exhibit fundamental limitations – inadequate mesh resolution, missing vehicle components, and validation errors exceeding 5% – preventing deployment in industrial workflows. We present DrivAerStar, comprising 12,000 industrial-grade automotive CFD simulations generated using STAR-CCM+${}^\unicode{xAE}$ software. The dataset systematically explores three vehicle configurations through 20 Computer Aided Design (CAD) parameters via Free Form Deformation (FFD) algorithms, including complete engine compartments and cooling systems with realistic internal airflow. DrivAerStar achieves wind tunnel validation accuracy below 1.04% – a five-fold improvement over existing datasets – through refined mesh strategies with strict wall $y^+$ control. Benchmarks demonstrate that models trained on this data achieve production-ready accuracy while reducing computational costs from weeks to minutes. This represents the first dataset bridging academic machine learning research and industrial CFD practice, establishing a new standard for data-driven aerodynamic optimization in automotive development. Beyond automotive applications, DrivAerStar demonstrates a paradigm for integrating high-fidelity physics simulations with Artificial Intelligence (AI) across engineering disciplines where computational constraints currently limit innovation.
[378] When In Doubt, Abstain: The Impact of Abstention on Strategic Classification
Lina Alkarmi, Ziyuan Huang, Mingyan Liu
Main category: cs.LG
TL;DR: This paper studies how classifier abstention (declining decisions when confidence is low) affects strategic classification, showing it improves accuracy and deters manipulation by making it costlier for agents to game the system.
Details
Motivation: Algorithmic decision making is vulnerable to strategic manipulation, and prior research showed abstention improves classifier accuracy. This paper explores how abstention impacts strategic agents' behavior and how principals should optimally use it.Method: Model the interaction as a Stackelberg game where a principal (classifier) announces its decision policy first, then strategic agents manipulate their observable features to receive desired outcomes. Focus on binary classifiers with feature manipulation.
Result: Optimal abstention ensures principal’s utility is no worse than in non-abstention settings, even with strategic agents. Abstention improves accuracy and serves as a manipulation deterrent, making it costlier especially for less qualified agents to achieve positive outcomes.
Conclusion: Abstention is a valuable tool for reducing negative effects of strategic behavior in algorithmic decision making systems, providing both accuracy improvements and manipulation deterrence.
Abstract: Algorithmic decision making is increasingly prevalent, but often vulnerable to strategic manipulation by agents seeking a favorable outcome. Prior research has shown that classifier abstention (allowing a classifier to decline making a decision due to insufficient confidence) can significantly increase classifier accuracy. This paper studies abstention within a strategic classification context, exploring how its introduction impacts strategic agents’ responses and how principals should optimally leverage it. We model this interaction as a Stackelberg game where a principal, acting as the classifier, first announces its decision policy, and then strategic agents, acting as followers, manipulate their features to receive a desired outcome. Here, we focus on binary classifiers where agents manipulate observable features rather than their true features, and show that optimal abstention ensures that the principal’s utility (or loss) is no worse than in a non-abstention setting, even in the presence of strategic agents. We also show that beyond improving accuracy, abstention can also serve as a deterrent to manipulation, making it costlier for agents, especially those less qualified, to manipulate to achieve a positive outcome when manipulation costs are significant enough to affect agent behavior. These results highlight abstention as a valuable tool for reducing the negative effects of strategic behavior in algorithmic decision making systems.
[379] I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models
Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
Main category: cs.LG
TL;DR: I-RAVEN-X is a symbolic benchmark that extends I-RAVEN to test generalization and robustness in analogical/mathematical reasoning for LLMs and LRMs, featuring increased complexity, wider attribute ranges, and perceptual uncertainty.
Details
Motivation: To evaluate how well Large Language Models and Large Reasoning Models handle generalization and robustness in analogical and mathematical reasoning tasks, particularly under more complex and uncertain conditions.Method: Extends I-RAVEN benchmark by increasing operand complexity, expanding attribute ranges, and introducing perceptual uncertainty to create more challenging reasoning scenarios.
Result: LRMs outperform LLMs in productivity on longer reasoning relations and systematicity on wider attribute ranges, but both struggle significantly with reasoning under uncertainty and exploring multiple probabilistic outcomes.
Conclusion: While LRMs show improvements over LLMs in certain aspects, current models still face substantial challenges in handling uncertainty and probabilistic reasoning, indicating limitations in their reasoning capabilities.
Abstract: We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. Compared to LLMs, empirical results show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
[380] DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis
Shruti Sarika Chakraborty, Peter Minary
Main category: cs.LG
TL;DR: DARTS-GT redesigns Graph Transformers with asymmetric attention and differentiable architecture search to enable depth-specific GNN operator selection, achieving SOTA performance while providing quantitative interpretability through causal ablation metrics.
Details
Motivation: Current Graph Transformers have rigid designs with fixed GNN types across all layers and lack quantifiable interpretability, making it hard to distinguish meaningful patterns from spurious correlations.Method: Redesign GT attention through asymmetry (queries from node features, keys/values from GNN transformations) and use Differentiable Architecture Search (DARTS) to select optimal GNN operators at each layer.
Result: DARTS-GT achieves state-of-the-art on four datasets and remains competitive on others across eight benchmarks, with discovered architectures revealing dataset-specific patterns and consistently producing more interpretable models than baselines.
Conclusion: Graph Transformers need not choose between performance and interpretability - heterogeneous architectures found by DARTS-GT deliver both SOTA performance and improved interpretability, with causal analysis showing visual attention salience doesn’t always correlate with actual importance.
Abstract: Graph Transformers (GTs) have emerged as powerful architectures for graph-structured data, yet remain constrained by rigid designs and lack quantifiable interpretability. Current state-of-the-art GTs commit to fixed GNN types across all layers, missing potential benefits of depth-specific component selection, while their complex architectures become opaque where performance gains cannot be distinguished between meaningful patterns and spurious correlations. We redesign GT attention through asymmetry, decoupling structural encoding from feature representation: queries derive from node features while keys and values come from GNN transformations. Within this framework, we use Differentiable ARchiTecture Search (DARTS) to select optimal GNN operators at each layer, enabling depth-wise heterogeneity inside transformer attention itself (DARTS-GT). To understand discovered architectures, we develop the first quantitative interpretability framework for GTs through causal ablation. Our metrics (Head-deviation, Specialization, and Focus), identify which heads and nodes drive predictions while enabling model comparison. Experiments across eight benchmarks show DARTS-GT achieves state-of-the-art on four datasets while remaining competitive on others, with discovered architectures revealing dataset-specific patterns. Our interpretability analysis reveals that visual attention salience and causal importance do not always correlate, indicating widely used visualization approaches may miss components that actually matter. Crucially, heterogeneous architectures found by DARTS-GT consistently produced more interpretable models than baselines, establishing that Graph Transformers need not choose between performance and interpretability.
[381] SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks
Amin Omidvar
Main category: cs.LG
TL;DR: SmartMixed is a two-phase training strategy that enables neural networks to learn optimal per-neuron activation functions from a pool of candidates while maintaining computational efficiency at inference.
Details
Motivation: Most neural networks use fixed, uniform activation functions across all neurons, which may not be optimal. The paper aims to allow networks to adaptively learn the best activation functions for each neuron.Method: Two-phase training: Phase 1 uses differentiable hard-mixture mechanism for neurons to select from candidate activation functions (ReLU, Sigmoid, Tanh, Leaky ReLU, ELU, SELU). Phase 2 fixes each neuron’s activation function based on learned selection for efficient inference.
Result: Evaluation on MNIST dataset shows neurons in different layers exhibit distinct preferences for activation functions, revealing functional diversity within neural architectures.
Conclusion: SmartMixed successfully enables networks to learn optimal per-neuron activation functions while preserving computational efficiency, providing insights into activation function diversity across network layers.
Abstract: The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky ReLU, ELU, SELU) using a differentiable hard-mixture mechanism. In the second phase, each neuron’s activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of varying depths. The analysis shows that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures.
[382] A tutorial on discovering and quantifying the effect of latent causal sources of multimodal EHR data
Marco Barbero-Mota, Eric V. Strobl, John M. Still, William W. Stead, Thomas A. Lasko
Main category: cs.LG
TL;DR: A causal machine learning pipeline for discovering latent causal sources from EHR data and quantifying their effects on clinical outcomes.
Details
Motivation: To address the challenge of analyzing imperfect multimodal clinical data and discovering causal relationships at scale in electronic health records.Method: Process imperfect multimodal clinical data, decompose into probabilistic independent latent sources, and train task-specific causal models to estimate individual causal effects.
Result: Successfully applied in two real-world applications, demonstrating versatility and utility for medical discovery at scale.
Conclusion: The pipeline provides a generalizable approach for causal discovery and effect estimation from large-scale EHR data, showing promise for medical research applications.
Abstract: We provide an accessible description of a peer-reviewed generalizable causal machine learning pipeline to (i) discover latent causal sources of large-scale electronic health records observations, and (ii) quantify the source causal effects on clinical outcomes. We illustrate how imperfect multimodal clinical data can be processed, decomposed into probabilistic independent latent sources, and used to train taskspecific causal models from which individual causal effects can be estimated. We summarize the findings of the two real-world applications of the approach to date as a demonstration of its versatility and utility for medical discovery at scale.
[383] TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, Frank Hutter
Main category: cs.LG
TL;DR: TempoPFN is a univariate time series foundation model using linear RNNs pre-trained on synthetic data, achieving competitive zero-shot forecasting performance while being more efficient than existing approaches.
Details
Motivation: Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on benchmarks.Method: Uses GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, pre-trained exclusively on synthetic data from a comprehensive pipeline including stochastic differential equations, Gaussian processes, and audio synthesis.
Result: Achieves top-tier competitive performance on Gift-Eval benchmark, outperforming all existing synthetic-only approaches and surpassing most models trained on real-world data, while being more efficient.
Conclusion: Provides a reproducible foundation for future research with open-sourced data generation pipeline and training code, demonstrating that synthetic-only pre-training can achieve competitive zero-shot forecasting performance.
Abstract: Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
[384] Accelerating Data Generation for Nonlinear temporal PDEs via homologous perturbation in solution space
Lei Liu, Zhenxin Huang, Hong Wang, huanshuo dong, Haiyang Xin, Hongwei Zhao, Bin Li
Main category: cs.LG
TL;DR: HOPSS is a novel data generation algorithm that creates training datasets for neural operators with fewer time steps, reducing computational overhead while maintaining comparable precision to traditional methods.
Details
Motivation: Traditional data generation for neural operators requires thousands of time steps via numerical methods, creating heavy computational and temporal overheads that far exceed training requirements.Method: The method involves: 1) obtaining base solution functions from reliable solvers with many time steps, 2) downsampling to align with training datasets, 3) applying “homologous perturbation” by combining two solution functions with random noise, and 4) computing RHS variations to form new solution pairs.
Result: HOPSS significantly reduces time complexity - generating 10,000 samples for Navier-Stokes equation in approximately 10% of traditional methods’ time while achieving comparable model training performance.
Conclusion: HOPSS provides an efficient alternative to traditional data generation methods for neural operators, accelerating dataset generation while preserving the precision needed for effective model training.
Abstract: Data-driven deep learning methods like neural operators have advanced in solving nonlinear temporal partial differential equations (PDEs). However, these methods require large quantities of solution pairs\u2014the solution functions and right-hand sides (RHS) of the equations. These pairs are typically generated via traditional numerical methods, which need thousands of time steps iterations far more than the dozens required for training, creating heavy computational and temporal overheads. To address these challenges, we propose a novel data generation algorithm, called HOmologous Perturbation in Solution Space (HOPSS), which directly generates training datasets with fewer time steps rather than following the traditional approach of generating large time steps datasets. This algorithm simultaneously accelerates dataset generation and preserves the approximate precision required for model training. Specifically, we first obtain a set of base solution functions from a reliable solver, usually with thousands of time steps, and then align them in time steps with training datasets by downsampling. Subsequently, we propose a “homologous perturbation” approach: by combining two solution functions (one as the primary function, the other as a homologous perturbation term scaled by a small scalar) with random noise, we efficiently generate comparable-precision PDE data points. Finally, using these data points, we compute the variation in the original equation’s RHS to form new solution pairs. Theoretical and experimental results show HOPSS lowers time complexity. For example, on the Navier-Stokes equation, it generates 10,000 samples in approximately 10% of traditional methods’ time, with comparable model training performance.
[385] On Uncertainty Calibration for Equivariant Functions
Edward Berman, Jacob Ginesin, Marco Pacini, Robin Walters
Main category: cs.LG
TL;DR: This paper presents a theoretical framework linking equivariance to uncertainty estimation, proving bounds on calibration errors for equivariant models and showing how symmetry mismatch causes miscalibration.
Details
Motivation: To understand the relationship between equivariance and model calibration, which hasn't been studied before, particularly in data-sparse domains where both equivariant networks and uncertainty estimation are important.Method: Developed theoretical bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, complemented by numerical experiments on real and simulated datasets.
Result: Proved lower and upper bounds on calibration errors for equivariant models, showing how symmetry mismatch leads to miscalibration in both classification and regression tasks.
Conclusion: Established a theoretical connection between equivariance and uncertainty estimation, revealing generalization limits of equivariant models and the impact of symmetry mismatch on model calibration.
Abstract: Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally equivariance and model calibration, has yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.
[386] Aeolus: A Multi-structural Flight Delay Dataset
Lin Xu, Xinyun Yuan, Yuxuan Liang, Suwan Yin, Yuankai Wu
Main category: cs.LG
TL;DR: Aeolus is a large-scale multi-modal flight delay dataset with three aligned modalities: tabular data, flight chains, and flight network graphs, designed to advance flight delay prediction and foundation models for tabular data.
Details
Motivation: Existing flight delay datasets are limited to flat tabular structures and fail to capture spatiotemporal dynamics of delay propagation, creating a gap for comprehensive delay modeling.Method: The dataset provides three modalities: (1) tabular data with operational, meteorological, and airport features for 50M+ flights; (2) flight chains modeling delay propagation along sequential legs; (3) flight network graphs encoding shared aircraft, crew, and airport connections.
Result: Aeolus supports regression, classification, temporal structure modeling, and graph learning tasks, serving as a unified benchmark across tabular, sequential, and graph modalities with temporal splits and leakage prevention.
Conclusion: Aeolus fills a key gap for both domain-specific flight delay modeling and general-purpose structured data research, providing comprehensive tools and baselines for reproducible evaluation.
Abstract: We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airportlevel features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research.Our source code and data can be accessed at https://github.com/Flnny/Delay-data
[387] Generating Auxiliary Tasks with Reinforcement Learning
Judah Goldfeder, Matthew So, Hod Lipson
Main category: cs.LG
TL;DR: RL-AUX uses reinforcement learning to automatically generate auxiliary tasks and optimize their weights, outperforming human-designed tasks and matching bi-level optimization methods without the computational overhead.
Details
Motivation: Human-labeled auxiliary tasks are costly and require domain expertise, while existing meta-learning approaches suffer from computational complexity due to bi-level optimization.Method: Proposed RL-AUX framework that uses reinforcement learning to dynamically create auxiliary tasks by assigning labels to training examples and learning per-example weights for auxiliary loss.
Result: On CIFAR-100 grouped into 20 superclasses, RL-AUX outperformed human-labeled auxiliary tasks and matched the performance of a prominent bi-level optimization baseline, with similar strong results on other classification datasets.
Conclusion: Reinforcement learning is a viable approach for generating effective auxiliary tasks without the computational overhead of traditional meta-learning methods.
Abstract: Auxiliary Learning (AL) is a form of multi-task learning in which a model trains on auxiliary tasks to boost performance on a primary objective. While AL has improved generalization across domains such as navigation, image classification, and NLP, it often depends on human-labeled auxiliary tasks that are costly to design and require domain expertise. Meta-learning approaches mitigate this by learning to generate auxiliary tasks, but typically rely on gradient based bi-level optimization, adding substantial computational and implementation overhead. We propose RL-AUX, a reinforcement-learning (RL) framework that dynamically creates auxiliary tasks by assigning auxiliary labels to each training example, rewarding the agent whenever its selections improve the performance on the primary task. We also explore learning per-example weights for the auxiliary loss. On CIFAR-100 grouped into 20 superclasses, our RL method outperforms human-labeled auxiliary tasks and matches the performance of a prominent bi-level optimization baseline. We present similarly strong results on other classification datasets. These results suggest RL is a viable path to generating effective auxiliary tasks.
[388] On the limitation of evaluating machine unlearning using only a single training seed
Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma
Main category: cs.LG
TL;DR: The paper shows that evaluating machine unlearning methods by running them multiple times from the same trained model can produce non-representative results due to high sensitivity to training seed variability.
Details
Motivation: Current empirical comparisons of machine unlearning algorithms often run multiple trials from the same trained model, but this approach may not capture the full variability in performance across different model initializations.Method: The authors demonstrate through analysis that machine unlearning methods can be highly sensitive to the random seed used during model training, even for the same architecture and dataset.
Result: The study reveals that evaluating unlearning performance only from a single trained model can give misleading results, as different training seeds can lead to significantly different unlearning outcomes.
Conclusion: Empirical comparisons of machine unlearning algorithms should account for variability across different model training seeds to ensure representative performance assessment.
Abstract: Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because – even for the same architecture and same dataset – some MU methods can be highly sensitive to the choice of random number seed used for model training. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
[389] Hankel Singular Value Regularization for Highly Compressible State Space Models
Paul Schwerdtner, Jules Berman, Benjamin Peherstorfer
Main category: cs.LG
TL;DR: Regularizing Hankel singular values in state space models makes them more compressible while maintaining accuracy, achieving up to 10x compression on long-range sequence tasks.
Details
Motivation: Deep neural networks using state space models are effective for long-range sequence tasks but are challenging to compress after training, limiting their practical deployment.Method: Proposed Hankel singular value regularization to encourage fast decay of singular values, making models compressible. Developed an efficient algorithm to compute Hankel singular values during training by exploiting block-diagonal structure of system matrices.
Result: Experiments on Long Range Arena benchmarks show regularized state space layers are up to 10x more compressible than standard state space layers while maintaining high accuracy.
Conclusion: Hankel singular value regularization enables highly compressible state space models without sacrificing performance, making them more practical for deployment.
Abstract: Deep neural networks using state space models as layers are well suited for long-range sequence tasks but can be challenging to compress after training. We use that regularizing the sum of Hankel singular values of state space models leads to a fast decay of these singular values and thus to compressible models. To make the proposed Hankel singular value regularization scalable, we develop an algorithm to efficiently compute the Hankel singular values during training iterations by exploiting the specific block-diagonal structure of the system matrices that we use in our state space model parametrization. Experiments on Long Range Arena benchmarks demonstrate that the regularized state space layers are up to 10$\times$ more compressible than standard state space layers while maintaining high accuracy.
[390] Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Muhammad Faraz Ul Abrar, Nicolò Michelusi
Main category: cs.LG
TL;DR: OTA-FL with SGD for non-convex objectives under wireless heterogeneity, allowing structured bias to optimize bias-variance trade-off via SCA-based power control.
Details
Motivation: Existing OTA-FL designs enforce zero-bias updates under homogeneous wireless conditions, which are suboptimal under heterogeneous scenarios and constrained by weakest devices. Prior analyses mainly address convex objectives while modern AI models are non-convex.Method: Develop OTA-FL SGD updates with structured time-invariant model bias and reduced variance. Formulate non-convex joint OTA power-control design and solve using successive convex approximation (SCA) algorithm requiring only statistical CSI.
Result: Derived finite-time stationarity bound revealing bias-variance trade-off. Experiments on non-convex image classification show SCA-based design accelerates convergence via optimized bias and improves generalization over prior baselines.
Conclusion: Structured bias in OTA-FL under wireless heterogeneity enables optimized bias-variance trade-off, leading to faster convergence and better generalization for non-convex objectives compared to zero-bias approaches.
Abstract: Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
[391] Towards a Generalizable AI for Materials Discovery: Validation through Immersion Coolant Screening
Hyunseung Kim, Dae-Woong Jeong, Changyoung Park, Won-Ji Lee, Ha-Eun Lee, Ji-Hye Lee, Rodrigo Hormazabal, Sung Moon Ko, Sumin Lee, Soorin Yim, Chanhui Lee, Sehui Han, Sang-Ho Cha, Woohyung Lim
Main category: cs.LG
TL;DR: GATE is a generalizable AI framework that learns 34 physicochemical properties simultaneously, enabling multi-property screening without retraining for new tasks.
Details
Motivation: Most AI models for materials discovery are problem-specific and require additional data collection and retraining for each new property, limiting their practical utility.Method: GATE jointly learns multiple properties by aligning them in a shared geometric space to capture cross-property correlations and reduce disjoint-property bias.
Result: GATE identified 92,861 promising molecules for immersion cooling fluids, with four experimentally validated showing strong agreement with measurements and performance comparable to commercial coolants.
Conclusion: GATE establishes itself as a generalizable AI platform that can be readily applied across diverse materials discovery tasks without problem-specific reconfiguration.
Abstract: Artificial intelligence (AI) has emerged as a powerful accelerator of materials discovery, yet most existing models remain problem-specific, requiring additional data collection and retraining for each new property. Here we introduce and validate GATE (Geometrically Aligned Transfer Encoder) – a generalizable AI framework that jointly learns 34 physicochemical properties spanning thermal, electrical, mechanical, and optical domains. By aligning these properties within a shared geometric space, GATE captures cross-property correlations that reduce disjoint-property bias – a key factor causing false positives in multi-criteria screening. To demonstrate its generalizable utility, GATE – without any problem-specific model reconfiguration – applied to the discovery of immersion cooling fluids for data centers, a stringent real-world challenge defined by the Open Compute Project (OCP). Screening billions of candidates, GATE identified 92,861 molecules as promising for practical deployment. Four were experimentally or literarily validated, showing strong agreement with wet-lab measurements and performance comparable to or exceeding a commercial coolant. These results establish GATE as a generalizable AI platform readily applicable across diverse materials discovery tasks.
[392] Faithful and Fast Influence Function via Advanced Sampling
Jungyeon Koh, Hyeonsu Lyu, Jonggyu Jang, Hyun Jong Yang
Main category: cs.LG
TL;DR: Proposes two advanced sampling techniques (feature-based and logit-based) to improve influence function estimation by selecting representative subsets, reducing computation time by 30.1% and memory usage by 42.2% while maintaining accuracy.
Details
Motivation: Influence functions require computing Hessians which is resource-intensive for entire datasets. Random sampling leads to inconsistent estimates due to high variance in sample configurations.Method: Two advanced sampling techniques based on features and logits that select small but representative subsets by considering stochastic distribution of features or logits.
Result: Reduces computation time by 30.1% and memory usage by 42.2%, or improves F1-score by 2.5% compared to baseline in class removal experiments.
Conclusion: The proposed sampling methods provide more accurate and efficient influence function estimations by selecting representative subsets, addressing the limitations of random sampling.
Abstract: How can we explain the influence of training data on black-box models? Influence functions (IFs) offer a post-hoc solution by utilizing gradients and Hessians. However, computing the Hessian for an entire dataset is resource-intensive, necessitating a feasible alternative. A common approach involves randomly sampling a small subset of the training data, but this method often results in highly inconsistent IF estimates due to the high variance in sample configurations. To address this, we propose two advanced sampling techniques based on features and logits. These samplers select a small yet representative subset of the entire dataset by considering the stochastic distribution of features or logits, thereby enhancing the accuracy of IF estimations. We validate our approach through class removal experiments, a typical application of IFs, using the F1-score to measure how effectively the model forgets the removed class while maintaining inference consistency on the remaining classes. Our method reduces computation time by 30.1% and memory usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
[393] A data free neural operator enabling fast inference of 2D and 3D Navier Stokes equations
Junho Choi, Teng-Yuan Chang, Namjung Kim, Youngjoon Hong
Main category: cs.LG
TL;DR: A data-free neural operator for Navier-Stokes equations that eliminates need for training data, enabling real-time ensemble forecasting with high accuracy in 2D and 3D flows.
Details
Motivation: Traditional ensemble simulations of high-dimensional flow models are computationally expensive for real-time applications, while existing neural operators require costly data and struggle with 3D generalization.Method: Physics-grounded neural operator architecture that takes initial/boundary conditions and forcing functions as inputs, eliminating need for paired solution data during training.
Result: Surpasses prior neural operators in accuracy across 2D benchmarks and 3D test cases, achieves greater efficiency than conventional solvers for ensembles, and successfully solves 3D Navier-Stokes equations - a first for data-free neural operators.
Conclusion: This approach establishes a practical pathway toward data-free, high-fidelity PDE surrogates for end-to-end scientific simulation and prediction by combining numerically grounded architecture with ML scalability.
Abstract: Ensemble simulations of high-dimensional flow models (e.g., Navier Stokes type PDEs) are computationally prohibitive for real time applications. Neural operators enable fast inference but are limited by costly data requirements and poor generalization to 3D flows. We present a data-free operator network for the Navier Stokes equations that eliminates the need for paired solution data and enables robust, real time inference for large ensemble forecasting. The physics-grounded architecture takes initial and boundary conditions as well as forcing functions, yielding solutions robust to high variability and perturbations. Across 2D benchmarks and 3D test cases, the method surpasses prior neural operators in accuracy and, for ensembles, achieves greater efficiency than conventional numerical solvers. Notably, it delivers accurate solutions of the three dimensional Navier Stokes equations, a regime not previously demonstrated for data free neural operators. By uniting a numerically grounded architecture with the scalability of machine learning, this approach establishes a practical pathway toward data free, high fidelity PDE surrogates for end to end scientific simulation and prediction.
[394] CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices
Xuchen Feng, Siyu Liao
Main category: cs.LG
TL;DR: Introduces CDFlow, a novel invertible linear layer using circulant-diagonal matrix products that reduces parameter complexity from O(n²) to O(mn) and enables efficient O(mn log n) inversion and O(mn) log-determinant computation via FFT.
Details
Motivation: To design efficient linear layers for normalizing flows that maintain expressiveness while enabling fast computation of Jacobian determinants and inverses, overcoming computational bottlenecks in existing approaches.Method: Proposes a linear layer based on product of circulant and diagonal matrices, leveraging Fast Fourier Transform for efficient matrix inversion and log-determinant computation. Builds CDFlow framework using this layer.
Result: Achieves strong density estimation on natural image datasets, effectively models periodic data, and significantly accelerates key flow operations with practical scalability benefits.
Conclusion: The circulant-diagonal decomposition provides an effective trade-off between expressiveness and computational efficiency, enabling scalable normalizing flows with practical performance improvements.
Abstract: Normalizing flows are deep generative models that enable efficient likelihood estimation and sampling through invertible transformations. A key challenge is to design linear layers that enhance expressiveness while maintaining efficient computation of the Jacobian determinant and inverse. We introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition reduces parameter complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$ using $m$ diagonal matrices and $m-1$ circulant matrices while still approximating general linear transformations. By leveraging the Fast Fourier Transform, our approach reduces the time complexity of matrix inversion from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn\log n)$ and that of computing the log-determinant from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn)$, where $n$ is the input dimension. We build upon this layer to develop Circulant-Diagonal Flow (CDFlow), which achieves strong density estimation on natural image datasets and effectively models data with inherent periodic structure. Furthermore, CDFlow significantly accelerates key operations in normalizing flows, providing practical benefits for scalable generative modeling.
[395] Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training
Hong Wang, Haiyang Xin, Jie Wang, Xuanze Yang, Fei Zha, Huanshuo Dong, Yan Jiang
Main category: cs.LG
TL;DR: Proposes MoE-POT, a sparse Mixture-of-Experts architecture for PDE solving that efficiently scales parameters while controlling inference costs, achieving 40% error reduction with fewer activated parameters.
Details
Motivation: Address data scarcity and performance limitations in neural operators for PDEs, while overcoming heterogeneity in PDE datasets and high inference costs of dense pre-training models.Method: Uses layer-wise router-gating network to dynamically select 4 routed experts from 16 expert networks plus 2 shared experts, with weighted average output from activated experts. Pre-trained on 6 PDE datasets with 30M to 0.5B parameters.
Result: Model with 90M activated parameters achieves up to 40% reduction in zero-shot error compared to existing models with 120M parameters. Router decisions can infer dataset types, validating MoE effectiveness.
Conclusion: MoE-POT provides an efficient sparse architecture for PDE solving that scales parameters effectively while maintaining low inference costs and achieving superior performance through expert specialization.
Abstract: Pre-training has proven effective in addressing data scarcity and performance limitations in solving PDE problems with neural operators. However, challenges remain due to the heterogeneity of PDE datasets in equation types, which leads to high errors in mixed training. Additionally, dense pre-training models that scale parameters by increasing network width or depth incur significant inference costs. To tackle these challenges, we propose a novel Mixture-of-Experts Pre-training Operator Transformer (MoE-POT), a sparse-activated architecture that scales parameters efficiently while controlling inference costs. Specifically, our model adopts a layer-wise router-gating network to dynamically select 4 routed experts from 16 expert networks during inference, enabling the model to focus on equation-specific features. Meanwhile, we also integrate 2 shared experts, aiming to capture common properties of PDE and reduce redundancy among routed experts. The final output is computed as the weighted average of the results from all activated experts. We pre-train models with parameters from 30M to 0.5B on 6 public PDE datasets. Our model with 90M activated parameters achieves up to a 40% reduction in zero-shot error compared with existing models with 120M activated parameters. Additionally, we conduct interpretability analysis, showing that dataset types can be inferred from router-gating network decisions, which validates the rationality and effectiveness of the MoE architecture.
[396] Offline Clustering of Preference Learning with Active-data Augmentation
Jingyuan Liu, Fatemeh Ghaffari, Xuchuang Wang, Xutong Liu, Mohammad Hajiesmaili, Carlee Joe-Wong
Main category: cs.LG
TL;DR: This paper addresses offline preference learning with multiple users having different preferences, proposing clustering-based methods to handle data imbalance and improve utility for test users through both pure offline and active-data augmentation approaches.
Details
Motivation: Real-world preference learning faces challenges with limited user interactions and diverse user preferences, where offline data is often imbalanced across different preference dimensions and users.Method: Proposes Off-C²PL for pure offline setting using clustering to aggregate data from multiple users, and A²-Off-C²PL for active-data augmentation that selectively collects samples targeting least-informative preference dimensions.
Result: Theoretical analysis shows suboptimality bounds capturing tradeoff between sample noise and bias, and proves that actively collected samples are more effective than offline ones. Simulations on synthetic and real-world datasets validate the theoretical results.
Conclusion: The proposed clustering-based framework effectively handles offline preference learning with multiple users and imbalanced data, with active-data augmentation further improving performance by targeting information-poor dimensions.
Abstract: Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially under scenarios where offline data is imbalanced across dimensions, and (2) handling the imbalanced offline data where some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we first propose Off-C$^2$PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of inbalanced data, we extend our framework to the setting with active-data augmentation where the learner is allowed to select a limited number of additional active-data for the test user based on the cluster structure learned by Off-C$^2$PL. In this setting, our second algorithm, A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative dimensions of the test user’s preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.
[397] An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
Chuyan Chen, Chenyang Ma, Zhangxin Li, Yutong He, Yanjie Dong, Kun Yuan
Main category: cs.LG
TL;DR: ARC-Top-K is a novel gradient compressor that combines the benefits of Top-K and Rand-K methods by enabling All-Reduce operations while preserving globally significant gradient information through lightweight sketches.
Details
Motivation: Existing gradient sparsification methods face limitations: Rand-K loses structural information and performs poorly, while Top-K loses contraction property and requires costly All-Gather operations. Communication remains a bottleneck in distributed ML.Method: ARC-Top-K uses lightweight sketches of gradients to align sparsity patterns across nodes, enabling index-free All-Reduce while preserving globally significant information. It’s combined with momentum error feedback (EF21M).
Result: ARC-Top-K is provably contractive and achieves linear speedup with sharper convergence rates than original EF21M. Empirically, it matches Top-K accuracy while reducing wall-clock training time by up to 60.7%.
Conclusion: ARC-Top-K offers an efficient and scalable solution that combines Rand-K’s robustness with Top-K’s strong performance, addressing communication bottlenecks in distributed ML.
Abstract: Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
cs.MA
[398] FinPos: A Position-Aware Trading Agent System for Real Financial Markets
Bijia Liu, Ronghao Dang
Main category: cs.MA
TL;DR: FinPos is a position-aware trading agent system that uses dual decision agents and multi-timescale rewards to manage continuous positions in financial markets, outperforming state-of-the-art methods.
Details
Motivation: Current trading agents focus on single-step trading tasks and lack awareness of continuous position management, failing to simulate realistic market conditions.Method: Developed FinPos trading agent system with dual decision agents to interpret market information professionally and employ multi-timescale rewards to balance short-term fluctuations against long-term trends.
Result: Extensive experiments show FinPos surpasses state-of-the-art trading agents in position-aware trading tasks that closely mirror real market conditions.
Conclusion: LLM-centered agent systems have significant unexplored potential in long-term market decision-making, particularly for continuous position management.
Abstract: The exceptional potential of large language models (LLMs) in handling text information has garnered significant attention in the field of financial trading. However, current trading agents primarily focus on single-step trading tasks and lack awareness of continuous position management. Therefore, we propose a position-aware trading task designed to simulate a more realistic market. To address this task, we develop a trading agent system, FinPos, optimized for position management. FinPos is able to interpret various types of market information from a professional perspective, providing a reliable basis for positioning decisions. To mitigate the substantial market risks arising from position fluctuations, FinPos employs dual decision agents. Furthermore, the continuous nature of position management necessitates our adoption of multi-timescale rewards, which in turn empowers FinPos to effectively balance short-term fluctuations against long-term trends. Extensive experiments demonstrate that FinPos surpasses state-of-the-art trading agents in the position-aware trading task, which closely mirrors real market conditions. More importantly, our findings reveal that LLM-centered agent systems exhibit a vast, largely unexplored potential in long-term market decision-making.
[399] PartnerMAS: An LLM Hierarchical Multi-Agent Framework for Business Partner Selection on High-Dimensional Features
Lingyao Li, Haolun Wu, Zhenkun Li, Jiabei Hu, Yu Wang, Xiaoshan Huang, Wenyue Hua, Wenqian Wang
Main category: cs.MA
TL;DR: PartnerMAS is a hierarchical multi-agent framework that improves high-dimensional decision-making by decomposing evaluation into planner, specialized, and supervisor agents, achieving 10-15% higher match rates than single-agent or debate-based systems.
Details
Motivation: High-dimensional decision-making tasks like business partner selection involve large candidate pools with diverse features, and existing single-agent or debate-style systems struggle with scalability and consistency in such settings.Method: A hierarchical multi-agent framework with three layers: Planner Agent designs strategies, Specialized Agents perform role-specific assessments, and Supervisor Agent integrates their outputs. Evaluated on a curated benchmark dataset of venture capital co-investments.
Result: Across 140 cases, PartnerMAS consistently outperforms single-agent and debate-based multi-agent baselines, achieving up to 10-15% higher match rates. Analysis shows planners respond to domain-informed prompts, specialists provide complementary feature coverage, and supervisors enable effective aggregation.
Conclusion: Structured collaboration among LLM agents generates more robust outcomes than scaling individual models, making PartnerMAS a promising framework for high-dimensional decision-making in data-rich domains.
Abstract: High-dimensional decision-making tasks, such as business partner selection, involve evaluating large candidate pools with heterogeneous numerical, categorical, and textual features. While large language models (LLMs) offer strong in-context reasoning capabilities, single-agent or debate-style systems often struggle with scalability and consistency in such settings. We propose PartnerMAS, a hierarchical multi-agent framework that decomposes evaluation into three layers: a Planner Agent that designs strategies, Specialized Agents that perform role-specific assessments, and a Supervisor Agent that integrates their outputs. To support systematic evaluation, we also introduce a curated benchmark dataset of venture capital co-investments, featuring diverse firm attributes and ground-truth syndicates. Across 140 cases, PartnerMAS consistently outperforms single-agent and debate-based multi-agent baselines, achieving up to 10–15% higher match rates. Analysis of agent reasoning shows that planners are most responsive to domain-informed prompts, specialists produce complementary feature coverage, and supervisors play an important role in aggregation. Our findings demonstrate that structured collaboration among LLM agents can generate more robust outcomes than scaling individual models, highlighting PartnerMAS as a promising framework for high-dimensional decision-making in data-rich domains.
cs.MM
[400] Mano Technical Report
Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
Main category: cs.MM
TL;DR: Mano is a robust GUI agent that uses a multi-modal foundation model with a three-stage training pipeline and achieves state-of-the-art performance on GUI benchmarks.
Details
Motivation: Automating GUI interactions is challenging due to visual complexity, dynamic environments, and multi-step reasoning needs. Existing VLM-based methods suffer from limited resolution, domain mismatch, and insufficient sequential decision-making.Method: Built on a multi-modal foundation model pre-trained on web/system data. Features simulated environment for high-fidelity data generation, three-stage training (supervised fine-tuning, offline RL, online RL), and verification module for error recovery.
Result: Achieves state-of-the-art performance on Mind2Web and OSWorld benchmarks with significant improvements in success rate and operational accuracy.
Conclusion: Provides insights into integrating RL with VLMs for GUI agents, emphasizing domain-specific data, iterative training, and holistic reward design for practical deployment.
Abstract: Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
eess.AS
[401] See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement
Jinting Wang, Jun Wang, Hei Victor Cheng, Li Liu
Main category: eess.AS
TL;DR: A novel speech-to-talking face method that directly generates high-quality talking face videos from speech alone, without needing source images or reference videos.
Details
Motivation: To overcome limitations of existing methods that require source images as appearance references and use source speech for motion generation, enabling direct speech-to-face synthesis.Method: Two-stage approach: 1) Speech-to-face portrait generation using speech-conditioned diffusion with facial prior and adaptive weighting, 2) Talking face generation by embedding expressive dynamics in diffusion latent space with region-enhancement for lip sync, plus Transformer-based codebook for high-resolution output.
Result: Outperforms existing methods on HDTF, VoxCeleb, and AVSpeech datasets, generating high-resolution, high-quality talking face videos from single speech input.
Conclusion: First method capable of generating high-resolution talking face videos exclusively from speech input, demonstrating superior performance over existing approaches.
Abstract: Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
[402] Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition
Amine Razig, Youssef Soulaymani, Loubna Benabbou, Pierre Cauchy
Main category: eess.AS
TL;DR: A multi-step attention-guided framework called Mask-Guided Classification (MGC) improves marine mammal monitoring by segmenting spectrograms to create soft masks of biological energy, then fusing these masks with raw inputs for robust classification across diverse environmental conditions.
Details
Motivation: Automated marine mammal monitoring faces extreme challenges including wide frequency range calls (low-frequency moans to ultrasonic clicks), overlapping calls, and variable anthropogenic/environmental noise in the St. Lawrence Estuary.Method: Multi-step framework that segments spectrograms to generate soft masks of biologically relevant energy, then fuses these masks with raw inputs via mid-level fusion for multi-band, denoised classification using image and mask embeddings.
Result: MGC improves signal discrimination, reduces false positives, maintains stable performance under distributional shifts (OOD), and consistently outperforms baseline architectures with substantial accuracy gains on both in-distribution and OOD data.
Conclusion: MGC learns transferable representations rather than overfitting to specific transformations, demonstrating robustness and suitability for large-scale, real-world biodiversity monitoring across diverse environmental conditions.
Abstract: Automated monitoring of marine mammals in the St. Lawrence Estuary faces extreme challenges: calls span low-frequency moans to ultrasonic clicks, often overlap, and are embedded in variable anthropogenic and environmental noise. We introduce a multi-step, attention-guided framework that first segments spectrograms to generate soft masks of biologically relevant energy and then fuses these masks with the raw inputs for multi-band, denoised classification. Image and mask embeddings are integrated via mid-level fusion, enabling the model to focus on salient spectrogram regions while preserving global context. Using real-world recordings from the Saguenay St. Lawrence Marine Park Research Station in Canada, we demonstrate that segmentation-driven attention and mid-level fusion improve signal discrimination, reduce false positive detections, and produce reliable representations for operational marine mammal monitoring across diverse environmental conditions and signal-to-noise ratios. Beyond in-distribution evaluation, we further assess the generalization of Mask-Guided Classification (MGC) under distributional shifts by testing on spectrograms generated with alternative acoustic transformations. While high-capacity baseline models lose accuracy in this Out-of-distribution (OOD) setting, MGC maintains stable performance, with even simple fusion mechanisms (gated, concat) achieving comparable results across distributions. This robustness highlights the capacity of MGC to learn transferable representations rather than overfitting to a specific transformation, thereby reinforcing its suitability for large-scale, real-world biodiversity monitoring. We show that in all experimental settings, the MGC framework consistently outperforms baseline architectures, yielding substantial gains in accuracy on both in-distribution and OOD data.
[403] Beamforming in the Reproducing Kernel Domain Based on Spatial Differentiation
Takahiro Iwami, Naohisa Inoue, Akira Omoto
Main category: eess.AS
TL;DR: A novel beamforming framework using reproducing kernel domain and spatial differentiation of sound fields, enabling arbitrary beam patterns including non-axisymmetric ones.
Details
Motivation: To provide a unified interpretation of directional response as spatial differentiation and enable formulation of arbitrary beam patterns beyond conventional approaches.Method: Represent directional response using polynomial differential operators, derive reproducing kernel using Hobson’s theorem, and generalize spherical harmonic domain beamformers as spatial differential operators.
Result: The framework enables concise analytical expressions for beam patterns and generalizes conventional beamformers, with numerical simulations in 2D space confirming validity.
Conclusion: The proposed method successfully establishes a unified beamforming framework that clarifies theoretical structure and provides extensibility for arbitrary beam pattern design.
Abstract: This paper proposes a novel beamforming framework in the reproducing kernel domain, derived from a unified interpretation of directional response as spatial differentiation of the sound field. By representing directional response using polynomial differential operators, the proposed method enables the formulation of arbitrary beam patterns including non-axisymmetric. The derivation of the reproducing kernel associated with the interior fields is mathematically supported by Hobson’s theorem, which allows concise analytical expressions. Furthermore, the proposed framework generalizes conventional spherical harmonic domain beamformers by reinterpreting them as spatial differential operators, thereby clarifying their theoretical structure and extensibility. Three numerical simulations conducted in two-dimensional space confirm the validity of the method.
[404] Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm
Anselm Lohmann, Tomohiro Nakatani, Rintaro Ikeshita, Marc Delcroix, Shoko Araki, Simon Doclo
Main category: eess.AS
TL;DR: Proposes normalized ℓp-norm based reference microphone selection methods for Guided Source Separation (GSS) to improve distant ASR performance by considering both SNR and early-to-late-reverberant ratio differences.
Details
Motivation: Current GSS-based speech enhancement uses SNR-based reference microphone selection, which is optimal for noise reduction but may neglect differences in early-to-late-reverberant ratio (ELR) across spatially distributed microphones.Method: Two reference microphone selection methods: 1) using only normalized ℓp-norm, 2) combining normalized ℓp-norm and SNR to account for both SNR and ELR differences across microphones.
Result: Experimental evaluation on CHiME-8 distant ASR system shows the proposed ℓp-norm-based methods outperform baseline SNR method, reducing macro-average word error rate.
Conclusion: Normalized ℓp-norm based reference microphone selection methods improve GSS performance by better accounting for both noise and reverberation characteristics across distributed microphones.
Abstract: Guided Source Separation (GSS) is a popular front-end for distant automatic speech recognition (ASR) systems using spatially distributed microphones. When considering spatially distributed microphones, the choice of reference microphone may have a large influence on the quality of the output signal and the downstream ASR performance. In GSS-based speech enhancement, reference microphone selection is typically performed using the signal-to-noise ratio (SNR), which is optimal for noise reduction but may neglect differences in early-to-late-reverberant ratio (ELR) across microphones. In this paper, we propose two reference microphone selection methods for GSS-based speech enhancement that are based on the normalized $\ell_p$-norm, either using only the normalized $\ell_p$-norm or combining the normalized $\ell_p$-norm and the SNR to account for both differences in SNR and ELR across microphones. Experimental evaluation using a CHiME-8 distant ASR system shows that the proposed $\ell_p$-norm-based methods outperform the baseline method, reducing the macro-average word error rate.
eess.IV
[405] UP2D: Uncertainty-aware Progressive Pseudo-label Denoising for Source-Free Domain Adaptive Medical Image Segmentation
Quang-Khai Bui-Tran, Thanh-Huy Nguyen, Manh D. Ho, Thinh B. Lam, Vi Vu, Hoang-Thien Nguyen, Phat Huynh, Ulas Bagci
Main category: eess.IV
TL;DR: UP2D is a source-free domain adaptation framework for medical image segmentation that addresses noisy pseudo-labels and class imbalance through uncertainty-aware denoising, selective teacher updates, and entropy minimization.
Details
Motivation: Medical image segmentation models suffer performance degradation under domain shifts, especially when source data is unavailable due to privacy constraints, requiring effective adaptation without access to source images.Method: Single-stage student-teacher framework with three components: Refined Prototype Filtering to denoise pseudo-labels, Uncertainty-Guided EMA for selective teacher updates based on boundary uncertainty, and quantile-based entropy minimization focusing on ambiguous regions.
Result: Achieves state-of-the-art performance on three retinal fundus benchmarks, outperforming prior UDA and SFDA approaches with superior boundary precision in both standard and open-domain settings.
Conclusion: UP2D effectively addresses domain shift challenges in medical image segmentation without source data access through progressive pseudo-label denoising and uncertainty-aware adaptation strategies.
Abstract: Medical image segmentation models face severe performance drops under domain shifts, especially when data sharing constraints prevent access to source images. We present a novel Uncertainty-aware Progressive Pseudo-label Denoising (UP2D) framework for source-free domain adaptation (SFDA), designed to mitigate noisy pseudo-labels and class imbalance during adaptation. UP2D integrates three key components: (i) a Refined Prototype Filtering module that suppresses uninformative regions and constructs reliable class prototypes to denoise pseudo-labels, (ii) an Uncertainty-Guided EMA (UG-EMA) strategy that selectively updates the teacher model based on spatially weighted boundary uncertainty, and (iii) a quantile-based entropy minimization scheme that focuses learning on ambiguous regions while avoiding overconfidence on easy pixels. This single-stage student-teacher framework progressively improves pseudo-label quality and reduces confirmation bias. Extensive experiments on three challenging retinal fundus benchmarks demonstrate that UP2D achieves state-of-the-art performance across both standard and open-domain settings, outperforming prior UDA and SFDA approaches while maintaining superior boundary precision.
[406] R3GAN-based Optimal Strategy for Augmenting Small Medical Dataset
Tsung-Wei Pan, Chang-Hong Wu, Jung-Hua Wang, Ming-Jer Chen, Yu-Chiao Yi, Tsung-Hsien Lee
Main category: eess.IV
TL;DR: Optimized R3GAN training strategies for small medical imaging datasets can generate realistic images to address data scarcity and class imbalance, significantly improving classification performance in embryo time-lapse imaging.
Details
Motivation: Medical image analysis often suffers from data scarcity and class imbalance, limiting deep learning model effectiveness in clinical applications like human embryo time-lapse imaging.Method: Systematic experiments with R3GAN to establish effective training strategies for 256x256-resolution datasets, featuring full burn-in phase and low, gradually increasing gamma range (5 -> 40). Generated samples used to balance imbalanced embryo dataset.
Result: Substantial improvement in classification performance - recall and F1-score of t3 increased from 0.06 to 0.69 and 0.11 to 0.60 respectively, without compromising other classes.
Conclusion: Tailored R3GAN training strategies can effectively alleviate data scarcity and improve model robustness in small-scale medical imaging tasks.
Abstract: Medical image analysis often suffers from data scarcity and class imbalance, limiting the effectiveness of deep learning models in clinical applications. Using human embryo time-lapse imaging (TLI) as a case study, this work investigates how generative adversarial networks (GANs) can be optimized for small datasets to generate realistic and diagnostically meaningful images. Based on systematic experiments with R3GAN, we established effective training strategies and designed an optimized configuration for 256x256-resolution datasets, featuring a full burn-in phase and a low, gradually increasing gamma range (5 -> 40). The generated samples were used to balance an imbalanced embryo dataset, leading to substantial improvement in classification performance. The recall and F1-score of t3 increased from 0.06 to 0.69 and 0.11 to 0.60, respectively, without compromising other classes. These results demonstrate that tailored R3GAN training strategies can effectively alleviate data scarcity and improve model robustness in small-scale medical imaging tasks.
[407] Diffusion-Driven Generation of Minimally Preprocessed Brain MRI
Samuel W. Remedios, Aaron Carass, Jerry L. Prince, Blake E. Dewey
Main category: eess.IV
TL;DR: This study presents three DDPMs for generating 3D T1-weighted MRI brain images using 80,675 volumes from 42,406 subjects across 38 datasets, achieving coherent brain volume generation with velocity and flow prediction models performing better than sample prediction.
Details
Motivation: To develop and release the first 3D non-latent diffusion model for brain MRI data that preserves natural orientation variations and inhomogeneity without skullstripping or registration.Method: Trained three DDPMs on minimally preprocessed brain MRI images (no registration or bias field correction) using 80,675 volumes from 42,406 subjects across 38 datasets, evaluated via segmentation, FID, and qualitative inspection.
Result: All DDPMs generated coherent MR brain volumes, with velocity and flow prediction models achieving lower FIDs than sample prediction. However, all models had higher FIDs than real images, and generated brain regional volume distributions differed statistically from real data.
Conclusion: Despite statistical differences from real data, the presented DDPMs successfully generate high-resolution 3D T1-weighted brain images, representing the first 3D non-latent diffusion model for brain data without skullstripping or registration.
Abstract: The purpose of this study is to present and compare three denoising diffusion probabilistic models (DDPMs) that generate 3D $T_1$-weighted MRI human brain images. Three DDPMs were trained using 80,675 image volumes from 42,406 subjects spanning 38 publicly available brain MRI datasets. These images had approximately 1 mm isotropic resolution and were manually inspected by three human experts to exclude those with poor quality, field-of-view issues, and excessive pathology. The images were minimally preprocessed to preserve the visual variability of the data. Furthermore, to enable the DDPMs to produce images with natural orientation variations and inhomogeneity, the images were neither registered to a common coordinate system nor bias field corrected. Evaluations included segmentation, Frechet Inception Distance (FID), and qualitative inspection. Regarding results, all three DDPMs generated coherent MR brain volumes. The velocity and flow prediction models achieved lower FIDs than the sample prediction model. However, all three models had higher FIDs compared to real images across multiple cohorts. In a permutation experiment, the generated brain regional volume distributions differed statistically from real data. However, the velocity and flow prediction models had fewer statistically different volume distributions in the thalamus and putamen. In conclusion this work presents and releases the first 3D non-latent diffusion model for brain data without skullstripping or registration. Despite the negative results in statistical testing, the presented DDPMs are capable of generating high-resolution 3D $T_1$-weighted brain images. All model weights and corresponding inference code are publicly available at https://github.com/piksl-research/medforj .
[408] A fragile zero-watermarking method based on dual quaternion matrix decomposition
Mingcui Zhang, Zhigang Jia
Main category: eess.IV
TL;DR: Proposes a fragile zero-watermarking model using dual quaternion matrix decomposition for medical image protection, enabling copyright protection and tampering detection without modifying original images.
Details
Motivation: Medical images face serious risks of copyright infringement and content tampering during transmission and sharing, requiring effective protection methods that preserve image integrity.Method: Uses dual quaternion matrix decomposition to correlate original carrier images with watermark images through operational relationships between standard and dual parts of dual quaternions.
Result: Generates zero-watermarking information that enables copyright protection and content tampering detection for medical images without altering the original carrier.
Conclusion: The proposed fragile zero-watermarking model based on dual quaternion matrix decomposition provides an effective approach for protecting medical images against copyright issues and tampering risks.
Abstract: Medical images play a crucial role in assisting diagnosis, remote consultation, and academic research. However, during the transmission and sharing process, they face serious risks of copyright ownership and content tampering. Therefore, protecting medical images is of great importance. As an effective means of image copyright protection, zero-watermarking technology focuses on constructing watermarks without modifying the original carrier by extracting its stable features, which provides an ideal approach for protecting medical images. This paper aims to propose a fragile zero-watermarking model based on dual quaternion matrix decomposition, which utilizes the operational relationship between the standard part and the dual part of dual quaternions to correlate the original carrier image with the watermark image, and generates zero-watermarking information based on the characteristics of dual quaternion matrix decomposition, ultimately achieving copyright protection and content tampering detection for medical images.
[409] Towards robust quantitative photoacoustic tomography via learned iterative methods
Anssi Manninen, Janek Gröhl, Felix Lucka, Andreas Hauptmann
Main category: eess.IV
TL;DR: This paper proposes model-based learned iterative methods for Quantitative Photoacoustic Tomography (QPAT) that combine deep learning with physical models to achieve faster and more accurate reconstructions with limited training data.
Details
Motivation: Classical PAT reconstruction methods are computationally slow and require sufficient prior information, while deep learning methods need large training datasets which are often unavailable in practice.Method: Adopts model-based learned iterative approach for QPAT, comparing learned updates based on gradient descent, Gauss-Newton, and Quasi-Newton methods. Uses both greedy (iterate-wise optimality) and end-to-end training strategies.
Result: Methods tested with ideal simulated data and digital twin dataset that simulates scarce training data and high modeling error, showing improved generalizability with limited data.
Conclusion: Model-based learned iterative methods provide a promising approach for QPAT that balances computational efficiency with data efficiency, enabling better performance with scarce training data compared to traditional deep learning approaches.
Abstract: Photoacoustic tomography (PAT) is a medical imaging modality that can provide high-resolution tissue images based on the optical absorption. Classical reconstruction methods for quantifying the absorption coefficients rely on sufficient prior information to overcome noisy and imperfect measurements. As these methods utilize computationally expensive forward models, the computation becomes slow, limiting their potential for time-critical applications. As an alternative approach, deep learning-based reconstruction methods have been established for faster and more accurate reconstructions. However, most of these methods rely on having a large amount of training data, which is not the case in practice. In this work, we adopt the model-based learned iterative approach for the use in Quantitative PAT (QPAT), in which additional information from the model is iteratively provided to the updating networks, allowing better generalizability with scarce training data. We compare the performance of different learned updates based on gradient descent, Gauss-Newton, and Quasi-Newton methods. The learning tasks are formulated as greedy, requiring iterate-wise optimality, as well as end-to-end, where all networks are trained jointly. The implemented methods are tested with ideal simulated data as well as against a digital twin dataset that emulates scarce training data and high modeling error.
[410] Combined fluorescence and photoacoustic imaging of tozuleristide in muscle tissue in vitro – toward optically-guided solid tumor surgery: feasibility studies
Ruibo Shang, Matthew Thompson, Matthew D. Carson, Eric J. Seibel, Matthew O’Donnell, Ivan Pelivanov
Main category: eess.IV
TL;DR: Combining NIRF with fast-sweep PAUS imaging extends tumor detection depth from ~5mm to ~34mm, enabling comprehensive pre- and intra-operative guidance for wide-margin tumor excision.
Details
Motivation: Address limitations of NIRF imaging for wide-margin tumor excision due to sensitivity and resolution constraints at depths >5mm in tissue.Method: Used fast-sweep photoacoustic-ultrasound (PAUS) imaging to complement NIRF, tested with tozuleristide contrast agent in bovine muscle tissue, employing laser-fluence compensation and clutter suppression.
Result: PAUS identified tozuleristide at 20 uM concentration up to ~34mm depth, significantly extending NIRF capabilities. Spectroscopic PAUS at ~8mm depth showed improved accuracy and reduced artifacts.
Conclusion: Combined NIRF-PAUS approach is promising for comprehensive pre-operative (PA) and intra-operative (NIRF) solid tumor detection and wide-margin excision in optically guided surgery.
Abstract: Near-infrared fluorescence (NIRF) can deliver high-contrast, video-rate, non-contact imaging of tumor-targeted contrast agents with the potential to guide surgeries excising solid tumors. However, it has been met with skepticism for wide-margin excision due to sensitivity and resolution limitations at depths larger than ~5 mm in tissue. To address this limitation, fast-sweep photoacoustic-ultrasound (PAUS) imaging is proposed to complement NIRF. In an exploratory in vitro feasibility study using dark-red bovine muscle tissue, we observed that PAUS scanning can identify tozuleristide, a clinical stage investigational imaging agent, at a concentration of 20 uM from the background at depths of up to ~34 mm, highly extending the capabilities of NIRF alone. The capability of spectroscopic PAUS imaging was tested by direct injection of 20 uM tozuleristide into bovine muscle tissue at a depth of ~ 8 mm. It is shown that laser-fluence compensation and strong clutter suppression enabled by the unique capabilities of the fast-sweep approach greatly improve spectroscopic accuracy and the PA detection limit, and strongly reduce image artifacts. Thus, the combined NIRF-PAUS approach can be promising for comprehensive pre- (with PA) and intra- (with NIRF) operative solid tumor detection and wide-margin excision in optically guided solid tumor surgery.
[411] Navigated hepatic tumor resection using intraoperative ultrasound imaging
Karin Olthof, Theo Ruers, Tiziano Natali, Lisanne Venix, Jasper Smit, Anne den Hartor, Niels Kok, Matteo Fusaglia, Koert Kuhlmann
Main category: eess.IV
TL;DR: This study evaluates a registration-free ultrasound-based navigation system for liver surgery that uses intraoperative 3D models instead of preoperative imaging, achieving high accuracy and successful tumor resections.
Details
Motivation: To overcome limitations of conventional navigation systems that require registration to preoperative imaging, which can be complex and affected by organ deformation during surgery.Method: Used electromagnetic sensors to track organ motion, acquired intraoperative ultrasound volumes, automatically segmented vasculature, semi-automatically segmented tumors using region-growing or deep learning, and visualized 3D models with tracked surgical instruments.
Result: Navigation was successfully established in all 20 patients with median accuracy of 3.2 mm, and R0 resection was achieved in 93.8% of patients (15/16).
Conclusion: Intraoperative ultrasound-based navigation is feasible and accurate for liver surgery, offering a simpler registration-free approach for image guidance systems.
Abstract: Purpose: This proof-of-concept study evaluates feasibility and accuracy of an ultrasound-based navigation system for open liver surgery. Unlike most conventional systems that rely on registration to preoperative imaging, the proposed system provides navigation-guided resection using 3D models generated from intraoperative ultrasound. Methods: A pilot study was conducted in 25 patients undergoing resection of liver metastases. The first five cases served to optimize the workflow. Intraoperatively, an electromagnetic sensor compensated for organ motion, after which an ultrasound volume was acquired. Vasculature was segmented automatically and tumors semi-automatically using region-growing (n=15) or a deep learning algorithm (n=5). The resulting 3D model was visualized alongside tracked surgical instruments. Accuracy was assessed by comparing the distance between surgical clips and tumors in the navigation software with the same distance on a postoperative CT of the resected specimen. Results: Navigation was successfully established in all 20 patients. However, four cases were excluded from accuracy assessment due to intraoperative sensor detachment (n=3) or incorrect data recording (n=1). The complete navigation workflow was operational within 5-10 minutes. In 16 evaluable patients, 78 clip-to-tumor distances were analyzed. The median navigation accuracy was 3.2 mm [IQR: 2.8-4.8 mm], and an R0 resection was achieved in 15/16 (93.8%) patients and one patient had an R1 vascular resection. Conclusion: Navigation based solely on intra-operative ultrasound is feasible and accurate for liver surgery. This registration-free approach paves the way for simpler and more accurate image guidance systems.
[412] Bayesian model selection and misspecification testing in imaging inverse problems only from noisy and partial measurements
Tom Sprunck, Marcelo Pereyra, Tobias Liaudat
Main category: eess.IV
TL;DR: Proposes an unsupervised model evaluation method for Bayesian imaging using Bayesian cross-validation and data fission, enabling model selection and misspecification detection without ground truth.
Details
Motivation: Existing unsupervised model evaluation methods are unsuitable for computational imaging due to high computational cost and incompatibility with modern machine learning-based image priors.Method: Combines Bayesian cross-validation with data fission (randomized measurement splitting) to evaluate Bayesian imaging models without ground truth, compatible with modern samplers like diffusion and plug-and-play.
Result: Achieves excellent model selection and misspecification detection accuracy with low computational cost across various scoring rules and misspecification types.
Conclusion: Provides a general, computationally efficient methodology for unsupervised model evaluation in Bayesian imaging sciences that works with modern machine learning-based priors.
Abstract: Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian cross-validation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.
[413] LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation
Juntao Jiang, Mengmeng Wang, Huizhong Tian, Lingbo Cheng, Yong Liu
Main category: eess.IV
TL;DR: LV-UNet is a lightweight medical image segmentation model using MobileNetv3-Large backbone with fusible modules and re-parametrization for efficient deployment.
Details
Motivation: Address challenges in medical image segmentation including optimization complexity, transformer architecture intricacy, computational constraints, and practical deployment needs for mobile medical devices requiring lightweight, real-time models.Method: Uses pre-trained MobileNetv3-Large backbones with fusible modules, enhanced deep training strategy, and re-parametrization during inference to reduce parameters and computational overhead.
Result: Achieves better trade-off between performance and computational load across multiple datasets including ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvair-SEG.
Conclusion: LV-UNet provides an effective lightweight solution for medical image segmentation with improved robustness and deployability for mobile medical applications.
Abstract: While large models have achieved significant progress in computer vision, challenges such as optimization complexity, the intricacy of transformer architectures, computational constraints, and practical application demands highlight the importance of simpler model designs in medical image segmentation. This need is particularly pronounced in mobile medical devices, which require lightweight, deployable models with real-time performance. However, existing lightweight models often suffer from poor robustness across datasets, limiting their widespread adoption. To address these challenges, this paper introduces LV-UNet, a lightweight and vanilla model that leverages pre-trained MobileNetv3-Large backbones and incorporates fusible modules. LV-UNet employs an enhanced deep training strategy and switches to a deployment mode during inference by re-parametrization, significantly reducing parameter count and computational overhead. Experimental results on ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvair-SEG datasets demonstrate a better trade-off between performance and the computational load. The code will be released at https://github.com/juntaoJianggavin/LV-UNet.
[414] Augmented Reality-based Guidance with Deformable Registration in Head and Neck Tumor Resection
Qingyun Yang, Fangjie Li, Jiayi Xu, Zixuan Liu, Sindhura Sridhar, Whitney Jin, Jennifer Du, Jon Heiselman, Michael Miga, Michael Topf, Jie Ying Wu
Main category: eess.IV
TL;DR: A deformable registration framework improves margin relocation in HNSCC surgery by incorporating specimen thickness information, reducing target registration error by up to 33% and enabling AR-guided visualization.
Details
Motivation: HNSCC has high recurrence rates that could be reduced by better margin localization. Current frozen section analysis faces challenges in accurate margin relocation due to complex 3D anatomy and specimen shrinkage.Method: Proposed a deformable registration framework using pre-resection upper surface and post-resection site data to incorporate thickness information. Integrated with AR-based auto-alignment system for intraoperative visualization.
Result: Improved target registration error by up to 33% compared to prior methods. In pilot study with surgeons, reduced average target relocation error from 9.8 cm to 4.8 cm. Showed enhanced adaptability to thicker specimens, particularly in complex tongue anatomy.
Conclusion: The framework significantly improves margin relocation accuracy in HNSCC surgery, especially for complex tongue specimens. The AR integration enables automatic overlay of positive margin annotations onto resection sites, potentially reducing recurrence rates.
Abstract: Head and neck squamous cell carcinoma (HNSCC) has one of the highest rates of recurrence cases among solid malignancies. Recurrence rates can be reduced by improving positive margins localization. Frozen section analysis (FSA) of resected specimens is the gold standard for intraoperative margin assessment. However, because of the complex 3D anatomy and the significant shrinkage of resected specimens, accurate margin relocation from specimen back onto the resection site based on FSA results remains challenging. We propose a novel deformable registration framework that uses both the pre-resection upper surface and the post-resection site of the specimen to incorporate thickness information into the registration process. The proposed method significantly improves target registration error (TRE), demonstrating enhanced adaptability to thicker specimens. In tongue specimens, the proposed framework improved TRE by up to 33% as compared to prior deformable registration. Notably, tongue specimens exhibit complex 3D anatomies and hold the highest clinical significance compared to other head and neck specimens from the buccal and skin. We analyzed distinct deformation behaviors in different specimens, highlighting the need for tailored deformation strategies. To further aid intraoperative visualization, we also integrated this framework with an augmented reality-based auto-alignment system. The combined system can accurately and automatically overlay the deformed 3D specimen mesh with positive margin annotation onto the resection site. With a pilot study of the AR guided framework involving two surgeons, the integrated system improved the surgeons’ average target relocation error from 9.8 cm to 4.8 cm.
[415] Highly Undersampled MRI Reconstruction via a Single Posterior Sampling of Diffusion Models
Jin Liu, Qing Lin, Zhuang Xiong, Shanshan Shan, Chunyi Liu, Min Li, Feng Liu, G. Bruce Pike, Hongfu Sun, Yang Gao
Main category: eess.IV
TL;DR: SSDM-MRI is a single-step diffusion model that accelerates MRI reconstruction from highly undersampled k-space data, achieving comparable quality to iterative methods but with much faster inference time.
Details
Motivation: Previous deep learning MRI reconstruction methods degrade at high acceleration factors (8×+), and diffusion models show promise but suffer from long inference times due to iterative sampling steps.Method: Trains a conditional diffusion model and iteratively distills it four times using selective distillation algorithm with shortcut reverse sampling strategy for one-step inference.
Result: Outperforms other methods on fastMRI brain/knee datasets and QSM data in PSNR, SSIM, error maps, and fine details; reconstructs 320×320 brain slice in 0.45s.
Conclusion: SSDM-MRI provides high-quality MRI reconstruction comparable to iterative diffusion models but with dramatically reduced inference time, making it practical for clinical applications.
Abstract: Incoherent k-space undersampling and deep learning-based reconstruction methods have shown great success in accelerating MRI. However, the performance of most previous methods will degrade dramatically under high acceleration factors, e.g., 8$\times$ or higher. Recently, denoising diffusion models (DM) have demonstrated promising results in solving this issue; however, one major drawback of the DM methods is the long inference time due to a dramatic number of iterative reverse posterior sampling steps. In this work, a Single Step Diffusion Model-based reconstruction framework, namely SSDM-MRI, is proposed for restoring MRI images from highly undersampled k-space. The proposed method achieves one-step reconstruction by first training a conditional DM and then iteratively distilling this model four times using an iterative selective distillation algorithm, which works synergistically with a shortcut reverse sampling strategy for model inference. Comprehensive experiments were carried out on both publicly available fastMRI brain and knee images, as well as an in-house multi-echo GRE (QSM) subject. Overall, the results showed that SSDM-MRI outperformed other methods in terms of numerical metrics (e.g., PSNR and SSIM), error maps, image fine details, and latent susceptibility information hidden in MRI phase images. In addition, the reconstruction time for a 320$\times$320 brain slice of SSDM-MRI is only 0.45 second, which is only comparable to that of a simple U-net, making it a highly effective solution for MRI reconstruction tasks.
[416] Poisson Informed Retinex Network for Extreme Low-Light Image Enhancement
Isha Rao, Ratul Chakraborty, Sanjay Ghosh
Main category: eess.IV
TL;DR: A lightweight deep learning method for low-light image denoising and enhancement that handles Poisson noise through Retinex-based decomposition and Poisson denoising loss.
Details
Motivation: Traditional noise assumptions like Gaussian noise don't hold in real-world low-light imaging where noise is signal-dependent (Poisson noise), requiring specialized approaches for extreme low-light conditions.Method: Unified encoder-decoder network integrating Retinex-based decomposition with Poisson denoising, using Poisson denoising loss to handle signal-dependent noise without requiring prior reflectance/illumination knowledge.
Result: Method effectively enhances illumination and suppresses noise, improving visibility and brightness while preserving image structure and color constancy without color distortion.
Conclusion: The proposed approach is effective and practical for low-light illumination enhancement, successfully addressing Poisson noise in extreme low-light conditions through joint decomposition and denoising.
Abstract: Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in majority. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a light-weight deep learning-based method that integrates Retinex based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without prior requirement for reflectance and illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
[417] Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, Anandavardhana Hegde
Main category: eess.IV
TL;DR: A transformer-based multimodal framework for generating clinically relevant captions for MRI scans using vision transformer, MediCareBERT, and LSTM decoder with hybrid loss functions.
Details
Motivation: To create automated medical image reporting systems that generate clinically relevant captions for MRI scans, addressing the need for scalable and interpretable solutions in medical imaging.Method: Combines DEiT-Small vision transformer as image encoder, MediCareBERT for caption embedding, and custom LSTM-based decoder with hybrid cosine-MSE loss and contrastive inference via vector similarity for semantic alignment.
Result: Benchmarked on MultiCaRe dataset, the method shows improved caption accuracy and semantic alignment when focusing on domain-specific brain-only MRIs compared to general MRI images, outperforming state-of-the-art methods like BLIP, R2GenGPT, and other transformer approaches.
Conclusion: The proposed framework provides a scalable and interpretable solution for automated medical image reporting, demonstrating that domain-specific data focus enhances caption quality in medical imaging applications.
Abstract: We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.