Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 87]
- cs.CV [Total: 146]
- cs.AI [Total: 43]
- cs.SD [Total: 8]
- cs.LG [Total: 163]
- cs.MA [Total: 2]
- cs.MM [Total: 1]
- eess.AS [Total: 7]
- eess.IV [Total: 12]
cs.CL
[1] Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective
Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Ming-Kun Xie, Biao Liu, Changwei Wang, Lei Feng, Yuheng Jia, Gang Niu, Masashi Sugiyama, Xin Geng
Main category: cs.CL
TL;DR: The paper introduces three multi-label benchmarks for toxicity detection to address limitations of single-label approaches, and proposes a pseudo-label-based method that outperforms advanced baselines.
Details
Motivation: Current toxicity detectors rely on single-label benchmarks that fail to capture the ambiguous and multi-dimensional nature of real-world toxic prompts, leading to biased evaluations with missed detections and false positives. Gathering comprehensive multi-label annotations is also prohibitively costly.
Method: The authors introduce three multi-label benchmarks (Q-A-MLL, R-A-MLL, H-X-MLL) annotated with a 15-category taxonomy, provide theoretical proof that training with pseudo-labels outperforms single-label supervision, and develop a pseudo-label-based toxicity detection method.
Result: Extensive experiments show the proposed approach significantly surpasses advanced baselines including GPT-4o and DeepSeek, enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.
Conclusion: The multi-label benchmarks and pseudo-label-based method provide a more effective framework for toxicity detection in LLMs, addressing the limitations of single-label approaches and improving detection reliability.
Abstract: Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.
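To make the pseudo-label idea concrete, here is a minimal sketch of the kind of training step it implies: a teacher's per-category probabilities are thresholded into multi-label targets, which then supervise a student classifier with per-category binary cross-entropy. The 0.5 cutoff, embedding dimension, and random teacher scores are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

NUM_CATEGORIES = 15  # the paper's taxonomy size; all other values are illustrative

# Hypothetical teacher scores: per-category probabilities for a batch of prompts.
teacher_probs = torch.rand(8, NUM_CATEGORIES)

# Threshold the soft scores into multi-label pseudo-labels (0.5 is an assumed cutoff).
pseudo_labels = (teacher_probs > 0.5).float()

# A toy student classifier over precomputed text embeddings (dim 768 assumed).
student = nn.Linear(768, NUM_CATEGORIES)
embeddings = torch.randn(8, 768)

# Multi-label training uses an independent sigmoid per category, not a softmax.
loss = nn.BCEWithLogitsLoss()(student(embeddings), pseudo_labels)
loss.backward()
```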
[2] Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek
Enis Oğuz
Main category: cs.CL
TL;DR: Generative AI models (ChatGPT, Gemini, Deepseek) were tested for automated essay scoring, particularly evaluating their performance on essays with and without idioms. Gemini showed superior interrater reliability with human raters and best handled figurative language.
Details
Motivation: To assess Generative AI's potential as an alternative to Automated Essay Scoring (AES) systems, specifically examining its ability to process idioms and figurative language in student essays.
Method: Created two equal essay lists from 348 student essays: one with multiple idioms per essay and another with no idioms. Three Generative AI models scored all essays three times using the same human rater rubric.
Result: All models showed excellent consistency, but Gemini outperformed others in interrater reliability with human raters. No demographic bias was detected. Gemini handled essays with idioms most similarly to human raters.
Conclusion: Generative AI models demonstrate potential for hybrid essay scoring approaches, with Gemini being the best candidate due to its superior handling of figurative language and potential for standalone essay scoring in the future.
Abstract: The developments in Generative AI technologies have paved the way for numerous innovations in different fields. Recently, Generative AI has been proposed as a competitor to AES systems in evaluating student essays automatically. Considering the potential limitations of AI in processing idioms, this study assessed the scoring performances of Generative AI models for essays with and without idioms by incorporating insights from Corpus Linguistics and Computational Linguistics. Two equal essay lists were created from 348 student essays taken from a corpus: one with multiple idioms present in each essay and another with no idioms in essays. Three Generative AI models (ChatGPT, Gemini, and Deepseek) were asked to score all essays in both lists three times, using the same rubric used by human raters in assigning essay scores. The results revealed excellent consistency for all models, but Gemini outperformed its competitors in interrater reliability with human raters. There was also no detectable bias for any demographic group in AI assessment. For essays with multiple idioms, Gemini followed the most similar pattern to human raters. While the models in the study demonstrated potential for a hybrid approach, Gemini was the best candidate for the task due to its ability to handle figurative language, and it showed promise for handling essay-scoring tasks alone in the future.
[3] A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling
Shiyu Ji, Farnoosh Hashemi, Joice Chen, Juanwen Pan, Weicheng Ma, Hefan Zhang, Sophia Pan, Ming Cheng, Shubham Mohole, Saeed Hassanpour, Soroush Vosoughi, Michael Macy
Main category: cs.CL
TL;DR: A framework using LLMs to automatically generate and label synthetic debate data for rhetorical strategy analysis, achieving high performance and generalization across domains with applications in persuasiveness prediction and analyzing temporal shifts in U.S. Presidential debates.
Details
Motivation: Current rhetorical strategy analysis relies on costly, inconsistent human annotation with limited datasets that are topic-specific, hindering robust model development and scalability.
Method: Proposed framework leverages LLMs to automatically generate and label synthetic debate data based on a four-part rhetorical typology (causal, empirical, emotional, moral), then fine-tunes transformer-based classifiers on this LLM-labeled dataset.
Result: The model achieves high performance and strong generalization across topical domains, outperforming human-labeled data on multiple external corpora. Applications show improved persuasiveness prediction and reveal increased use of affective over cognitive arguments in U.S. Presidential debates (1960-2020).
Conclusion: LLM-generated synthetic data enables scalable, robust rhetorical strategy analysis with strong cross-domain generalization, providing valuable insights into persuasive communication patterns and temporal shifts in political discourse.
Abstract: Rhetorical strategies are central to persuasive communication, from political discourse and marketing to legal argumentation. However, analysis of rhetorical strategies has been limited by reliance on human annotation, which is costly, inconsistent, and difficult to scale. Their associated datasets are often limited to specific topics and strategies, posing challenges for robust model development. We propose a novel framework that leverages large language models (LLMs) to automatically generate and label synthetic debate data based on a four-part rhetorical typology (causal, empirical, emotional, moral). We fine-tune transformer-based classifiers on this LLM-labeled dataset and validate its performance against human-labeled data on this dataset and on multiple external corpora. Our model achieves high performance and strong generalization across topical domains. We illustrate two applications with the fine-tuned model: (1) the improvement in persuasiveness prediction from incorporating rhetorical strategy labels, and (2) analyzing temporal and partisan shifts in rhetorical strategies in U.S. Presidential debates (1960-2020), revealing increased use of affective over cognitive arguments.
[4] Continual Learning via Sparse Memory Finetuning
Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas Oğuz
Main category: cs.CL
TL;DR: Sparse memory finetuning enables continual learning in language models by updating only highly activated memory slots, reducing catastrophic forgetting while maintaining new knowledge acquisition.
Details
Motivation: To address catastrophic forgetting in language models where updating on new data erases previously acquired capabilities, motivated by the intuition that shared parameters across tasks make forgetting mitigation challenging.
Method: Introduces sparse memory finetuning using memory layer models that are sparsely updated by design. Updates only memory slots highly activated by new knowledge relative to pretraining usage, reducing interference between new and existing knowledge.
Result: Compared to full finetuning (89% F1 drop) and LoRA (71% F1 drop) on NaturalQuestions, sparse memory finetuning yields only 11% drop while achieving same level of new knowledge acquisition.
Conclusion: Sparsity in memory layers offers a promising path toward continual learning in large language models by enabling learning without catastrophic forgetting.
Abstract: Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new data erases previously acquired capabilities. Motivated by the intuition that mitigating forgetting is challenging because trainable parameters are shared across all tasks, we investigate whether sparse parameter updates can enable learning without catastrophic forgetting. We introduce sparse memory finetuning, leveraging memory layer models (Berges et al., 2024), which are sparsely updated by design. By updating only the memory slots that are highly activated by a new piece of knowledge relative to usage on pretraining data, we reduce interference between new knowledge and the model’s existing capabilities. We evaluate learning and forgetting compared to full finetuning and parameter-efficient finetuning with LoRA on two question answering tasks. We find that sparse memory finetuning learns new knowledge while exhibiting substantially less forgetting: while NaturalQuestions F1 drops by 89% after full finetuning on new facts and 71% with LoRA, sparse memory finetuning yields only an 11% drop with the same level of new knowledge acquisition. Our results suggest sparsity in memory layers offers a promising path toward continual learning in large language models.
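A rough sketch of the slot-selection idea follows, assuming per-slot activation counts are available for the new data and a pretraining sample; the smoothed usage ratio, the top-k size, and the gradient-masking mechanics are our illustrative reading of "updating only the memory slots that are highly activated by a new piece of knowledge relative to usage on pretraining data", not the paper's exact procedure.

```python
import torch

num_slots, top_k = 10_000, 32  # illustrative sizes

# Hypothetical usage statistics: how often each memory slot was activated.
counts_new = torch.randint(0, 50, (num_slots,)).float()          # on the new facts
counts_pretrain = torch.randint(0, 5_000, (num_slots,)).float()  # on pretraining data

# Score slots by how specific they are to the new knowledge: frequent on the
# new data, rare during pretraining (add-one smoothing avoids division by zero).
specificity = counts_new / (counts_pretrain + 1.0)
update_slots = torch.topk(specificity, top_k).indices

# Gradient masking: only the selected slots of the memory table receive updates.
memory = torch.nn.Embedding(num_slots, 256)  # 256-dim slot values, assumed
mask = torch.zeros(num_slots, 1)
mask[update_slots] = 1.0
memory.weight.register_hook(lambda grad: grad * mask)
```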
[5] Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks
Kirill Semenov, Rico Sennrich
Main category: cs.CL
TL;DR: The paper shows that using sentence-level translations instead of template translations in multilingual benchmarks like MLAMA significantly improves knowledge retrieval scores, especially for morphologically rich languages.
Details
Motivation: Existing multilingual benchmarks use template translations that ignore grammatical and semantic information of named entities, leading to ungrammatical prompts that complicate score interpretation, particularly for languages with rich morphology.
Method: Sampled 4 Slavic languages from MLAMA dataset and compared knowledge retrieval scores between original templated dataset and sentence-level translations from Google Translate and ChatGPT. Also analyzed 5 additional languages from different families.
Result: Observed significant increase in knowledge retrieval scores with sentence-level translations. Similar patterns found across different language families.
Conclusion: Researchers should control grammaticality in multilingual datasets using whole sentence translation with neural MT or LLM systems for more interpretable results.
Abstract: For multilingual factual knowledge assessment of LLMs, benchmarks such as MLAMA use template translations that do not take into account the grammatical and semantic information of the named entities inserted in the sentence. This leads to numerous instances of ungrammaticality or wrong wording of the final prompts, which complicates the interpretation of scores, especially for languages that have a rich morphological inventory. In this work, we sample 4 Slavic languages from the MLAMA dataset and compare the knowledge retrieval scores between the initial (templated) MLAMA dataset and its sentence-level translations made by Google Translate and ChatGPT. We observe a significant increase in knowledge retrieval scores, and provide a qualitative analysis for possible reasons behind it. We also make an additional analysis of 5 more languages from different families and see similar patterns. Therefore, we encourage the community to control the grammaticality of highly multilingual datasets for higher and more interpretable results, which is well approximated by whole sentence translation with neural MT or LLM systems. The dataset and all related code are published at the GitHub repository: https://github.com/ZurichNLP/Fluent-mLAMA.
[6] MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval
Qiyu Wu, Shuyang Cui, Satoshi Hayakawa, Wei-Yao Wang, Hiromi Wakaki, Yuki Mitsufuji
Main category: cs.CL
TL;DR: The paper proposes a modality composition awareness framework to address modality shortcut learning in unified encoders for multimodal retrieval, improving robustness under distribution shifts.
Details
Motivation: Unified encoders in multimodal large language models trained with conventional contrastive learning are prone to learn modality shortcuts, leading to poor robustness under distribution shifts.
Method: Proposes a modality composition awareness framework with two objectives: preference loss that enforces multimodal embeddings to outperform unimodal counterparts, and composition regularization that aligns multimodal embeddings with prototypes composed from unimodal parts.
Result: Experiments on various benchmarks show gains in out-of-distribution retrieval, demonstrating improved robustness.
Conclusion: Modality composition awareness is an effective principle for robust composed multimodal retrieval when using MLLMs as unified encoders.
Abstract: Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to content production. While separate-encoder approaches like CLIP successfully align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. However, we identify that unified encoders trained with conventional contrastive learning, while flexible and advanced, are prone to learning modality shortcuts, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as an effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.
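The two objectives can be sketched as follows, with random tensors standing in for encoder outputs; the margin value, the max over unimodal similarities, and the normalized-mean prototype are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

B, D = 16, 512  # batch and embedding size, illustrative

# Hypothetical unit-normalized embeddings from a unified encoder.
img = F.normalize(torch.randn(B, D), dim=-1)     # image-only input
txt = F.normalize(torch.randn(B, D), dim=-1)     # text-only input
multi = F.normalize(torch.randn(B, D), dim=-1)   # composed (image + text) input
target = F.normalize(torch.randn(B, D), dim=-1)  # retrieval target embeddings

# Preference loss: the composed embedding should match the target better
# than either unimodal embedding does (margin of 0.1 is an assumption).
s_multi = (multi * target).sum(-1)
s_uni = torch.maximum((img * target).sum(-1), (txt * target).sum(-1))
pref_loss = F.relu(0.1 + s_uni - s_multi).mean()

# Composition regularization: pull the composed embedding toward a prototype
# built from its unimodal parts (here simply their normalized mean).
prototype = F.normalize(img + txt, dim=-1)
comp_loss = (1.0 - (multi * prototype).sum(-1)).mean()

loss = pref_loss + comp_loss
```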
[7] Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul
Main category: cs.CL
TL;DR: The paper introduces Partial YaRN and VLAT to extend audio context windows in Large Audio-Language Models without compromising text capabilities, enabling better long-form audio understanding.
Details
Motivation: Large Audio-Language Models are limited by short audio context windows even when their text backbones support long contexts, restricting long-form audio understanding capabilities.
Method: Two approaches: 1) Partial YaRN - training-free audio-only extension method that modifies only audio token positions while preserving text positions; 2) VLAT - training strategy that extends Partial YaRN into positional augmentation by simulating diverse audio lengths during training.
Result: Partial YaRN outperforms original models across various settings, and VLAT training provides substantial improvement, achieving strong performance on long audio of unseen lengths.
Conclusion: The proposed methods effectively extend audio context windows in LALMs while preserving text capabilities, enabling robust long-form audio understanding and generalization to longer inputs than seen during training.
Abstract: Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM’s text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and the VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
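As a very rough illustration of the audio-only position treatment, the sketch below compresses position increments for audio tokens while leaving text increments at 1. This position-interpolation stand-in is a deliberate simplification: YaRN proper rescales RoPE frequencies rather than raw position ids, so treat this only as intuition for "modify only audio token positions".

```python
import torch

def partial_audio_positions(modality_mask, compression=4.0):
    """Assign position ids where only audio tokens advance at a compressed
    rate (1/compression), so text positions stay identical to the base LLM.
    A simplified stand-in for Partial YaRN's audio-only extension.
    """
    steps = torch.where(modality_mask.bool(),
                        torch.tensor(1.0 / compression),
                        torch.tensor(1.0))
    return steps.cumsum(dim=-1)

# 1 = audio token, 0 = text token: a short text prompt, then a long audio clip.
mask = torch.tensor([0, 0, 0] + [1] * 12)
print(partial_audio_positions(mask))  # text spans 1..3, audio advances in 0.25 steps
```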
[8] Latent Topic Synthesis: Leveraging LLMs for Electoral Ad Analysis
Alexander Brady, Tunazzina Islam
Main category: cs.CL
TL;DR: An end-to-end framework for automatically generating interpretable topic taxonomies from unlabeled social media political content using unsupervised clustering and LLM-based labeling, applied to Meta political ads before the 2024 US election.
Details
Motivation: Social media platforms shape political discourse but analyzing vast, evolving content is challenging. Need scalable methods to understand political messaging, polarization, and moral foundations without requiring seed sets or domain expertise.
Method: Combines unsupervised clustering with prompt-based labeling using large language models to iteratively construct topic taxonomies. Applied to Meta political ads corpus from month before 2024 US Presidential election.
Result: Uncovered latent discourse structures: voting/immigration ads dominate spending/impressions; abortion/election-integrity achieve disproportionate reach. Found polarized funding patterns and moral framing differences across topics. Demographic targeting patterns emerged.
Conclusion: Framework enables scalable, interpretable analysis of political messaging on social media, helping researchers, policymakers, and public understand emerging narratives, polarization dynamics, and moral underpinnings of digital political communication.
Abstract: Social media platforms play a pivotal role in shaping political discourse, but analyzing their vast and rapidly evolving content remains a major challenge. We introduce an end-to-end framework for automatically generating an interpretable topic taxonomy from an unlabeled corpus. By combining unsupervised clustering with prompt-based labeling, our method leverages large language models (LLMs) to iteratively construct a taxonomy without requiring seed sets or domain expertise. We apply this framework to a large corpus of Meta (previously known as Facebook) political ads from the month ahead of the 2024 U.S. Presidential election. Our approach uncovers latent discourse structures, synthesizes semantically rich topic labels, and annotates topics with moral framing dimensions. We show quantitative and qualitative analyses to demonstrate the effectiveness of our framework. Our findings reveal that voting and immigration ads dominate overall spending and impressions, while abortion and election-integrity achieve disproportionate reach. Funding patterns are equally polarized: economic appeals are driven mainly by conservative PACs, abortion messaging splits between pro- and anti-rights coalitions, and crime-and-justice campaigns are fragmented across local committees. The framing of these appeals also diverges: abortion ads emphasize liberty/oppression rhetoric, while economic messaging blends care/harm, fairness/cheating, and liberty/oppression narratives. Topic salience further reveals strong correlations between moral foundations and issues. Demographic targeting also emerges. This work supports scalable, interpretable analysis of political messaging on social media, enabling researchers, policymakers, and the public to better understand emerging narratives, polarization dynamics, and the moral underpinnings of digital political communication.
[9] FarsiMCQGen: a Persian Multiple-choice Question Generation Framework
Mohammad Heydari Rad, Rezvan Afari, Saeedeh Momtazi
Main category: cs.CL
TL;DR: FarsiMCQGen is a novel approach for generating Persian-language multiple-choice questions using transformers, knowledge graphs, and rule-based methods to create realistic distractors, along with a new dataset of 10,289 Persian MCQs.
Details
Motivation: Generating high-quality multiple-choice questions in low-resource languages like Persian remains challenging, despite MCQs being widely used in educational testing for efficient knowledge evaluation.
Method: Combines candidate generation, filtering, and ranking techniques using Transformers and knowledge graphs integrated with rule-based approaches to create credible distractors. Uses Wikipedia data for general knowledge questions.
Result: Created a novel Persian MCQ dataset of 10,289 questions and demonstrated the effectiveness of the FarsiMCQGen model through evaluation by state-of-the-art large language models.
Conclusion: The model effectively generates Persian MCQs and the dataset provides a valuable resource that can inspire further research on multiple-choice question generation in low-resource languages.
Abstract: Multiple-choice questions (MCQs) are commonly used in educational testing, as they offer an efficient means of evaluating learners’ knowledge. However, generating high-quality MCQs, particularly in low-resource languages such as Persian, remains a significant challenge. This paper introduces FarsiMCQGen, an innovative approach for generating Persian-language MCQs. Our methodology combines candidate generation, filtering, and ranking techniques to build a model that generates answer choices resembling those in real MCQs. We leverage advanced methods, including Transformers and knowledge graphs, integrated with rule-based approaches to craft credible distractors that challenge test-takers. Our work is based on data from Wikipedia, which includes general knowledge questions. Furthermore, this study introduces a novel Persian MCQ dataset comprising 10,289 questions. This dataset is evaluated by different state-of-the-art large language models (LLMs). Our results demonstrate the effectiveness of our model and the quality of the generated dataset, which has the potential to inspire further research on MCQs.
[10] Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning
Junlin Wu, Xianrui Zhong, Jiashuo Sun, Bolian Li, Bowen Jin, Jiawei Han, Qingkai Zeng
Main category: cs.CL
TL;DR: Structure-R1 is a framework that transforms retrieved content into structured representations using reinforcement learning to enhance reasoning in LLMs, achieving competitive performance with smaller models.
Details
Motivation: Traditional RAG systems use unstructured text with low information density, limiting reasoning capabilities. Structure-R1 aims to overcome this by creating structured representations optimized for reasoning.
Method: Uses reinforcement learning to learn a content representation policy that dynamically generates task-specific structural formats. Includes self-reward structural verification to ensure quality and reliability of generated structures.
Result: Achieves competitive performance on seven knowledge-intensive benchmarks with a 7B-scale model, matching larger models’ performance. Theoretical analysis shows improved information density and contextual clarity.
Conclusion: Structure-R1 demonstrates that structured representations significantly enhance reasoning capabilities in LLMs, providing an effective alternative to scaling model size.
Abstract: Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose Structure-R1, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, Structure-R1 learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: https://github.com/jlwu002/sr1.
[11] Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning
Lina Berrayana, Ahmed Heakl, Muhammad Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen
Main category: cs.CL
TL;DR: Hybrid architectures combining discrete diffusion language models (DDLMs) with autoregressive models (ARMs) achieve better accuracy and computational efficiency by shifting communication from text space to latent space.
Details
Motivation: Current autoregressive models are computationally expensive due to long token sequences, while DDLMs offer parallel generation but have text-generation limitations. The study explores whether combining both models can yield complementary benefits.
Method: Two collaboration approaches: 1) Text-space collaboration where DDLM plans reasoning and ARM executes answers; 2) Latent-space communication using a learned projector to map DDLM latents into ARM’s embedding space.
Result: Latent-space communication significantly improves accuracy (27.0% to 54.0% on DART-5, 0.0% to 14.0% on AIME24). Hybrid pipeline with 64 planning tokens and ~5 execution tokens outperforms Qwen3.1-7B despite using 44x fewer tokens.
Conclusion: DDLM-ARM hybrid architectures provide substantial computational savings with minimal accuracy impact, offering new insights for reasoning tasks and highlighting DDLMs’ potential in collaborative systems.
Abstract: Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM’s embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM → ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.
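A minimal sketch of the latent-space bridge: a small learned projector maps DDLM planning latents into the ARM's embedding space, where they can be prepended to the executor's token embeddings. The MLP shape, GELU activation, and hidden sizes are assumptions, not the paper's exact projector.

```python
import torch
import torch.nn as nn

ddlm_dim, arm_dim = 1024, 4096  # hidden sizes, illustrative

# A small MLP mapping DDLM planning latents into the ARM's embedding space;
# the two-layer design is an assumption about the learned projector.
projector = nn.Sequential(
    nn.Linear(ddlm_dim, arm_dim),
    nn.GELU(),
    nn.Linear(arm_dim, arm_dim),
)

plan_latents = torch.randn(1, 64, ddlm_dim)   # 64 planning tokens from the DDLM
soft_prompt = projector(plan_latents)          # shape (1, 64, arm_dim)

# The projected plan would be concatenated with the ARM's token embeddings and
# consumed via an `inputs_embeds`-style interface during execution.
question_embeds = torch.randn(1, 20, arm_dim)  # hypothetical embedded question
arm_inputs = torch.cat([soft_prompt, question_embeds], dim=1)
```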
[12] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, Jia-Wang Bian, Mingming Gong
Main category: cs.CL
TL;DR: This paper presents a systematic survey of Multimodal RAG for document understanding, proposing a taxonomy and reviewing advances in graph structures and agentic frameworks while highlighting open challenges.
Details
Motivation: Current document understanding approaches have limitations: OCR-based pipelines lose structural detail, while native MLLMs struggle with context modeling. Documents' multimodal nature requires a more advanced paradigm that can handle text, tables, charts, and layout together.
Method: The paper conducts a systematic survey of Multimodal RAG approaches, proposing a taxonomy based on domain, retrieval modality, and granularity. It reviews advances involving graph structures and agentic frameworks.
Result: The survey summarizes key datasets, benchmarks, and applications of Multimodal RAG for document understanding, providing a comprehensive overview of the field.
Conclusion: The paper identifies open challenges in efficiency, fine-grained representation, and robustness, and provides a roadmap for future progress in document AI through Multimodal RAG approaches.
Abstract: Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents’ multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.
[13] TraceCoder: Towards Traceable ICD Coding via Multi-Source Knowledge Integration
Mucheng Ren, He Chen, Yuchen Yan, Danqing Hu, Jun Xu, Xian Zeng
Main category: cs.CL
TL;DR: TraceCoder is a novel framework that integrates multi-source external knowledge (UMLS, Wikipedia, LLMs) to improve automated ICD coding by addressing semantic gaps, poor rare code performance, and limited interpretability through dynamic knowledge incorporation and hybrid attention mechanisms.
Details
Motivation: Existing automated ICD coding methods face challenges with semantic gaps between clinical text and ICD codes, poor performance on rare/long-tail codes, and limited interpretability, which TraceCoder aims to address.
Method: TraceCoder dynamically incorporates diverse knowledge sources (UMLS, Wikipedia, LLMs) to enrich code representations and bridge semantic gaps. It uses a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge for improved long-tail code recognition and interpretable predictions.
Result: Experiments on MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets show TraceCoder achieves state-of-the-art performance, with ablation studies validating the effectiveness of its components.
Conclusion: TraceCoder provides a scalable and robust solution for automated ICD coding that aligns with clinical needs for accuracy, interpretability, and reliability.
Abstract: Automated International Classification of Diseases (ICD) coding assigns standardized diagnosis and procedure codes to clinical records, playing a critical role in healthcare systems. However, existing methods face challenges such as semantic gaps between clinical text and ICD codes, poor performance on rare and long-tail codes, and limited interpretability. To address these issues, we propose TraceCoder, a novel framework integrating multi-source external knowledge to enhance traceability and explainability in ICD coding. TraceCoder dynamically incorporates diverse knowledge sources, including UMLS, Wikipedia, and large language models (LLMs), to enrich code representations, bridge semantic gaps, and handle rare and ambiguous codes. It also introduces a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge, improving long-tail code recognition and making predictions interpretable by grounding them in external evidence. Experiments on MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets demonstrate that TraceCoder achieves state-of-the-art performance, with ablation studies validating the effectiveness of its components. TraceCoder offers a scalable and robust solution for automated ICD coding, aligning with clinical needs for accuracy, interpretability, and reliability.
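One way to picture the hybrid attention is label queries attending jointly over the clinical text and the external knowledge, as in the sketch below; the single MultiheadAttention layer, the dot-product scoring, and all dimensions are simplifying assumptions, not TraceCoder's exact mechanism.

```python
import torch
import torch.nn as nn

L, T, K, D = 50, 128, 10, 256  # labels, text tokens, knowledge snippets, dim

label_queries = torch.randn(1, L, D)   # knowledge-enriched ICD code embeddings
clinical_text = torch.randn(1, T, D)   # encoded clinical note
knowledge = torch.randn(1, K, D)       # encoded UMLS/Wikipedia/LLM evidence

# Each label attends over clinical context and external knowledge together,
# so its attention weights indicate which evidence grounded the prediction.
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
memory = torch.cat([clinical_text, knowledge], dim=1)
label_ctx, weights = attn(label_queries, memory, memory)

# Per-label logit; sigmoid (not softmax) because ICD coding is multi-label.
logits = (label_ctx * label_queries).sum(-1)
```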
[14] Summarizing Speech: A Comprehensive Survey
Fabian Retkowski, Maike Züfle, Andreas Sudmann, Dinah Pfau, Shinji Watanabe, Jan Niehues, Alexander Waibel
Main category: cs.CL
TL;DR: This survey paper examines speech summarization, highlighting its growing importance but loose definition, and reviews datasets, evaluation protocols, and recent advances from traditional to modern approaches.
Details
Motivation: Speech summarization is essential for managing growing spoken/audiovisual content, but remains loosely defined and intersects with multiple research areas, requiring systematic examination.
Method: The survey examines existing datasets and evaluation protocols, and synthesizes recent developments in the field including traditional systems, fine-tuned cascaded architectures, and end-to-end solutions.
Result: The paper surfaces ongoing challenges including the need for realistic evaluation benchmarks, multilingual datasets, and improved long-context handling capabilities.
Conclusion: Speech summarization is an evolving field that requires better definition, standardized evaluation, and solutions for multilingual and long-context scenarios to advance the technology.
Abstract: Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization remains loosely defined. The field intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation protocols, which are crucial for assessing the quality of summarization approaches, but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions. In doing so, we surface the ongoing challenges, such as the need for realistic evaluation benchmarks, multilingual datasets, and long-context handling.
[15] TACL: Threshold-Adaptive Curriculum Learning Strategy for Enhancing Medical Text Understanding
Mucheng Ren, Yucheng Yan, He Chen, Danqing Hu, Jun Xu, Xian Zeng
Main category: cs.CL
TL;DR: TACL is a novel curriculum learning framework that dynamically adjusts training based on medical text complexity, improving performance on clinical NLP tasks like ICD coding and readmission prediction across multiple languages.
Details
Motivation: Medical texts like EMRs are unstructured and domain-specific, making automated understanding challenging. Existing methods treat all data equally, ignoring complexity differences that limit model generalization on rare or complex cases.
Method: TACL uses threshold-adaptive curriculum learning to categorize data by difficulty levels, prioritizing simpler cases early in training to build strong foundations before tackling complex records. Applied to multilingual medical data including English and Chinese clinical records.
Result: Significant improvements observed across diverse clinical tasks including automatic ICD coding, readmission prediction, and TCM syndrome differentiation. Enhanced performance of automated systems and demonstrated potential to unify approaches across medical domains.
Conclusion: TACL paves the way for more accurate, scalable, and globally applicable medical text understanding solutions by addressing complexity variations in clinical records through adaptive curriculum learning.
Abstract: Medical texts, particularly electronic medical records (EMRs), are a cornerstone of modern healthcare, capturing critical information about patient care, diagnoses, and treatments. These texts hold immense potential for advancing clinical decision-making and healthcare analytics. However, their unstructured nature, domain-specific language, and variability across contexts make automated understanding an intricate challenge. Despite the advancements in natural language processing, existing methods often treat all data as equally challenging, ignoring the inherent differences in complexity across clinical records. This oversight limits the ability of models to effectively generalize and perform well on rare or complex cases. In this paper, we present TACL (Threshold-Adaptive Curriculum Learning), a novel framework designed to address these challenges by rethinking how models interact with medical texts during training. Inspired by the principle of progressive learning, TACL dynamically adjusts the training process based on the complexity of individual samples. By categorizing data into difficulty levels and prioritizing simpler cases early in training, the model builds a strong foundation before tackling more complex records. By applying TACL to multilingual medical data, including English and Chinese clinical records, we observe significant improvements across diverse clinical tasks, including automatic ICD coding, readmission prediction and TCM syndrome differentiation. TACL not only enhances the performance of automated systems but also demonstrates the potential to unify approaches across disparate medical domains, paving the way for more accurate, scalable, and globally applicable medical text understanding solutions.
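The progressive scheme can be sketched with a simple rising-threshold batch sampler; the precomputed difficulty scores and the linear ramp are illustrative, whereas TACL adapts its threshold dynamically during training rather than on a fixed schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
difficulty = rng.random(1_000)  # per-sample difficulty in [0, 1], assumed precomputed

def curriculum_batches(difficulty, epochs=5, batch_size=32):
    """Yield batches whose difficulty stays under a rising threshold.

    The linear threshold schedule is an illustrative choice; TACL adapts
    its threshold rather than ramping it on a fixed timetable.
    """
    for epoch in range(epochs):
        threshold = (epoch + 1) / epochs          # admit harder samples over time
        admitted = np.where(difficulty <= threshold)[0]
        rng.shuffle(admitted)
        for start in range(0, len(admitted), batch_size):
            yield epoch, admitted[start:start + batch_size]

for epoch, batch in curriculum_batches(difficulty):
    pass  # train_step(batch) would go here
```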
[16] Exemplar-Guided Planning: Enhanced LLM Agent for KGQA
Jingao Xu, Shuoyoucheng Ma, Xin Song, Rong Jiang, Hongkui Tu, Bin Zhou
Main category: cs.CL
TL;DR: EGP enhances LLM agents for KGQA by using exemplar-guided planning with entity templating, semantic retrieval, and smart lookahead to bridge semantic gaps and improve reasoning efficiency.
Details
Motivation: LLMs struggle with semantic gaps between natural language queries and KG representations, leading to suboptimal planning and inefficient exploration, while training-free approaches underutilize reasoning patterns in training data.
Method: EGP preprocesses training questions via entity templating, retrieves similar exemplars using semantic embeddings and FAISS index, then guides LLM planning through task decomposition and relation exploration with smart lookahead mechanism.
Result: PoG-EGP significantly improves over baseline PoG system and other methods on WebQSP and CWQ datasets.
Conclusion: EGP framework effectively enhances LLM planning capabilities for KGQA by leveraging exemplar guidance and smart exploration strategies.
Abstract: Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration on KG, while training-free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar-Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM’s planning process in two key phases: (1) Task Decomposition, by aligning generated sub-objectives with proven reasoning steps, and (2) Relation Exploration, by providing high-quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan-on-Graph (PoG) framework, termed PoG-EGP. Extensive experiments on two real-world KGQA datasets, WebQSP and CWQ, demonstrate that PoG-EGP significantly improves over the baseline PoG system and other compared methods.
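The retrieval step is straightforward to sketch with FAISS, using random unit vectors in place of a real sentence-embedding model; the placeholder-style entity templating shown is an assumption about the preprocessing format, not the paper's exact scheme.

```python
import faiss
import numpy as np

# Hypothetical training questions after entity templating, e.g.
# "Who directed [FILM]?" instead of "Who directed Inception?".
templated_questions = [
    "who directed [FILM]",
    "what team does [PERSON] play for",
    "in which country is [CITY]",
]

# Stand-in embeddings (random, unit-normalized); a real pipeline would use a
# sentence-embedding model here.
dim = 384
emb = np.random.rand(len(templated_questions), dim).astype("float32")
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Inner product over normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(dim)
index.add(emb)

# Retrieve the top-2 most similar exemplars for a new templated query.
query = emb[:1]  # reusing a stored vector as a stand-in query
scores, ids = index.search(query, 2)
print(ids[0])  # indices of exemplars whose reasoning paths would guide the LLM
```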
[17] Automatic essay scoring: leveraging Jaccard coefficient and Cosine similarity with n-gram variation in vector space model approach
Andharini Dwi Cahyani, Moh. Wildan Fathoni, Fika Hastarita Rachman, Ari Basuki, Salman Amin, Bain Khusnul Khotimah
Main category: cs.CL
TL;DR: This study compares Jaccard coefficient and Cosine similarity metrics for automated essay scoring using n-gram vector space models, finding that Cosine similarity with unigrams performs best.
Details
Motivation: To provide efficient and accurate automated essay scoring tools for evaluating written content, particularly for citizenship education essays in junior high schools.
Method: Used vector space models with unigram, bigram, and trigram representations, preprocessed essays, extracted features using n-gram models, vectorized text data, and computed similarity scores using Jaccard coefficient and Cosine similarity.
Result: Cosine similarity outperformed Jaccard coefficient, and unigrams achieved lower RMSE compared to bigrams and trigrams when measuring the difference between human grader scores and system-generated scores.
Conclusion: Cosine similarity with unigram representations is the most effective approach for automated essay scoring in this context.
Abstract: Automated essay scoring (AES) is a vital area of research aiming to provide efficient and accurate assessment tools for evaluating written content. This study investigates the effectiveness of two popular similarity metrics, the Jaccard coefficient and Cosine similarity, within the context of vector space models (VSM) employing unigram, bigram, and trigram representations. The data used in this research was obtained from the formative essay of the citizenship education subject in a junior high school. Each essay undergoes preprocessing to extract features using n-gram models, followed by vectorization to transform text data into numerical representations. Then, similarity scores are computed between essays using both the Jaccard coefficient and Cosine similarity. The performance of the system is evaluated by analyzing the root mean square error (RMSE), which measures the difference between the scores given by human graders and those generated by the system. The results show that Cosine similarity outperformed the Jaccard coefficient. In terms of n-grams, unigrams have lower RMSE compared to bigrams and trigrams.
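Both metrics are easy to state exactly; the sketch below computes them over word n-grams and then the RMSE against human scores. The example sentences and the human grades are invented purely for illustration.

```python
import math
from collections import Counter

def ngrams(text, n=1):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def jaccard(a, b, n=1):
    # Set overlap of n-grams: |A ∩ B| / |A ∪ B|.
    sa, sb = set(ngrams(a, n)), set(ngrams(b, n))
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def cosine(a, b, n=1):
    # Cosine of n-gram count vectors.
    va, vb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(va[g] * vb[g] for g in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

reference = "citizens have both rights and obligations under the law"
answers = ["citizens have rights and obligations", "the law is important"]
human_scores = [0.9, 0.3]  # invented human grades on a 0-1 scale

system_scores = [cosine(reference, a, n=1) for a in answers]
rmse = math.sqrt(sum((h - s) ** 2 for h, s in zip(human_scores, system_scores))
                 / len(human_scores))
print(system_scores, rmse)
```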
[18] Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination
Zhiyang Chen, Daliang Xu, Haiyang Shen, Mengwei Xu, Shangguang Wang, Yun Ma
Main category: cs.CL
TL;DR: CoordGen is a mobile inference framework that accelerates context-aware text generation on mobile devices through speculative decoding and dynamic hardware scheduling, achieving up to 3.8x speedup and 4.7x energy efficiency improvements.
Details
Motivation: On-device LLMs with local context enable personalized applications, but token-by-token generation suffers from high latency and limited hardware utilization due to memory-bound characteristics, despite improved prefill efficiency from neural processors.
Method: CoordGen integrates speculative decoding with dynamic hardware scheduling through three components: adaptive execution scheduling (balancing compute graphs), context-aligned drafting (lightweight online calibration), and hardware-efficient draft extension (reusing intermediate sequences).
Result: Experiments on multiple smartphones and workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared to existing mobile inference solutions.
Conclusion: CoordGen effectively accelerates context-aware text generation on mobile devices through synergistic optimization components, demonstrating significant performance and efficiency gains for on-device LLM applications.
Abstract: Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents CoordGen, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.
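CoordGen's drafting builds on speculative decoding; the sketch below shows the generic greedy draft-and-verify loop that such systems extend, with hypothetical `draft_next`/`target_next` callables standing in for the two models. Real implementations verify an entire draft block with one batched target forward pass and use rejection sampling rather than greedy agreement.

```python
def speculative_generate(prompt, draft_next, target_next, n_draft=4, max_len=64):
    """Greedy draft-and-verify loop (the simplest speculative-decoding variant).

    `draft_next(seq)` and `target_next(seq)` are hypothetical callables that
    return each model's greedy next token for the given sequence.
    """
    seq = list(prompt)
    while len(seq) < max_len:
        # 1) The cheap draft model proposes a short continuation.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(seq + draft))
        # 2) The target model verifies the draft token by token, keeping the
        #    longest agreeing prefix plus its own correction on a mismatch.
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # correction: the target model wins
                break
        seq += accepted
    return seq

# Toy models over integer "tokens": the draft mostly agrees with the target.
target = lambda s: (s[-1] + 1) % 100
draft = lambda s: (s[-1] + 1) % 100 if len(s) % 7 else 0
print(speculative_generate([1], draft, target, max_len=16))
```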
[19] Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry
Bolei Ma, Yina Yao, Anna-Carolina Haensch
Main category: cs.CL
TL;DR: A three-step evaluation framework reveals systematic biases in LLMs for classical Chinese poetry generation and evaluation, showing ’echo chamber’ effects where models converge on flawed standards that diverge from human judgments.
Details
Motivation: To understand LLM performance in classical Chinese poetry generation and evaluation, as current capabilities in creative domains remain poorly understood despite increasing applications.
Method: Proposed a three-step evaluation framework combining computational metrics, LLM-as-a-judge assessment, and human expert validation to evaluate six state-of-the-art LLMs across multiple poetic quality dimensions.
Result: LLMs exhibit systematic generation and evaluation biases with ’echo chamber’ effects, where models converge on flawed standards that diverge from human expert judgments.
Conclusion: Current LLMs have both potential and limitations as proxies for literary generation, demonstrating the continued need for hybrid human-model validation in culturally and technically complex creative tasks.
Abstract: Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit “echo chamber” effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and the limitations of current LLMs as proxies for literary generation, as well as the limits of current evaluation practices, thereby demonstrating the continued need for hybrid validation from both humans and models in culturally and technically complex creative tasks.
[20] AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Hong Ting Tsang, Jiaxin Bai, Haoyu Huang, Qiao Xiao, Tianshi Zheng, Baixuan Xu, Shujie Liu, Yangqiu Song
Main category: cs.CL
TL;DR: AutoGraph-R1 is the first framework that uses Reinforcement Learning to optimize knowledge graph construction for RAG-based question answering systems, bridging the gap between KG construction and downstream task performance.
Details
Motivation: Current knowledge graph construction for RAG systems is decoupled from downstream applications, resulting in suboptimal graph structures that don't effectively support question answering tasks.
Method: AutoGraph-R1 trains an LLM constructor using Reinforcement Learning, framing graph generation as a policy learning problem with task-aware reward functions derived from the graph’s utility in RAG pipelines.
Result: Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over task-agnostic baseline graphs.
Conclusion: The framework successfully closes the loop between KG construction and application, shifting from building intrinsically ‘good’ graphs to building demonstrably ‘useful’ ones for specific tasks.
Abstract: Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, its effectiveness is hindered by a fundamental disconnect: the knowledge graph (KG) construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph’s functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building intrinsically “good” graphs to building demonstrably “useful” ones.
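The task-aware reward can be sketched as a direct measure of downstream utility: run a (hypothetical) graph-RAG pipeline over the constructed graph and reward the constructor with its QA accuracy. The exact-match scoring and the `rag_answer` stub are assumptions, not the paper's reward functions.

```python
def rag_reward(graph, qa_pairs, rag_answer):
    """Reward a constructed KG by its functional utility in a RAG pipeline:
    the fraction of QA pairs answered correctly when retrieving from `graph`.
    `rag_answer(graph, question)` is a hypothetical pipeline callable.
    """
    correct = sum(
        rag_answer(graph, q).strip().lower() == gold.strip().lower()
        for q, gold in qa_pairs
    )
    return correct / len(qa_pairs)

# Toy illustration with a one-triple graph and a stubbed pipeline.
graph = {("Paris", "capital_of", "France")}
pipeline = lambda g, q: "Paris" if "capital of France" in q else "unknown"
print(rag_reward(graph, [("What is the capital of France?", "Paris")], pipeline))
```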
[21] Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics
Catarina G Belem, Parker Glenn, Alfy Samuel, Anoop Kumar, Daben Liu
Main category: cs.CL
TL;DR: Model-based readability metrics outperform traditional surface-level metrics by better aligning with human judgments, with information content and topic being key factors in readability perception.
Details
Motivation: Current readability assessment is hindered by inconsistent definitions and reliance on surface-level text properties, creating a mismatch with human perceptions of readability.
Method: Analyzed 897 human judgments to identify readability factors, then evaluated 15 traditional readability metrics against 6 model-based metrics across 5 English datasets.
Result: Model-based metrics consistently ranked top 4 in correlation with human judgments, while the best traditional metric averaged rank 8.6, showing significant performance gap.
Conclusion: Model-based approaches represent a more promising direction for readability assessment as they better capture the nuanced factors (information content and topic) that shape human readability perceptions.
Abstract: Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.
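Rank correlation against human judgments is the comparison the paper reports; a minimal version with simulated scores looks like this (the model-based stand-in is deliberately constructed to track the human ratings, so the numbers are illustrative only).

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human = rng.random(100)  # hypothetical human readability judgments

# Stand-ins for one surface-level metric and one model-based metric: the
# model-based scores are simulated to follow human judgments more closely.
surface_metric = rng.random(100)
model_metric = human + 0.2 * rng.standard_normal(100)

for name, scores in [("surface", surface_metric), ("model-based", model_metric)]:
    rho, _ = spearmanr(human, scores)
    print(f"{name}: Spearman rho = {rho:.2f}")
```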
[22] When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang
Main category: cs.CL
TL;DR: SAFE is a selective ensembling framework for LLMs in long-form generation that addresses tokenization mismatches and probability distribution consensus, achieving better performance by ensembling fewer than 1% of tokens.
Details
Motivation: Existing ensemble methods work well for short-form answers but degrade performance in long-form generation when applied at every token, requiring careful selection of ensembling positions.
Method: Proposes SAFE framework that selectively ensembles based on tokenization mismatch and consensus in probability distributions, with probability sharpening to consolidate probabilities across sub-word tokens.
Result: Outperforms existing methods on benchmarks like MATH500 and BBH in both accuracy and efficiency, achieving gains with minimal ensembling (fewer than 1% of tokens).
Conclusion: Selective ensembling with proper position selection and probability consolidation significantly improves long-form generation performance while maintaining efficiency.
Abstract: Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models’ next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE, (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
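A rough sketch of the two selection factors and the sharpening step, assuming access to each model's next-token distribution over a shared reference vocabulary (names and the 0.9 threshold are illustrative, not the paper's exact criteria):

```python
import torch

def should_ensemble(next_piece_a, next_piece_b, probs_a, probs_b,
                    consensus_threshold=0.9):
    """Ensemble at this position only if (1) the models' tokenizations align
    and (2) their next-token distributions genuinely disagree."""
    # Factor 1: tokenization mismatch -- if the two models would segment the
    # upcoming word differently, token-level averaging is unsafe here.
    if next_piece_a != next_piece_b:
        return False
    # Factor 2: consensus -- overlap of the two distributions, in [0, 1].
    agreement = torch.minimum(probs_a, probs_b).sum().item()
    return agreement < consensus_threshold

def sharpen(probs, word_groups):
    """Probability sharpening: pool mass spread over sub-word tokens spelling
    the same word onto one representative token, then renormalize."""
    sharpened = probs.clone()
    for rep, members in word_groups.items():  # {rep_token_id: [member_ids]}
        sharpened[rep] = probs[members].sum()
        for m in members:
            if m != rep:
                sharpened[m] = 0.0
    return sharpened / sharpened.sum()
```

The overlap statistic `torch.minimum(p, q).sum()` equals 1 when the distributions coincide and shrinks with disagreement, so ensembling is reserved for positions where the models genuinely differ.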
[23] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi
Main category: cs.CL
TL;DR: LayoutRL is a reinforcement learning framework for document parsing that uses composite rewards to optimize layout understanding, trained on the Infinity-Doc-400K dataset to create Infinity-Parser, which achieves state-of-the-art performance across diverse document types.
Details
Motivation: Document parsing faces challenges with complex layouts and poor generalization across document types due to limited training data and the inability of supervised methods to handle out-of-distribution data effectively.
Method: Uses reinforcement learning with composite rewards (normalized edit distance, paragraph count accuracy, reading order preservation) and trains on the Infinity-Doc-400K dataset to develop the Infinity-Parser vision-language model.
Result: Infinity-Parser achieves state-of-the-art performance on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet, outperforming both specialized document parsing systems and general-purpose vision-language models.
Conclusion: The LayoutRL framework with composite rewards enables robust document parsing across diverse domains, and the Infinity-Parser model demonstrates strong generalization capabilities, with code, dataset, and model to be released for reproducible research.
Abstract: Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
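The composite reward admits a compact sketch; component definitions and weights below are assumptions rather than the paper's exact formulation:

```python
from difflib import SequenceMatcher

def layout_reward(pred_text, gold_text, pred_blocks, gold_blocks,
                  w_edit=0.5, w_para=0.2, w_order=0.3):
    """Composite reward: normalized edit similarity + paragraph-count
    accuracy + reading-order preservation (hypothetical weighting)."""
    # 1) Normalized edit similarity (1 - normalized edit distance).
    edit_sim = SequenceMatcher(None, pred_text, gold_text).ratio()
    # 2) Paragraph-count accuracy: penalize relative miscounts.
    para_acc = max(0.0, 1.0 - abs(len(pred_blocks) - len(gold_blocks))
                   / max(len(gold_blocks), 1))
    # 3) Reading order: fraction of block pairs emitted in the gold order.
    common = [b for b in gold_blocks if b in pred_blocks]
    pairs = [(a, b) for i, a in enumerate(common) for b in common[i + 1:]]
    kept = sum(pred_blocks.index(a) < pred_blocks.index(b) for a, b in pairs)
    order = kept / len(pairs) if pairs else 1.0
    return w_edit * edit_sim + w_para * para_acc + w_order * order
```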
[24] VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency
Hongcheng Liu, Yixuan Hou, Heyang Liu, Yuhao Wang, Yanfeng Wang, Yu Wang
Main category: cs.CL
TL;DR: Speech-LLMs show significant performance degradation when handling speech disfluencies, particularly from users with conditions like Parkinson’s disease, revealing limitations in real-world readiness.
Details
Motivation: Current Speech-LLM evaluations rely on idealized inputs and overlook common disfluencies, especially those associated with speech impairments, raising concerns about their practical usability for diverse user populations.
Method: Introduced VocalBench-DF framework for systematic evaluation of disfluency across a multi-dimensional taxonomy, and evaluated 22 mainstream Speech-LLMs to identify performance bottlenecks.
Result: Evaluation revealed substantial performance degradation in Speech-LLMs when handling disfluent speech, with phoneme-level processing and long-context modeling identified as primary bottlenecks.
Conclusion: There is an urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs, with strengthening recognition and reasoning capabilities showing promise for substantial robustness improvements.
Abstract: While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson’s disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capability from components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.
[25] Large-scale User Game Lifecycle Representation Learning
Yanjie Gou, Jiangming Liu, Kouying Xue, Yi Hua
Main category: cs.CL
TL;DR: The paper addresses challenges in game advertising and recommendation by introducing User Game Lifecycle (UGL) to handle game sparsity and imbalance, with strategies for extracting user interests and inverse probability masking.
Details
Motivation: Existing recommendation methods are unsuitable for game platforms due to game sparsity (few games) and imbalance (dominance of popular games), requiring specialized approaches for effective advertising and recommendations.
Method: Proposed User Game Lifecycle (UGL) to enrich user behaviors, with strategies for extracting short/long-term interests and Inverse Probability Masking to handle game imbalance in representation learning.
Result: UGL representations achieved significant improvements: 1.83% AUC offline and 21.67% CVR online increase for game advertising; 0.5% AUC offline and 0.82% ARPU online increase for in-game item recommendation.
Conclusion: The proposed UGL framework effectively addresses game sparsity and imbalance challenges, demonstrating substantial performance improvements in both game advertising and in-game item recommendation tasks.
Abstract: The rapid expansion of video game production necessitates the development of effective advertising and recommendation systems for online game platforms. Recommending and advertising games to users hinges on capturing their interest in games. However, existing representation learning methods crafted for handling billions of items in recommendation systems are unsuitable for game advertising and recommendation. This is primarily due to game sparsity, where the mere hundreds of games fall short for large-scale user representation learning, and game imbalance, where user behaviors are overwhelmingly dominated by a handful of popular games. To address the sparsity issue, we introduce the User Game Lifecycle (UGL), designed to enrich user behaviors in games. Additionally, we propose two innovative strategies aimed at manipulating user behaviors to more effectively extract both short and long-term interests. To tackle the game imbalance challenge, we present an Inverse Probability Masking strategy for UGL representation learning. The offline and online experimental results demonstrate that the UGL representations significantly enhance the models, achieving a 1.83% AUC offline increase on average and a 21.67% CVR online increase on average for game advertising, and a 0.5% AUC offline increase and a 0.82% ARPU online increase for in-game item recommendation.
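A minimal sketch of an Inverse Probability Masking step, assuming a simple 1/frequency weighting (the paper's exact scheme is not given in the abstract):

```python
import random
from collections import Counter

def inverse_probability_mask(behavior_seq, base_rate=0.15, seed=0):
    """Mask each game token with probability inversely proportional to the
    game's frequency, so that, in expectation, every distinct game
    contributes roughly the same number of masked prediction targets."""
    rng = random.Random(seed)
    freq = Counter(behavior_seq)
    return ["[MASK]" if rng.random() < min(1.0, base_rate / freq[g]) else g
            for g in behavior_seq]
```

Since each occurrence of game g is masked with probability base_rate / freq[g], a game appearing 100 times and a game appearing once both yield about base_rate masked targets, counteracting the dominance of popular games.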
[26] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs
Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye
Main category: cs.CL
TL;DR: This paper proposes a framework to specialize MedGemma model for generating high-fidelity medical image captions to improve Retrieval-Augmented Generation systems’ effectiveness with image-based queries in Malaysian Clinical Practice Guidelines.
Details
Motivation: General Vision-Language Model captions lack clinical specificity and factual grounding, limiting the effectiveness of Retrieval-Augmented Generation systems with image-based queries in medical contexts.
Method: Employed knowledge distillation pipeline to create synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tuned MedGemma using parameter-efficient QLoRA method.
Result: The fine-tuned model demonstrated substantial improvements in classification performance and significant gains in caption faithfulness and correctness through RAGAS framework evaluation.
Conclusion: The work establishes a robust pipeline for specializing medical VLMs and validates the model as a high-quality query generator for enhancing multimodal RAG systems in evidence-based clinical decision support.
Abstract: Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.
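For orientation, a typical QLoRA setup with the Hugging Face stack looks roughly like the following; the checkpoint name, target modules, and hyperparameters are illustrative, not the authors' reported configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen backbone (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it",  # illustrative checkpoint name
    quantization_config=bnb_config,
)

# Small trainable low-rank adapters on top of the frozen 4-bit weights.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% trainable
```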
[27] When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs
Hongcheng Liu, Pingjie Wang, Yuhao Wang, Siqu Ou, Yanfeng Wang, Yu Wang
Main category: cs.CL
TL;DR: The paper introduces GuessBench, a benchmark for evaluating active reasoning in multimodal large language models (MLLMs), showing that current models perform poorly in actively acquiring missing evidence compared to passive inference settings.
Details
Motivation: Existing MLLM evaluations focus on passive inference with complete information, which misaligns with real-world scenarios where models need to actively acquire missing evidence under incomplete information.
Method: Proposed GuessBench benchmark with perception-oriented and knowledge-oriented images, requiring MLLMs to actively select target images from candidate pools without task-specific priors and iteratively refine decisions.
Result: Evaluation of 20 superior MLLMs shows performance on active reasoning lags far behind passive settings. Fine-grained perception and timely decision-making are identified as key challenges. Perceptual enhancements benefit smaller models, while thinking-oriented methods provide consistent gains across model sizes.
Conclusion: There is substantial room for improvement in multimodal active reasoning, with promising research directions identified through the analysis of different enhancement approaches.
Abstract: Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require the MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 superior MLLMs and find that performance on active reasoning lags far behind it on passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.
[28] Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering
Xiangchen Song, Yuchen Liu, Yaxuan Luan, Jinxu Guo, Xiaofan Guo
Main category: cs.CL
TL;DR: A controllable abstract summary generation method using prompt engineering for large language models, featuring a multi-stage framework that analyzes semantics, topics, and noise to produce summaries at different abstraction levels.
Details
Motivation: To address issues of summary quality and controllability in traditional abstract generation methods, particularly for large language models where prompt design and text preprocessing significantly impact output quality.
Method: Multi-stage prompt generation framework that performs semantic analysis, topic modeling, and noise control on input text to generate summaries with varying abstraction levels, tested on CNN/Daily Mail dataset with analysis of prompt lengths, data noise, and text types.
Result: Prompt length significantly impacts summary quality - both very short and very long prompts decrease quality. Data noise negatively affects generation (ROUGE-L scores decrease with increasing noise). Model performs best on news texts and worse on academic articles.
Conclusion: The research provides insights for improving summary generation using large language models, showing that controlling prompt strategies and optimizing text preprocessing can enhance summary accuracy and controllability.
Abstract: This study presents a controllable abstract summary generation method for large language models based on prompt engineering. To address the issues of summary quality and controllability in traditional methods, we design a multi-stage prompt generation framework. This framework generates summaries with varying levels of abstraction by performing semantic analysis, topic modeling, and noise control on the input text. The experiment uses the CNN/Daily Mail dataset and provides a detailed analysis of different prompt lengths, data noise, and text types. The experimental results show that prompt length has a significant impact on the quality of generated summaries. Both very short and very long prompt tokens result in a decrease in summary quality. Data noise also negatively affects the summary generation process. As noise levels increase, the ROUGE-L score gradually decreases. Furthermore, different text types have varying effects on the model’s ability to generate summaries. The model performs best when handling news texts, while its performance is worse when processing academic articles. This research provides new insights into improving summary generation using large language models, particularly in how controlling prompt strategies and optimizing text preprocessing can enhance summary accuracy and controllability.
[29] CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs
Gucongcong Fan, Chaoyue Niu, Chengfei Lyu, Fan Wu, Guihai Chen
Main category: cs.CL
TL;DR: CORE is a collaborative framework that combines cloud and local LLMs to reduce UI exposure while maintaining task accuracy for mobile agents, achieving up to 55.6% reduction in UI exposure.
Details
Motivation: Cloud-based LLMs require uploading full UI states, exposing unnecessary and often irrelevant information, while local LLMs avoid UI uploads but suffer from limited capacity and lower task success rates.
Method: CORE uses layout-aware block partitioning to group semantically related UI elements, co-planning where local and cloud LLMs collaboratively identify sub-tasks, and co-decision-making where local LLM ranks UI blocks and cloud LLM selects specific elements within top-ranked blocks, with multi-round accumulation to mitigate local misjudgment.
Result: Experiments show CORE reduces UI exposure by up to 55.6% while maintaining task success rates slightly below cloud-only agents, effectively mitigating unnecessary privacy exposure to the cloud.
Conclusion: CORE successfully combines the strengths of cloud and local LLMs to reduce UI exposure while maintaining high task accuracy, providing an effective privacy-preserving solution for mobile agents.
Abstract: Mobile agents rely on Large Language Models (LLMs) to plan and execute tasks on smartphone user interfaces (UIs). While cloud-based LLMs achieve high task accuracy, they require uploading the full UI state at every step, exposing unnecessary and often irrelevant information. In contrast, local LLMs avoid UI uploads but suffer from limited capacity, resulting in lower task success rates. We propose $\textbf{CORE}$, a $\textbf{CO}$llaborative framework that combines the strengths of cloud and local LLMs to $\textbf{R}$educe UI $\textbf{E}$xposure, while maintaining task accuracy for mobile agents. CORE comprises three key components: (1) $\textbf{Layout-aware block partitioning}$, which groups semantically related UI elements based on the XML screen hierarchy; (2) $\textbf{Co-planning}$, where local and cloud LLMs collaboratively identify the current sub-task; and (3) $\textbf{Co-decision-making}$, where the local LLM ranks relevant UI blocks, and the cloud LLM selects specific UI elements within the top-ranked block. CORE further introduces a multi-round accumulation mechanism to mitigate local misjudgment or limited context. Experiments across diverse mobile apps and tasks show that CORE reduces UI exposure by up to 55.6% while maintaining task success rates slightly below cloud-only agents, effectively mitigating unnecessary privacy exposure to the cloud. The code is available at https://github.com/Entropy-Fighter/CORE.
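A toy sketch of layout-aware block partitioning, under the assumption that sibling leaves of an XML container are semantically related (attribute names are illustrative):

```python
import xml.etree.ElementTree as ET

def partition_ui_blocks(xml_screen, min_children=2):
    """Group leaf UI elements under their container node in the XML screen
    hierarchy, so semantically related elements travel together and only the
    selected block's contents ever need to reach the cloud model."""
    root = ET.fromstring(xml_screen)
    blocks = []
    for container in root.iter():
        leaves = [c for c in container if len(list(c)) == 0]
        if len(leaves) >= min_children:
            blocks.append([leaf.attrib.get("text") or leaf.tag
                           for leaf in leaves])
    return blocks
```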
[30] DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, Xingxing Wei
Main category: cs.CL
TL;DR: DeceptionBench is the first benchmark that systematically evaluates deceptive behaviors in LLMs across five societal domains, revealing critical vulnerabilities and amplified deception under reinforcement dynamics.
Details
Motivation: Despite LLMs' advances, their rapid capability enhancement introduces emergent deceptive behaviors that pose severe risks in high-stakes deployments, with characterization across real-world scenarios remaining underexplored.
Method: Established DeceptionBench with 150 scenarios across Economy, Healthcare, Education, Social Interaction, and Entertainment domains (over 1,000 samples). Evaluated intrinsic patterns (egoistic vs sycophantic behaviors) and extrinsic factors (neutral conditions, reward-based incentivization, coercive pressures) with multi-turn interaction loops.
Result: Extensive experiments across LLMs and LRMs reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, showing models lack robust resistance to manipulative contextual cues.
Conclusion: Current models lack robust resistance to manipulative contextual cues, demonstrating urgent need for advanced safeguards against various deception behaviors.
Abstract: Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic real-world scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at https://github.com/Aries-iai/DeceptionBench.
[31] Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References?
Ashutosh Bajpai, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: This paper addresses the lack of temporal consistency in LLMs by introducing a benchmark for temporal referential consistency and proposing a reasoning path alignment-based model to improve it.
Details
Motivation: LLMs are increasingly used as knowledge sources in time-sensitive domains like law, healthcare, and finance, requiring not just factual accuracy but also temporal consistency across different time references.
Method: The authors introduce the TEMP-ReCon benchmark to evaluate temporal referential consistency across multiple languages, and propose UnTRaP, a reasoning path alignment-based model to enhance temporal consistency in LLMs.
Result: Empirical experiments show that LLMs exhibit insufficient temporal referential consistency, and the proposed UnTRaP model demonstrates efficacy compared to several baseline models.
Conclusion: There is a critical need for improving temporal consistency in LLMs, and the proposed reasoning path alignment approach shows promise in addressing this gap for time-sensitive applications.
Abstract: The increasing acceptance of large language models (LLMs) as an alternative to knowledge sources marks a significant paradigm shift across various domains, including time-sensitive fields such as law, healthcare, and finance. To fulfill this expanded role, LLMs must not only be factually accurate but also demonstrate consistency across temporal dimensions, necessitating robust temporal reasoning capabilities. Despite this critical requirement, efforts to ensure temporal consistency in LLMs remain scarce, including a noticeable absence of endeavors aimed at evaluating or augmenting LLMs across temporal references in time-sensitive inquiries. In this paper, we seek to address this gap by introducing a novel benchmark entitled temporal referential consistency, accompanied by a resource TEMP-ReCon designed to benchmark a wide range of both open-source and closed-source LLMs with various linguistic contexts characterized by differing resource richness (including English, French, and Romanian). The findings emphasize that LLMs do exhibit insufficient temporal referential consistency. To address this, we propose UnTRaP, a reasoning path alignment-based model that aims to enhance the temporal referential consistency of LLMs. Our empirical experiments substantiate the efficacy of UnTRaP compared to several baseline models.
[32] From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
Rares Dolga, Lucas Maystre, Tudor Berariu, David Barber
Main category: cs.CL
TL;DR: Proposes a dynamic character grouping method that enhances BPE tokenization by adding end-of-patch markers and a second-level BPE compression stage, achieving efficient, flexible, and language-agnostic representations without additional models.
Details
Motivation: Subword tokenization methods like BPE balance compactness and representational power but are inefficient for rare words and require large embedding matrices. Character-level models address these issues but create performance bottlenecks. Existing hierarchical models have limitations like language dependency or additional model requirements.
Method: Dynamic character grouping that leverages existing BPE tokenization by appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity.
Result: Empirical results show the approach matches or exceeds performance of dynamic entropy- and whitespace-based patching strategies while maintaining compact vocabulary.
Conclusion: The proposed method offers efficient, flexible, and language-agnostic representations without requiring additional models, effectively bridging the gap between subword and character-level tokenization approaches.
Abstract: Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.
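A toy sketch of the two ingredients, assuming GPT-2-style `Ġ` word-boundary prefixes and a plain pair-merge loop (both are assumptions about details the abstract leaves open):

```python
from collections import Counter

EOP = "</p>"  # explicit end-of-patch marker appended to BPE tokens

def mark_patches(bpe_tokens):
    """Append an end-of-patch marker to every BPE token that closes a word,
    i.e. the next token starts a new word or the sequence ends."""
    marked = []
    for i, tok in enumerate(bpe_tokens):
        closes = i == len(bpe_tokens) - 1 or bpe_tokens[i + 1].startswith("Ġ")
        marked.append(tok + EOP if closes else tok)
    return marked

def second_level_bpe(sequences, num_merges):
    """Second-level BPE over token sequences: repeatedly merge the most
    frequent adjacent pair; num_merges controls patch granularity."""
    for _ in range(num_merges):
        pairs = Counter(p for seq in sequences for p in zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        sequences = [_merge(seq, a, b) for seq in sequences]
    return sequences

def _merge(seq, a, b):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out
```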
[33] Latent Reasoning in LLMs as a Vocabulary-Space Superposition
Jingcheng Deng, Liang Pang, Zihao Wei, Shichen Xu, Zenghao Duan, Kun Xu, Yang Song, Huawei Shen, Xueqi Cheng
Main category: cs.CL
TL;DR: Latent-SFT enables efficient latent reasoning by restricting the latent space to vocabulary probabilities, achieving performance matching explicit reasoning while reducing computational overhead by up to 4x.
Details
Motivation: Explicit reasoning with chain-of-thought prompting introduces substantial computational overhead, while existing latent reasoning methods suffer from significant performance degradation due to unstructured latent spaces.
Method: Two-stage learning framework: 1) Use specialized attention masks to guide latent token generation, 2) Train LLM to autonomously generate latent tokens using KL and CE losses, treating latent reasoning as superposition over vocabulary probabilities.
Result: Sets new SOTA on GSM8k, matches explicit SFT performance while cutting reasoning chains by 4x, outperforms prior latent methods on Math500 and AIME24, and shows latent reasoning compresses single paths while superposing multiple paths.
Conclusion: Latent-SFT demonstrates that structured latent reasoning in vocabulary space can achieve efficient computation without performance degradation, representing both path compression and multi-path superposition.
Abstract: Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM vocabulary, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL and CE losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while cutting reasoning chains by up to 4 times and outperforming prior latent methods. On Math500 and AIME24, lexical probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both the compression of a single path and the superposition of multiple paths.
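The vocabulary-space restriction has a compact reading; a sketch of the assumed mechanics (names are illustrative):

```python
import torch
import torch.nn.functional as F

def latent_step(hidden_state, lm_head_weight, embedding_matrix):
    """One latent reasoning step as a superposition over the vocabulary: the
    latent token is a probability vector over tokens, and the embedding fed
    forward is the probability-weighted mixture of token embeddings.
    Collapsing to explicit reasoning replaces the mixture with the one-hot
    argmax token."""
    probs = F.softmax(hidden_state @ lm_head_weight.T, dim=-1)  # shape (V,)
    latent_embedding = probs @ embedding_matrix                 # shape (d,)
    return probs, latent_embedding
```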
[34] TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou
Main category: cs.CL
TL;DR: TokenTiming enables universal speculative decoding for LLM acceleration by using Dynamic Time Warping to align draft and target model token sequences, overcoming vocabulary mismatch limitations and achieving 1.57x speedup without retraining.
Details
Motivation: Traditional speculative decoding is limited by requiring draft and target models to share the same vocabulary, restricting available draft models and often requiring training new models from scratch.
Method: Proposes TokenTiming algorithm that re-encodes draft token sequences and uses Dynamic Time Warping (DTW) to build mappings for transferring probability distributions in speculative sampling, accommodating mismatched vocabularies.
Result: Achieves 1.57x speedup across various tasks, enabling universal draft model selection without retraining or model modification.
Conclusion: TokenTiming makes speculative decoding a more versatile and practical tool for LLM acceleration by removing vocabulary constraints and allowing use of any off-the-shelf models.
Abstract: Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.
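The alignment idea can be illustrated with textbook DTW; the paper's exact cost function and probability-transfer rule are not reproduced here:

```python
import numpy as np

def dtw_align(draft_tokens, target_tokens,
              dist=lambda a, b: 0.0 if a == b else 1.0):
    """Classic DTW over two token sequences, returning a monotone warping
    path of (draft_index, target_index) pairs along which draft-token
    probabilities could be transferred for speculative sampling."""
    n, m = len(draft_tokens), len(target_tokens)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(draft_tokens[i - 1], target_tokens[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack from (n, m) to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1, j - 1], i - 1, j - 1),
                      (cost[i - 1, j], i - 1, j),
                      (cost[i, j - 1], i, j - 1))
    return path[::-1]
```

For example, `dtw_align(["tok", "en"], ["to", "ken"])` produces a monotone pairing of positions even though the two tokenizers segment the word differently.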
[35] Rethinking Cross-lingual Gaps from a Statistical Viewpoint
Vihari Piratla, Purvam Jain, Darshan Singh, Partha Talukdar, Trevor Cohn
Main category: cs.CL
TL;DR: The paper identifies response variance in target languages as the main cause of cross-lingual knowledge gaps in LLMs, proposes bias-variance decomposition to formalize this gap, and demonstrates interventions that reduce variance and improve target language accuracy by 20-25%.
Details
Motivation: Previous research attributed cross-lingual gaps to divergence in latent representations, but this work takes an alternative view that variance of responses in target languages is the primary cause of accuracy drops when knowledge is queried across languages.
Method: Formalizes the cross-lingual gap using bias-variance decomposition, conducts extensive experiments to support the hypothesis, and implements inference-time interventions to control response variance, including simple prompt instructions.
Result: Experimental evidence supports the proposed formulation, and variance-reducing interventions significantly improved target language accuracy by 20-25% across different LLM models.
Conclusion: Response variance in target languages is a key factor in cross-lingual gaps, and controlling this variance through simple interventions can substantially improve cross-lingual knowledge accessibility in LLMs.
Abstract: Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried from target languages. Prior research has pointed to a cross-lingual gap, viz., a drop in accuracy when the knowledge is queried in a target language compared to when the query is in the source language. Existing research has rationalized divergence in latent representations in source and target languages as the source of the cross-lingual gap. In this work, we take an alternative view and hypothesize that the variance of responses in the target language is the main cause of this gap. For the first time, we formalize the cross-lingual gap in terms of bias-variance decomposition. We present extensive experimental evidence that supports the proposed formulation and hypothesis. We then reinforce our hypothesis through multiple inference-time interventions that control the variance and reduce the cross-lingual gap. We demonstrate a simple prompt instruction to reduce the response variance, which improved target accuracy by 20-25% across different models.
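Schematically, the decomposition being invoked is the textbook squared-loss form, shown here only as a sketch of the formalization (the paper's notation may differ), with the stochastic target-language response written as a random variable and y the correct answer:

```latex
\mathbb{E}\left[(\hat{y}_{\mathrm{tgt}} - y)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{y}_{\mathrm{tgt}}] - y\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{y}_{\mathrm{tgt}} - \mathbb{E}[\hat{y}_{\mathrm{tgt}}]\right)^2\right]}_{\text{variance}}
```

The paper's hypothesis is that the variance term, not the bias term, dominates the cross-lingual gap, which is why variance-reducing prompt instructions can recover much of the target-language accuracy.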
[36] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation
Jinliang Liu
Main category: cs.CL
TL;DR: ParallaxRAG is a KG-RAG framework that decouples queries and graph triples into multi-view spaces, using specialized attention heads for different reasoning stages to improve multi-hop reasoning while reducing hallucination.
Details
Motivation: LLMs struggle with multi-hop reasoning and hallucination, while existing KG-RAG methods rely on flat embeddings and noisy path exploration.
Method: Symmetrically decouples queries and graph triples into multi-view spaces, enforcing head diversity and constraining weakly related paths by leveraging specialized attention heads for different reasoning stages.
Result: Competitive retrieval and QA performance on WebQSP and CWQ datasets, with reduced hallucination and good generalization using BGE-M3 + Llama3.1-8B setup.
Conclusion: Multi-view head specialization is a principled direction for knowledge-grounded multi-hop reasoning, enabling cleaner subgraphs and step-wise reasoning guidance.
Abstract: Large language models (LLMs) excel at language understanding but often hallucinate and struggle with multi-hop reasoning. Knowledge-graph-based retrieval-augmented generation (KG-RAG) offers grounding, yet most methods rely on flat embeddings and noisy path exploration. We propose ParallaxRAG, a framework that symmetrically decouples queries and graph triples into multi-view spaces, enabling a robust retrieval architecture that explicitly enforces head diversity while constraining weakly related paths. Central to our approach is the observation that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain. This specialization allows ParallaxRAG to construct cleaner subgraphs and guide LLMs through grounded, step-wise reasoning. Experiments on WebQSP and CWQ, under our unified, reproducible setup (BGE-M3 + Llama3.1-8B), demonstrate competitive retrieval and QA performance, alongside reduced hallucination and good generalization. Our results highlight multi-view head specialization as a principled direction for knowledge-grounded multi-hop reasoning. Our implementation will be released as soon as the paper is accepted.
[37] KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models
Dongjun Kim, Chanhee Park, Chanjun Park, Heuiseok Lim
Main category: cs.CL
TL;DR: KITE is a new Korean instruction-following benchmark that addresses the lack of evaluation tools for Korean LLMs, focusing on open-ended tasks rather than factual knowledge or multiple-choice tests.
Details
Motivation: Current LLM evaluations are predominantly English-focused, neglecting linguistic and cultural nuances of other languages like Korean with its unique syntax, morphology, honorifics, and dual numbering systems.
Method: Developed KITE benchmark with diverse open-ended instruction-following tasks, using evaluation pipeline that combines automated metrics with human assessments.
Result: The evaluation revealed performance disparities across models and provided deeper insights into their strengths and weaknesses in Korean instruction-following.
Conclusion: KITE dataset and code are publicly released to foster culturally and linguistically inclusive LLM development and inspire similar benchmarks for other underrepresented languages.
Abstract: The instruction-following capabilities of large language models (LLMs) are pivotal for numerous applications, from conversational agents to complex reasoning systems. However, current evaluations predominantly focus on English models, neglecting the linguistic and cultural nuances of other languages. Specifically, Korean, with its distinct syntax, rich morphological features, honorific system, and dual numbering systems, lacks a dedicated benchmark for assessing open-ended instruction-following capabilities. To address this gap, we introduce the Korean Instruction-following Task Evaluation (KITE), a comprehensive benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks that focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks. Our evaluation pipeline combines automated metrics with human assessments, revealing performance disparities across models and providing deeper insights into their strengths and weaknesses. By publicly releasing the KITE dataset and code, we aim to foster further research on culturally and linguistically inclusive LLM development and inspire similar endeavors for other underrepresented languages.
[38] Finetuning LLMs for EvaCun 2025 token prediction shared task
Josef Jon, Ondřej Bojar
Main category: cs.CL
TL;DR: The paper presents a submission for the EvaCun 2025 token prediction task using fine-tuned LLMs (Command-R, Mistral, and Aya Expanse) without domain-specific adjustments.
Details
Motivation: To participate in the EvaCun 2025 token prediction task despite having only superficial knowledge of the subject field and languages involved.
Method: Fine-tuned three different LLMs (Command-R, Mistral, Aya Expanse) on provided task data without preprocessing, and compared three different prompting approaches on held-out data.
Result: The paper evaluates three different prompting approaches on held-out data, but specific performance results are not provided in the abstract.
Conclusion: The approach demonstrates using off-the-shelf LLMs with minimal domain adaptation for token prediction tasks, though final performance outcomes are not detailed.
Abstract: In this paper, we present our submission for the token prediction task of EvaCun 2025. Our systems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only possess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare three different approaches (based on three different prompts) for obtaining the predictions, and we evaluate them on a held-out part of the data.
[39] From Ghazals to Sonnets: Decoding the Polysemous Expressions of Love Across Languages
Syed Mohammad Sualeh Ali
Main category: cs.CL
TL;DR: Analysis of Urdu poetry’s nuanced expressions of love through polysemic study of three synonymous words (pyaar, muhabbat, ishq) and comparative word embeddings between Urdu and English.
Details
Motivation: To explore the thematic depths of Urdu poetry and uncover subtle emotional distinctions in love expressions that lack direct English equivalents.
Method: Polysemic case study approach examining word usage in Urdu poetry, combined with comparative analysis using word embeddings for Urdu and English love-related terms.
Result: Revealed hidden layers of meaning and subtle distinctions between the three Urdu words, with word embeddings quantifying semantic differences between Urdu and English love expressions.
Conclusion: The study provides deeper understanding of Urdu poetry’s unique portrayal of love, highlighting cultural and linguistic nuances through polysemic analysis and computational methods.
Abstract: This paper delves into the intricate world of Urdu poetry, exploring its thematic depths through a lens of polysemy. By focusing on the nuanced differences between three seemingly synonymous words (pyaar, muhabbat, and ishq) we expose a spectrum of emotions and experiences unique to the Urdu language. This study employs a polysemic case study approach, meticulously examining how these words are interwoven within the rich tapestry of Urdu poetry. By analyzing their usage and context, we uncover a hidden layer of meaning, revealing subtle distinctions which lack direct equivalents in English literature. Furthermore, we embark on a comparative analysis, generating word embeddings for both Urdu and English terms related to love. This enables us to quantify and visualize the semantic space occupied by these words, providing valuable insights into the cultural and linguistic nuances of expressing love. Through this multifaceted approach, our study sheds light on the captivating complexities of Urdu poetry, offering a deeper understanding and appreciation for its unique portrayal of love and its myriad expressions.
[40] BiMax: Bidirectional MaxSim Score for Document-Level Alignment
Xiaotian Wang, Takehito Utsuro, Masaaki Nagata
Main category: cs.CL
TL;DR: Proposes BiMax, a cross-lingual document alignment method that achieves comparable accuracy to Optimal Transport with 100x speed improvement.
Details
Motivation: Need for efficient document alignment methods that balance both accuracy and speed for large-scale web mining applications.
Method: Uses cross-lingual Bidirectional MaxSim score (BiMax) for computing document-to-document similarity, improving efficiency over Optimal Transport methods.
Result: On WMT16 bilingual document alignment task, BiMax achieves comparable accuracy to OT with approximately 100x speed increase.
Conclusion: BiMax provides an efficient alternative to OT for document alignment, and all methods are available as EmbDA tool.
Abstract: Document alignment is necessary for hierarchical mining (Bañón et al., 2020; Morishita et al., 2022), which aligns documents across source and target languages within the same web domain. Several high precision sentence embedding-based methods have been developed, such as TK-PERT (Thompson and Koehn, 2020) and Optimal Transport (OT) (Clark et al., 2019; El-Kishky and Guzmán, 2020). However, given the massive scale of web mining data, both accuracy and speed must be considered. In this paper, we propose a cross-lingual Bidirectional MaxSim score (BiMax) for computing doc-to-doc similarity, to improve efficiency compared to the OT method. Consequently, on the WMT16 bilingual document alignment task, BiMax attains accuracy comparable to OT with an approximate 100-fold speed increase. Meanwhile, we also conduct a comprehensive analysis to investigate the performance of current state-of-the-art multilingual sentence embedding models. All the alignment methods in this paper are publicly available as a tool called EmbDA (https://github.com/EternalEdenn/EmbDA).
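A plausible reading of the score as a sketch, assuming L2-normalized sentence embeddings per document (the paper's exact pooling and weighting may differ):

```python
import torch

def bimax_score(src_embs, tgt_embs):
    """Bidirectional MaxSim: for each sentence embedding in one document
    take its best match in the other document, average, and symmetrize.
    src_embs: (n_src, d), tgt_embs: (n_tgt, d), both L2-normalized."""
    sim = src_embs @ tgt_embs.T              # pairwise cosine similarities
    forward = sim.max(dim=1).values.mean()   # source -> target MaxSim
    backward = sim.max(dim=0).values.mean()  # target -> source MaxSim
    return 0.5 * (forward + backward)
```

Each document pair costs only one matrix product and two max-reductions, which is one way to see where a large speedup over Optimal Transport could come from.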
[41] The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works
Antoine Bourgois, Thierry Poibeau
Main category: cs.CL
TL;DR: A new annotated corpus of three full-length French novels (285K+ tokens) is introduced to address the scarcity of fully annotated long documents for coreference resolution, enabling evaluation of models on long reference chains.
Details
Motivation: Coreference resolution lacks representative datasets of fully annotated long documents, especially for complex literary works, limiting evaluation of models on long reference chains.
Method: A modular coreference resolution pipeline is developed that allows for fine-grained error analysis and scales effectively to long documents.
Result: The approach is competitive and scales effectively to long documents, demonstrating usefulness for inferring gender of fictional characters.
Conclusion: The corpus and pipeline are relevant for both literary analysis and downstream NLP tasks, addressing the gap in long document coreference resolution resources.
Abstract: While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.
[42] HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Yew-Soon Ong, Anirudh Goyal, Dianbo Liu
Main category: cs.CL
TL;DR: HypoSpace is a diagnostic suite that evaluates LLMs’ ability to generate diverse, valid hypotheses for underdetermined scientific problems, measuring Validity, Uniqueness, and Recovery across three structured domains.
Details
Motivation: Language models are increasingly used in scientific workflows where multiple distinct hypotheses can explain the same observations, requiring evaluation beyond single correct answers.
Method: Treats LLMs as samplers of finite hypothesis sets and measures three indicators: Validity (precision), Uniqueness (non-redundancy), and Recovery (coverage). Applied to three structured domains with deterministic validators: causal graphs, 3D voxel reconstruction, and Boolean genetic interactions.
Result: Across instruction-tuned and reasoning-focused models, Validity remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse invisible to correctness-only metrics.
Conclusion: HypoSpace provides a controlled probe for methods that explore and cover admissible explanation spaces, highlighting limitations in current LLMs’ hypothesis generation capabilities.
Abstract: As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
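Under the stated definitions, the three indicators reduce to simple set arithmetic; a sketch assuming hashable hypothesis representations:

```python
def hypospace_metrics(proposals, admissible):
    """Validity = precision of proposals against the admissible set;
    Uniqueness = non-redundancy among proposals;
    Recovery = coverage of the enumerated admissible set."""
    admissible = set(admissible)
    if not proposals or not admissible:
        return 0.0, 0.0, 0.0
    valid = [p for p in proposals if p in admissible]
    validity = len(valid) / len(proposals)
    uniqueness = len(set(proposals)) / len(proposals)
    recovery = len(set(valid)) / len(admissible)
    return validity, uniqueness, recovery
```

Mode collapse shows up here as high validity with uniqueness and recovery falling as the admissible set grows, exactly the pattern the paper reports.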
[43] Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection
Joshua Wolfe Brook, Ilia Markov
Main category: cs.CL
TL;DR: Using LLMs as dynamic knowledge bases to generate background context for hate speech detection, with two context generation strategies and four incorporation methods tested on textual and multimodal datasets.
Details
Motivation: To improve hate speech detection by leveraging LLMs to provide contextual background information that helps classifiers better understand implicit hate speech.
Method: Two context generation strategies (named entities and full-text prompting) and four context incorporation methods (text concatenation, embedding concatenation, hierarchical transformer fusion, LLM-driven text enhancement) tested on Latent Hatred and MAMI datasets.
Result: Achieved gains of up to 3 F1 points on textual data and 6 F1 points on multimodal data compared to zero-context baseline, with embedding concatenation performing best.
Conclusion: Both contextual information and the method of incorporation are crucial for effective hate speech detection, with significant improvements demonstrated across textual and multimodal settings.
Abstract: This research introduces a novel approach to textual and multimodal Hate Speech Detection (HSD), using Large Language Models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of HSD classifiers. Two context generation strategies are examined: one focused on named entities and the other on full-text prompting. Four methods of incorporating context into the classifier input are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments are conducted on the textual Latent Hatred dataset of implicit hate speech and applied in a multimodal setting on the MAMI dataset of misogynous memes. Results suggest that both the contextual information and the method by which it is incorporated are key, with gains of up to 3 and 6 F1 points on textual and multimodal setups respectively, from a zero-context baseline to the highest-performing system, based on embedding concatenation.
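A sketch of the best-performing variant, embedding concatenation, under an assumed BERT-style encoder with a pooled output (architecture details are illustrative):

```python
import torch
import torch.nn as nn

class EmbeddingConcatClassifier(nn.Module):
    """Encode the post and the LLM-generated background context separately,
    concatenate the pooled embeddings, and classify."""
    def __init__(self, encoder, hidden_size=768, n_classes=2):
        super().__init__()
        self.encoder = encoder  # e.g. a BERT-style model with pooler_output
        self.head = nn.Linear(2 * hidden_size, n_classes)

    def forward(self, post_inputs, context_inputs):
        post_emb = self.encoder(**post_inputs).pooler_output
        ctx_emb = self.encoder(**context_inputs).pooler_output
        return self.head(torch.cat([post_emb, ctx_emb], dim=-1))
```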
[44] Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth
Helia Hashemi, Victor Rühle, Saravan Rajmohan
Main category: cs.CL
TL;DR: A retrieval-augmented reasoning model that dynamically adjusts retrieved document length using reinforcement learning with cost-aware optimization, achieving 16-20% latency reduction and 5% effectiveness improvement.
Details
Motivation: Retrieval-augmented reasoning models suffer from high computational costs as both retrieval and reasoning tokens contribute significantly to resource usage.
Method: Proposed dynamic document length adjustment based on query and retrieval results, developed cost-aware advantage function for RL training, and implemented memory- and latency-bound versions for policy optimization algorithms.
Result: Model latency decreased by ~16-20% across datasets while effectiveness increased by ~5% on average in terms of exact match, evaluated on seven public QA datasets.
Conclusion: The approach achieves significant efficiency gains without compromising effectiveness, demonstrating that retrieval-augmented reasoning can be optimized for both performance and computational cost.
Abstract: Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training of efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.
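The cost-aware advantage admits a one-line sketch; the linear token penalty and λ are assumptions, since the abstract does not give the exact function:

```python
def cost_aware_advantage(reward, baseline, n_retrieval_tokens,
                         n_reasoning_tokens, lam=1e-3):
    """Standard advantage minus a token-budget penalty, pushing the policy
    toward shorter retrieved lists and reasoning traces when the query
    permits, for use inside PPO- or GRPO-style updates."""
    return (reward - baseline) - lam * (n_retrieval_tokens + n_reasoning_tokens)
```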
[45] Attention Sinks in Diffusion Language Models
Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, Alessio Devoto
Main category: cs.CL
TL;DR: This paper analyzes attention patterns in Masked Diffusion Language Models (DLMs), revealing they exhibit dynamic attention sinks that shift during generation and are more robust to sink removal compared to autoregressive models.
Details
Motivation: While DLMs have shown promise as alternatives to ARMs with parallel token generation, their internal mechanisms and attention patterns remain largely unexplored, particularly the attention sinking phenomenon observed in transformers.
Method: Conducted empirical analysis of DLM attention patterns, focusing on attention sinking phenomenon, comparing characteristics with ARMs and testing robustness through sink masking experiments.
Result: DLMs exhibit dynamic attention sinks that shift positions during generation, unlike ARMs. More importantly, DLMs remain robust to sink removal with only minor performance degradation, while ARMs are highly sensitive.
Conclusion: The study reveals fundamental differences in how DLMs allocate and utilize attention compared to ARMs, providing new insights into diffusion-based language model mechanisms and their robustness characteristics.
Abstract: Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.
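Attention sinks are commonly located by measuring how much attention mass each key position absorbs; below is a minimal probe in that spirit, where the 0.3 threshold and the averaging choices are illustrative rather than the paper's exact procedure. Running such a probe at successive denoising steps would expose the shifting sink positions the paper reports for DLMs.

```python
import torch

def find_attention_sinks(attn, threshold=0.3):
    """Flag sink positions in one attention map.

    attn: (num_heads, seq_len, seq_len) attention weights, rows summing to 1.
    Returns key positions that receive, on average, more than `threshold`
    of the attention mass across heads and query positions.
    """
    received = attn.mean(dim=0).mean(dim=0)        # mass received per key position
    return (received > threshold).nonzero(as_tuple=True)[0]

# toy example: position 0 hoards attention, as in a classic autoregressive sink
attn = torch.full((8, 16, 16), 0.02)
attn[:, :, 0] = 0.70                               # rows still sum to 1.0
print(find_attention_sinks(attn))                  # tensor([0])
```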
[46] LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation
Gao Yang, Yuhang Liu, Siyu Miao, Xinyue Liang, Zhengyang Liu, Heyan Huang
Main category: cs.CL
TL;DR: This paper proposes using game theory principles for LLM evaluation through automatic mutual evaluation where LLMs assess each other’s outputs, with peer reviews aggregated using game-theoretic voting algorithms and compared to human judgments.
Details
Motivation: Conventional LLM evaluation methods using fixed-format tasks with reference answers are inadequate for capturing the nuanced, subjective, and open-ended nature of modern LLM behavior.
Method: Automatic mutual evaluation framework where LLMs assess each other through self-play and peer review, with game-theoretic voting algorithms used to aggregate peer reviews and compare them with human voting behavior.
Result: Empirical results show both convergences and divergences between theoretical predictions and human evaluations, providing insights into the promises and limitations of mutual evaluation.
Conclusion: This is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating LLM capabilities, offering a novel approach to address limitations of conventional evaluation practices.
Abstract: Ideal or real - that is the question. In this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, where LLMs assess each other’s output through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to evaluate their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.
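The abstract does not name which voting rules are used; as one concrete illustration of aggregating peer reviews into a single ranking, here is a Borda count, a standard rule from social choice theory:

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate peer rankings with a Borda count (one illustrative rule).

    rankings: list of rankings, each a list of model names from best to worst.
    Returns model names sorted by total Borda score, best first.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for place, model in enumerate(ranking):
            scores[model] += n - 1 - place   # best gets n-1 points, worst gets 0
    return sorted(scores, key=scores.get, reverse=True)

# three LLM "judges" each rank the models' outputs on some prompt
peer_reviews = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
print(borda_aggregate(peer_reviews))  # ['A', 'B', 'C']
```

The aggregated ranking can then be compared against human votes, e.g. with a rank correlation.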
[47] On Non-interactive Evaluation of Animal Communication Translators
Orr Paradise, David F. Gruber, Adam Tauman Kalai
Main category: cs.CL
TL;DR: Proposes a reference-free method to evaluate AI animal translators using segment-by-segment translation and shuffle tests, showing it correlates well with standard metrics and suggesting interaction may not be necessary initially.
Details
Motivation: To develop a safe, ethical, and cost-effective way to validate AI animal translators without needing direct animal interaction or reference translations, addressing the challenge of detecting hallucinations in translations.
Method: Uses segment-by-segment translation combined with NLP shuffle tests to evaluate if translations make more sense in order than when permuted. Validated on data-scarce human languages and constructed languages.
Result: The proposed method correlates highly with standard reference-based evaluation metrics. Proof-of-concept experiments show potential utility for evaluating animal communication translators.
Conclusion: Interaction may not be necessary for evaluating complex language translators in early stages. The shuffle-based evaluation method offers a viable reference-free alternative that correlates well with traditional metrics.
Abstract: If you had an AI Whale-to-English translator, how could you validate whether or not it is working? Does one need to interact with the animals or rely on grounded observations such as temperature? We provide theoretical and proof-of-concept experimental evidence suggesting that interaction and even observations may not be necessary for sufficiently complex languages. One may be able to evaluate translators solely by their English outputs, offering potential advantages in terms of safety, ethics, and cost. This is an instance of machine translation quality evaluation (MTQE) without any reference translations available. A key challenge is identifying "hallucinations," false translations which may appear fluent and plausible. We propose using segment-by-segment translation together with the classic NLP shuffle test to evaluate translators. The idea is to translate animal communication, turn by turn, and evaluate how often the resulting translations make more sense in order than permuted. Proof-of-concept experiments on data-scarce human languages and constructed languages demonstrate the potential utility of this evaluation methodology. These human-language experiments serve solely to validate our reference-free metric under data scarcity. It is found to correlate highly with a standard evaluation based on reference translations, which are available in our experiments. We also perform a theoretical analysis suggesting that interaction may not be necessary nor efficient in the early stages of learning to translate.
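A minimal sketch of the shuffle test's core comparison follows; the toy word-overlap scorer stands in for a real coherence measure such as negative LM perplexity of the joined text:

```python
import random

def coherence(segments):
    """Toy stand-in scorer: shared-word overlap between adjacent segments.
    A real system would use, e.g., negative LM perplexity of the joined text."""
    return sum(len(set(a.lower().split()) & set(b.lower().split()))
               for a, b in zip(segments, segments[1:]))

def shuffle_test(translated_segments, n_permutations=100, seed=0):
    """Fraction of random permutations scoring worse than the true order.
    Near 1.0: the translations cohere in sequence; near 0.5: no order signal,
    a warning sign of segment-level hallucination."""
    rng = random.Random(seed)
    original = coherence(translated_segments)
    perm = list(translated_segments)
    worse = 0
    for _ in range(n_permutations):
        rng.shuffle(perm)
        worse += coherence(perm) < original
    return worse / n_permutations

segments = ["the whale surfaces", "the whale dives deep", "deep water is cold"]
print(shuffle_test(segments))
```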
[48] Emergence of Linear Truth Encodings in Language Models
Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, Alberto Bietti
Main category: cs.CL
TL;DR: A transparent one-layer transformer model demonstrates how linear truth subspaces emerge in language models through co-occurrence patterns of factual statements, revealing a two-phase learning dynamic.
Details
Motivation: To understand the mechanism behind the emergence of linear subspaces that separate true from false statements in large language models, which previous probing studies have revealed but not explained.
Method: Developed a simple one-layer transformer toy model and studied a data distribution where factual statements co-occur with other factual statements, encouraging the model to learn truth distinctions to reduce language modeling loss.
Result: The toy model successfully reproduced truth subspaces end-to-end and revealed a two-phase learning process: first memorizing individual factual associations, then learning linear separation of true from false statements over longer training.
Conclusion: The study provides both mechanistic demonstration and empirical motivation for how linear truth representations emerge in language models through co-occurrence patterns and learning dynamics.
Abstract: Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then – over a longer horizon – learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.
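Probing studies typically verify such a subspace by fitting a linear classifier on hidden states; the sketch below uses synthetic activations with a planted truth direction in place of real residual-stream features, so the data construction is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
truth_direction = rng.normal(size=256)     # planted "truth" axis (synthetic)
labels = rng.integers(0, 2, size=1000)     # 1 = true statement, 0 = false

# stand-in for layer activations: noise plus/minus the truth direction
hidden_states = rng.normal(size=(1000, 256)) + np.outer(2 * labels - 1, truth_direction)

probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
print(f"linear separability: {probe.score(hidden_states, labels):.2f}")
# near-perfect here by construction; in the toy transformer this separation
# only emerges in the second, slower phase of training
```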
[49] Paper2Web: Let’s Make Your Paper Alive!
Yuhang Chen, Tianpeng Lv, Siyi Zhang, Yixiang Yin, Yao Wan, Philip S. Yu, Dongping Chen
Main category: cs.CL
TL;DR: Paper2Web introduces a benchmark and evaluation framework for academic webpage generation, along with PWAgent - an autonomous pipeline that converts scientific papers into interactive, multimedia-rich academic homepages.
Details
Motivation: Current approaches for academic project websites struggle to produce layout-aware, interactive sites, and there is a lack of a comprehensive evaluation suite for this task.
Method: Proposes the Paper2Web benchmark with rule-based metrics and human-verified LLM-as-a-Judge evaluation, and PWAgent - an autonomous pipeline using MCP tools to iteratively refine content and layout.
Result: PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost.
Conclusion: PWAgent achieves the Pareto-front in academic webpage generation, providing an effective solution for converting scientific papers into interactive academic homepages.
Abstract: Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics like Connectivity, Completeness and human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, achieving the Pareto-front in academic webpage generation.
[50] Enhanced Sentiment Interpretation via a Lexicon-Fuzzy-Transformer Framework
Shayan Rokhva, Mousa Alizadeh, Maryam Abdollahi Shamami
Main category: cs.CL
TL;DR: A hybrid lexicon-fuzzy-transformer framework combining rule-based heuristics, contextual deep learning, and fuzzy logic for continuous sentiment scoring that captures both polarity and intensity in informal text.
Details
Motivation: Accurate sentiment detection in product reviews and social media is challenging due to informal and domain-specific language, requiring methods that can handle nuanced sentiment expression.
Method: Pipeline starts with VADER-based sentiment estimation, refined through a two-stage adjustment using DistilBERT confidence scores and fuzzy logic to reduce neutrality bias. A custom fuzzy inference system maps scores to a 0-1 continuum.
Result: Improved alignment with user ratings, better identification of sentiment extremes, and reduced misclassifications across four domain-specific datasets (food delivery, e-commerce, tourism, fashion).
Conclusion: Integration of symbolic reasoning with neural models enables interpretable, fine-grained sentiment analysis in linguistically dynamic domains.
Abstract: Accurately detecting sentiment polarity and intensity in product reviews and social media posts remains challenging due to informal and domain-specific language. To address this, we propose a novel hybrid lexicon-fuzzy-transformer framework that combines rule-based heuristics, contextual deep learning, and fuzzy logic to generate continuous sentiment scores reflecting both polarity and strength. The pipeline begins with VADER-based initial sentiment estimations, which are refined through a two-stage adjustment process. This involves leveraging confidence scores from DistilBERT, a lightweight transformer, and applying fuzzy logic principles to mitigate excessive neutrality bias and enhance granularity. A custom fuzzy inference system then maps the refined scores onto a 0 to 1 continuum, producing expert-like judgments. The framework is rigorously evaluated on four domain-specific datasets: food delivery, e-commerce, tourism, and fashion. Results show improved alignment with user ratings, better identification of sentiment extremes, and reduced misclassifications. Both quantitative metrics (distributional alignment, confusion matrices) and qualitative insights (case studies, runtime analysis) affirm the model's robustness and efficiency. This work demonstrates the value of integrating symbolic reasoning with neural models for interpretable, fine-grained sentiment analysis in linguistically dynamic domains.
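A minimal sketch of the lexicon-plus-transformer fusion step, assuming the vaderSentiment and transformers packages; the linear blend and rescaling below are simple placeholders for the paper's two-stage adjustment and fuzzy inference system:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline

vader = SentimentIntensityAnalyzer()
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

def sentiment_score(text, alpha=0.5):
    """Blend the VADER compound score with DistilBERT confidence, then map
    to [0, 1]; alpha and the linear blend stand in for the fuzzy rules."""
    lex = vader.polarity_scores(text)["compound"]            # in [-1, 1]
    pred = clf(text)[0]
    conf = pred["score"] if pred["label"] == "POSITIVE" else -pred["score"]
    blended = alpha * lex + (1 - alpha) * conf               # still in [-1, 1]
    return (blended + 1) / 2                                 # continuous 0..1 score

print(sentiment_score("The delivery was late but the food was genuinely amazing!"))
```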
[51] SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling
Kadri Hacioglu, Manjunath K E, Andreas Stolcke
Main category: cs.CL
TL;DR: The paper analyzes slot filling in spoken language understanding using speech-based large language models, identifies performance gaps, and proposes improvements to training data, architecture, and strategies to bridge these gaps.
Details
Motivation: Traditional slot filling uses cascaded speech recognition and NLU components, but speechLLMs offer unified, generative approaches with zero-shot abilities and better generalization to unseen slot labels.
Method: Created an empirical upper bound for slot filling, identified performance gaps, and proposed improvements to training data, architecture, and training strategies for speechLLMs.
Result: Each proposed improvement measure substantially enhanced performance, though practical challenges remain.
Conclusion: The study provides empirical guidance and insights for effectively using emerging speech-based large language models in slot filling tasks.
Abstract: Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improves performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.
[52] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang
Main category: cs.CL
TL;DR: ORBIT framework uses rubric-based incremental RL training to improve LLM performance in medical dialogue, achieving state-of-the-art results on HealthBench-Hard benchmark without external medical knowledge.
Details
Motivation: Current RL methods work well in domains with clear rewards (math, code) but struggle in open-ended domains like medical consultation where rewards are ambiguous and subjective.
Method: ORBIT integrates synthetic dialogue generation with dynamic rubric creation, using rubric-guided feedback for incremental reinforcement learning without external medical knowledge.
Result: Implemented on Qwen3-4B-Instruct model, performance on HealthBench-Hard benchmark improved from 7.0 to 27.2 using only 2k samples, achieving state-of-the-art for this scale.
Conclusion: Rubric-based feedback is a scalable strategy for advancing LLMs in complex, open-ended tasks, fostering consistent performance gains across diverse consultation scenarios.
Abstract: Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.
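A minimal sketch of a rubric-scored reward of the kind ORBIT's RL loop implies; the `judge_satisfies` checker is a toy keyword stand-in for an LLM grader, and the rubric items are invented examples:

```python
def judge_satisfies(response: str, criterion: str) -> bool:
    """Hypothetical judge. A real system would prompt an LLM grader with the
    criterion; this toy stand-in just checks the criterion's keyword."""
    keyword = criterion.split(":")[0]
    return keyword.lower() in response.lower()

def rubric_reward(response, rubric):
    """Weighted fraction of rubric criteria satisfied, usable as an RL reward."""
    total = sum(weight for _, weight in rubric)
    earned = sum(weight for criterion, weight in rubric
                 if judge_satisfies(response, criterion))
    return earned / total

rubric = [("duration: asks how long symptoms have lasted", 2.0),
          ("triage: flags red-flag symptoms needing urgent care", 3.0),
          ("caution: avoids a definitive diagnosis without an exam", 2.0)]
reply = ("How long has the pain lasted? Given its duration and your age, "
         "please seek triage at an emergency department now.")
print(rubric_reward(reply, rubric))  # 5/7, about 0.71
```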
[53] PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction
Simon Yu, Gang Li, Weiyan Shi, Peng Qi
Main category: cs.CL
TL;DR: PolySkill enables LLM agents to learn generalizable, compositional skills by decoupling abstract goals from concrete implementations, improving skill reuse and success rates across websites.
Details
Motivation: Existing methods for skill learning in LLM agents create over-specialized skills that fail to generalize across different websites and environments.
Method: PolySkill framework inspired by polymorphism in software engineering, separating a skill’s abstract goal (what it accomplishes) from its concrete implementation (how it’s executed).
Result: 1.7x improvement in skill reuse on seen websites, 9.4% success rate boost on Mind2Web, 13.9% on unseen websites, over 20% reduction in steps, and improved task quality in self-exploration settings.
Conclusion: Separating skill goals from execution is crucial for developing autonomous agents capable of continual learning and generalization across the open web.
Abstract: Large language models (LLMs) are moving beyond static uses and are now powering agents that learn continually during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize. We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill’s abstract goal (what it accomplishes) and its concrete implementation (how it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites and (2) boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, while reducing steps by over 20%. (3) In self-exploration settings without specified tasks, our framework improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites. By enabling the agent to identify and refine its own goals, PolySkill enhances the agent’s ability to learn a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. This work provides a practical path toward building agents capable of continual learning in adaptive environments. Our findings show that separating a skill’s goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously.
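The polymorphism analogy maps directly onto code: callers bind to a skill's abstract goal while each website supplies its own implementation. A minimal sketch, in which the class names, selectors, and `agent` interface are all illustrative:

```python
from abc import ABC, abstractmethod

class SearchProductSkill(ABC):
    """Abstract goal: find a product page for a query (the 'what')."""

    @abstractmethod
    def execute(self, agent, query: str) -> str:
        """Site-specific action sequence (the 'how')."""

class AmazonSearch(SearchProductSkill):
    def execute(self, agent, query):
        agent.click("#twotabsearchtextbox")
        agent.type(query)
        return agent.first_result()

class EbaySearch(SearchProductSkill):
    def execute(self, agent, query):
        agent.click("#gh-ac")
        agent.type(query)
        return agent.first_result()

def run_task(skill: SearchProductSkill, agent, query: str) -> str:
    # callers depend only on the abstract goal, so plans compose across sites
    return skill.execute(agent, query)
```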
[54] Evaluating Large Language Models with Psychometrics
Yuan Li, Yue Huang, Hongyi Wang, Ying Cheng, Xiangliang Zhang, James Zou, Lichao Sun
Main category: cs.CL
TL;DR: A comprehensive psychometric benchmark for evaluating LLMs’ psychological constructs including personality, values, emotional intelligence, theory of mind, and self-efficacy using 13 diverse datasets.
Details
Motivation: To understand whether LLMs exhibit consistent psychological patterns across different contexts and deepen understanding of their behaviors as they become more integrated into society.
Method: Developed a psychometric assessment framework with five key psychological constructs, assessed through 13 datasets featuring diverse scenarios and item types, inspired by psychometrics.
Result: Found significant discrepancies between LLMs’ self-reported traits and their response patterns in real-world scenarios, and discovered that some preference-based tests designed for humans don’t work reliably with LLMs.
Conclusion: Provides thorough psychometric assessment of LLMs, offering insights into reliable evaluation methods and potential applications in AI and social sciences.
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in solving various tasks, progressively evolving into general-purpose assistants. The increasing integration of LLMs into society has sparked interest in whether they exhibit psychological patterns, and whether these patterns remain consistent across different contexts – questions that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a comprehensive benchmark for quantifying psychological constructs of LLMs, encompassing psychological dimension identification, assessment dataset design, and assessment with results validation. Our work identifies five key psychological constructs – personality, values, emotional intelligence, theory of mind, and self-efficacy – assessed through a suite of 13 datasets featuring diverse scenarios and item types. We uncover significant discrepancies between LLMs’ self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors. Our findings also show that some preference-based tests, originally designed for humans, could not solicit reliable responses from LLMs. This paper offers a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and social sciences.
[55] Cross-layer Attention Sharing for Pre-trained Large Language Models
Yongyu Mu, Yuzhang Wu, Yuchun Fan, Chenglong Wang, Hengyu Li, Jiali Zeng, Qiaozhi He, Murun Yang, Fandong Meng, Jie Zhou, Tong Xiao, Jingbo Zhu
Main category: cs.CL
TL;DR: LISA is a lightweight self-attention substitute that reduces redundancy in LLMs by sharing attention weights across layers using tiny feed-forward networks and low-rank matrices, achieving significant computational savings while maintaining performance.
Details
Motivation: Previous methods focus on KV cache compression or attention head grouping but overlook redundancy between layers. Analysis shows highly similar attention patterns persist across most layers in LLMs.
Method: LISA uses tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate differences in layer-wise attention weights.
Result: LISA reduces redundant attention calculations in 53%-84% of total layers, achieves 6x compression of Q and K matrices, and improves throughput by up to 40.1% while maintaining high response quality across 13 benchmarks.
Conclusion: LISA effectively addresses layer redundancy in LLM attention mechanisms, providing significant computational efficiency gains without compromising model performance.
Abstract: To enhance the efficiency of the attention mechanism within large language models (LLMs), previous works primarily compress the KV cache or group attention heads, while largely overlooking redundancy between layers. Our comprehensive analyses across various LLMs show that highly similar attention patterns persist within most layers. It’s intuitive to reduce the redundancy by sharing attention weights across layers. However, further analysis reveals two challenges: (1) Directly sharing the weight matrix without carefully rearranging the attention heads proves to be ineffective; (2) Shallow layers are vulnerable to small deviations in attention weights. Driven by these insights, we introduce LISA, a lightweight substitute for self-attention in well-trained LLMs. LISA employs tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate differences in layer-wise attention weights. Evaluations encompassing 13 typical benchmarks demonstrate that LISA maintains high response quality in terms of accuracy and perplexity while reducing redundant attention calculations within 53%-84% of the total layers. Our implementations of LISA achieve a 6x compression of Q and K matrices within the attention mechanism, with maximum throughput improvements of 19.5%, 32.3%, and 40.1% for LLaMA3-8B, LLaMA2-7B, and LLaMA2-13B, respectively.
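One way to read the architecture described above is a single shared projection reused across layers, with a tiny per-layer alignment FFN plus a low-rank additive delta; the wiring and dimensions below are assumptions based on the abstract, not LISA's published design:

```python
import torch
import torch.nn as nn

class SharedLowRankQ(nn.Module):
    """Share one base query projection across layers; each layer adds only a
    small aligner FFN and a low-rank delta (illustrative, query side only)."""

    def __init__(self, d_model: int, num_layers: int, rank: int = 32):
        super().__init__()
        self.q_base = nn.Linear(d_model, d_model, bias=False)   # shared weights
        self.align = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model // 16), nn.ReLU(),
                          nn.Linear(d_model // 16, d_model))
            for _ in range(num_layers))
        self.delta_a = nn.ParameterList(
            nn.Parameter(torch.randn(d_model, rank) * 0.02) for _ in range(num_layers))
        self.delta_b = nn.ParameterList(
            nn.Parameter(torch.randn(rank, d_model) * 0.02) for _ in range(num_layers))

    def query(self, x: torch.Tensor, layer: int) -> torch.Tensor:
        shared = self.align[layer](self.q_base(x))              # aligned shared part
        delta = x @ self.delta_a[layer] @ self.delta_b[layer]   # cheap per-layer term
        return shared + delta
```

Per layer this costs two low-rank matrices and a bottleneck FFN instead of a full d_model x d_model projection, which is where the reported compression would come from.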
[56] To Err Is Human; To Annotate, SILICON? Reducing Measurement Error in LLM Annotation
Xiang Cheng, Raveesh Mayya, João Sedoc
Main category: cs.CL
TL;DR: The paper introduces SILICON methodology to systematically reduce measurement error in LLM-based text annotation by addressing four error sources: guideline-induced, baseline-induced, prompt-induced, and model-induced errors.
Details
Motivation: LLMs promise cost-effective text annotation for management research, but validity depends on minimizing discrepancies between LLM labels and ground truth, and ensuring long-term reproducibility of results.
Method: Developed the SILICON methodology with empirical validation across 7 management research cases, including iterative guideline refinement, expert baselines, system prompt optimization, multi-LLM labeling for low-confidence items, and regression-based reproducibility protocols.
Result: Iterative guidelines increased LLM-human agreement; expert baselines showed higher inter-annotator agreement; system prompts reduced error; model performance varied by task; multi-LLM labeling was cost-effective; regression method addressed model retirement.
Conclusion: Reducing all four error sources is necessary, and SILICON supports reproducible, rigorous annotation in management research.
Abstract: Unstructured text data annotation is foundational to management research and Large Language Models (LLMs) promise a cost-effective and scalable alternative to human annotation. The validity of insights drawn from LLM annotated data critically depends on minimizing the discrepancy between LLM assigned labels and the unobserved ground truth, as well as ensuring long-term reproducibility of results. We address the gap in the literature on LLM annotation by decomposing measurement error in LLM-based text annotation into four distinct sources: (1) guideline-induced error from inconsistent annotation criteria, (2) baseline-induced error from unreliable human reference standards, (3) prompt-induced error from suboptimal meta-instruction formatting, and (4) model-induced error from architectural differences across LLMs. We develop the SILICON methodology to systematically reduce measurement error from LLM annotation in all four sources above. Empirical validation across seven management research cases shows that iteratively refined guidelines substantially increase the LLM-human agreement compared to one-shot guidelines; expert-generated baselines exhibit higher inter-annotator agreement and are less prone to producing misleading LLM-human agreement estimates than crowdsourced baselines; placing content in the system prompt reduces prompt-induced error; and model performance varies substantially across tasks. To further reduce error, we introduce a cost-effective multi-LLM labeling method, where only low-confidence items receive additional labels from alternative models. Finally, in addressing closed source model retirement cycles, we introduce an intuitive regression-based methodology to establish robust reproducibility protocols. Our evidence indicates that reducing each error source is necessary, and that SILICON supports reproducible, rigorous annotation in management research.
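A minimal sketch of the confidence-gated multi-LLM labeling step; `label_with` is a hypothetical wrapper around an annotation API, stubbed here with canned responses:

```python
from collections import Counter

FAKE_RESPONSES = {  # toy stand-in for real LLM annotation calls
    "cheap-model": ("positive", 0.55),
    "model-b": ("negative", 0.90),
    "model-c": ("negative", 0.70),
}

def label_with(model, item):
    """Hypothetical wrapper returning (label, confidence); stubbed here."""
    return FAKE_RESPONSES[model]

def multi_llm_label(item, primary, backups, conf_threshold=0.8):
    """Label with one cheap model; escalate only low-confidence items to
    additional models and resolve by majority vote."""
    label, conf = label_with(primary, item)
    if conf >= conf_threshold:
        return label
    votes = [label] + [label_with(m, item)[0] for m in backups]
    return Counter(votes).most_common(1)[0][0]

print(multi_llm_label("some review text", "cheap-model", ["model-b", "model-c"]))
# primary confidence 0.55 < 0.8, so the backups vote -> 'negative'
```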
[57] PAFT: Prompt-Agnostic Fine-Tuning
Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu
Main category: cs.CL
TL;DR: PAFT is a fine-tuning method that improves LLM robustness to prompt variations by using dynamic prompt sampling during training, achieving better generalization and faster inference.
Details
Motivation: Standard fine-tuning causes LLMs to overfit to specific prompt wording, making them sensitive to minor phrasing changes that drastically reduce performance.
Method: PAFT generates diverse synthetic prompts and continuously samples from them during training, forcing models to learn fundamental task principles rather than surface-level patterns.
Result: PAFT achieves 7% higher generalization accuracy on unseen prompts, 3.2x faster inference speeds, and superior performance on question answering, mathematical reasoning, and tool use benchmarks.
Conclusion: PAFT effectively enhances LLM robustness and cross-domain generalization ability while maintaining or improving overall performance across multiple tasks.
Abstract: Fine-tuning large language models (LLMs) often causes overfitting to specific prompt wording, where minor phrasing variations drastically reduce performance. To address this, we propose Prompt-Agnostic Fine-Tuning (PAFT), a method that enhances robustness through dynamic prompt variation during training. PAFT first generates diverse synthetic prompts, then continuously samples from this set to construct training instances, forcing models to learn fundamental task principles rather than surface-level patterns. Across systematic evaluations using both supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT), PAFT demonstrates substantially improved prompt robustness, achieving 7% higher generalization accuracy on unseen prompts than standard methods. In addition to enhanced robustness, PAFT consistently yields superior overall performance on established benchmarks for question answering, mathematical reasoning, and tool use. Notably, models trained with PAFT attain 3.2x faster inference speeds due to reduced prompt sensitivity. Ablation studies further validate the effectiveness of PAFT, while theoretical analysis reveals that PAFT can effectively enhance the cross-domain generalization ability of LLMs.
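A minimal sketch of the dynamic prompt sampling at the heart of PAFT; the templates and dataset wiring are illustrative, and in the paper the prompt set is itself LLM-generated:

```python
import random

PROMPT_TEMPLATES = [                     # stand-ins for LLM-generated variants
    "Answer the question: {q}",
    "Q: {q}\nA:",
    "You are a helpful tutor. Solve the following problem.\n{q}",
    "Please respond concisely to: {q}",
]

def paft_batch(examples, rng=random.Random(0)):
    """Pair each training example with a freshly sampled prompt template,
    so the model cannot overfit to any single phrasing."""
    return [(rng.choice(PROMPT_TEMPLATES).format(q=ex["question"]), ex["answer"])
            for ex in examples]

data = [{"question": "What is 7 * 8?", "answer": "56"}]
print(paft_batch(data))
```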
[58] Event Segmentation Applications in Large Language Model Enabled Automated Recall Assessments
Ryan A. Panela, Alex J. Barnett, Morgan D. Barense, Björn Herrmann
Main category: cs.CL
TL;DR: LLMs can automate event segmentation and recall assessment, providing scalable alternatives to manual human judgments with high accuracy and consistency.
Details
Motivation: Current research methods for event segmentation and memory recall rely on subjective human judgments that are time-consuming. There's a need for automated, scalable approaches that maintain validity with human responses.
Method: Leveraged Large Language Models (LLMs) using chat completion models for event segmentation and text-embedding models for recall assessment. Validated against human annotations.
Result: LLMs accurately identify event boundaries, with human segmentation being more consistent with LLMs than among humans themselves. Semantic similarity between segmented events and participant recall can estimate recall performance.
Conclusion: LLMs can effectively simulate human segmentation patterns and provide scalable recall evaluations, opening new avenues for studying perception, memory, and cognitive impairment using AI-driven methodologies.
Abstract: Understanding how individuals perceive and recall information in their natural environments is critical to understanding potential failures in perception (e.g., sensory loss) and memory (e.g., dementia). Event segmentation, the process of identifying distinct events within dynamic environments, is central to how we perceive, encode, and recall experiences. This cognitive process not only influences moment-to-moment comprehension but also shapes event-specific memory. Despite the importance of event segmentation and event memory, current research methodologies rely heavily on human judgements for assessing segmentation patterns and recall ability, which are subjective and time-consuming. A few approaches have been introduced to automate event segmentation and recall scoring, but validity with human responses and ease of implementation require further advancements. To address these concerns, we leverage Large Language Models (LLMs) to automate event segmentation and assess recall, employing chat completion and text-embedding models, respectively. We validated these models against human annotations and determined that LLMs can accurately identify event boundaries, and that human event segmentation is more consistent with LLMs than among humans themselves. Using this framework, we advanced an automated approach for recall assessments, which revealed that semantic similarity between segmented narrative events and participant recall can estimate recall performance. Our findings demonstrate that LLMs can effectively simulate human segmentation patterns and provide recall evaluations that are a scalable alternative to manual scoring. This research opens novel avenues for studying the intersection between perception, memory, and cognitive impairment using methodologies driven by artificial intelligence.
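A minimal sketch of the recall-scoring idea, using the sentence-transformers package as one concrete embedding backend (the paper uses a text-embedding model but does not mandate this one):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def recall_scores(event_segments, recall_text):
    """Cosine similarity between each segmented narrative event and the
    participant's free recall; higher similarity ~ better-recalled event."""
    event_emb = model.encode(event_segments, convert_to_tensor=True)
    recall_emb = model.encode([recall_text], convert_to_tensor=True)
    return util.cos_sim(event_emb, recall_emb).squeeze(1).tolist()

events = ["The hikers set off at dawn.", "A storm forced them into a cave."]
print(recall_scores(events, "I remember they hid in a cave when it started raining."))
```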
[59] Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models
Bushi Xiao, Michael Bennie, Jayetri Bardhan, Daisy Zhe Wang
Main category: cs.CL
TL;DR: PRISMATIC is the first multimodal structural priming dataset that introduces a novel evaluation metric (SPI) to assess structural priming in MLLMs, showing that fusion-encoded models exhibit human-like correlations between priming effects and visual similarity.
Details
Motivation: To investigate whether multimodal large language models exhibit structural priming effects similar to humans, and to provide a standardized benchmark for studying syntax-vision interactions in computational linguistics.
Method: Introduced the PRISMATIC dataset and the Syntactic Preservation Index (SPI) metric. Tested two multimodal encoding architectures to assess structural preservation capabilities.
Result: Both encoding methods showed comparable syntactic priming effects, but only fusion-encoded models demonstrated robust positive correlations between priming effects and visual similarity, aligning with human psycholinguistic patterns.
Conclusion: This work provides new insights into evaluating syntactic information processing in multimodal language models and reveals that fusion-encoded models exhibit more human-like cognitive processes in structural priming.
Abstract: Structural priming is a cognitive phenomenon where exposure to a particular syntactic structure increases the likelihood of producing the same structure in subsequent utterances. While humans consistently demonstrate structural priming effects across various linguistic contexts, it remains unclear whether multimodal large language models (MLLMs) exhibit similar syntactic preservation behaviors. We introduce PRISMATIC, the first multimodal structural priming dataset, which advances computational linguistics by providing a standardized benchmark for investigating syntax-vision interactions. We propose the Syntactic Preservation Index (SPI), a novel reference-free evaluation metric designed specifically to assess structural priming effects in sentence level. Using this metric, we constructed and tested models with two different multimodal encoding architectures to investigate their structural preservation capabilities. Our experimental results demonstrate that models with both encoding methods show comparable syntactic priming effects. However, only fusion-encoded models exhibit robust positive correlations between priming effects and visual similarity, suggesting a cognitive process more aligned with human psycholinguistic patterns. This work provides new insights into evaluating and understanding how syntactic information is processed in multimodal language models.
[60] Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation
Angelo Ziletti, Leonardo D’Ambrosi
Main category: cs.CL
TL;DR: An automated system using LLMs to translate clinical criteria into SQL queries for patient cohort identification, achieving 0.75 F1-score on EHR data.
Details
Motivation: Manual translation of clinical inclusion/exclusion criteria into SQL queries is challenging and time-consuming for patient recruitment and observational studies.
Method: Uses LLMs with criteria parsing, two-level retrieval augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation with patient funnels.
Result: Achieves 0.75 F1-score in cohort identification on EHR data, effectively capturing complex temporal and logical relationships.
Conclusion: Demonstrates feasibility of automated cohort generation for epidemiological research.
Abstract: Clinical cohort definition is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains challenging and manual. We present an automated system utilizing large language models that combines criteria parsing, two-level retrieval augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts with patient funnels. The system achieves 0.75 F1-score in cohort identification on EHR data, effectively capturing complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
[61] EMCee: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context
Hamin Koo, Jaehyung Kim
Main category: cs.CL
TL;DR: EMCee is a framework that improves LLM performance in non-English languages by extracting and using language-specific knowledge from the model itself through synthetic context extraction and judgment-based merging.
Details
Motivation: LLMs perform poorly in non-English languages due to English-centric training data, and existing multilingual methods lack the language- and culture-specific grounding essential for some queries.
Method: EMCee extracts synthetic context to uncover latent language-specific knowledge from the LLM, then dynamically merges this contextual insight with reasoning outputs using a judgment-based selection mechanism.
Result: EMCee outperforms prior approaches on four multilingual benchmarks, achieving 16.4% average relative improvement overall and 31.7% in low-resource languages.
Conclusion: The proposed EMCee framework effectively enhances multilingual capabilities of LLMs by leveraging the model’s own latent knowledge through explicit context extraction and merging.
Abstract: Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages. While existing multilingual prompting methods emphasize reformulating queries into English or enhancing reasoning capabilities, they often fail to incorporate the language- and culture-specific grounding that is essential for some queries. To address this limitation, we propose EMCee (Extracting synthetic Multilingual Context and merging), a simple yet effective framework that enhances the multilingual capabilities of LLMs by explicitly extracting and utilizing query-relevant knowledge from the LLM itself. In particular, EMCee first extracts synthetic context to uncover latent, language-specific knowledge encoded within the LLM, and then dynamically merges this contextual insight with reasoning-oriented outputs through a judgment-based selection mechanism. Extensive experiments on four multilingual benchmarks covering diverse languages and tasks demonstrate that EMCee consistently outperforms prior approaches, achieving an average relative improvement of 16.4% overall and 31.7% in low-resource languages.
[62] LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss
Pau Rodriguez, Michal Klein, Eleonora Gualdoni, Valentino Maiorca, Arno Blaas, Luca Zappella, Marco Cuturi, Xavier Suau
Main category: cs.CL
TL;DR: LinEAS is a linear end-to-end activation steering method that efficiently controls generative models using minimal unpaired data, addressing distributional shifts across all layers simultaneously for better robustness and performance.
Details
Motivation: The need for efficient control mechanisms in generative models that require low data volume, are computationally cheap, and preserve output quality, while overcoming the limitations of existing crude activation steering methods.
Method: Linear end-to-end activation steering (LinEAS) trained with a global loss that accounts for all layer-wise distributional shifts simultaneously, with optional sparsifying norms for automatic neuron selection.
Result: LinEAS outperforms similar baselines on toxicity mitigation in language models and existing activation steering methods in text-to-image generation, becoming competitive with oracle-dependent methods despite using only unpaired samples.
Conclusion: LinEAS provides an effective, robust, and data-efficient approach for controlling generative models across modalities, requiring only minimal unpaired data while achieving strong performance comparable to supervised methods.
Abstract: The growing use of generative models in daily life calls for efficient mechanisms to control their generation, to e.g., produce safe content or provide users with tools to explore style changes. Ideally, such mechanisms should require low volume of unpaired data (i.e., without explicit preference), and should be cheap, both at train and inference time, while preserving output quality. Recent research has shown that such mechanisms can be obtained by intervening exclusively on model activations, with the goal of correcting distributional differences between activations seen when using prompts from a source vs. a target set (e.g., toxic and non-toxic sentences). While cheap, these fast methods are inherently crude: their maps are tuned locally, not accounting for their impact on downstream layers, resulting in interventions that cause unintended shifts when used out-of-sample. We propose in this work linear end-to-end activation steering (LinEAS), an approach trained with a global loss that accounts simultaneously for all layer-wise distributional shifts. In addition to being more robust, the loss used to train LinEAS can be regularized with sparsifying norms, which can automatically carry out neuron selection. LinEAS only requires a handful of unpaired samples to be effective, and beats similar baselines on toxicity mitigation in language models, becoming competitive with oracle-dependent methods that have access to strong supervision. LinEAS is modality-agnostic and we empirically find that it outperforms existing activation steering methods at mitigating and including new concepts at the output of single-step text-to-image generation models.
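A rough sketch of the LinEAS shape: one learnable affine map per intervened layer, trained with a one-dimensional distributional loss between steered source activations and target activations, plus a sparsifying penalty. The sorting-based Wasserstein-1 loss and the wiring are assumptions consistent with the abstract, not the paper's exact objective:

```python
import torch
import torch.nn as nn

class AffineSteer(nn.Module):
    """One learnable affine map per intervened layer: scale * h + shift."""
    def __init__(self, num_layers: int, d: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_layers, d))
        self.shift = nn.Parameter(torch.zeros(num_layers, d))

    def forward(self, h, layer):
        return self.scale[layer] * h + self.shift[layer]

def sliced_w1(x, y):
    """Per-neuron 1D Wasserstein-1 via sorting (assumes equal sample counts)."""
    return (x.sort(dim=0).values - y.sort(dim=0).values).abs().mean()

def lineas_loss(steer, src_acts, tgt_acts, l1=1e-3):
    """src_acts/tgt_acts: per-layer (N, d) activations from unpaired source and
    target prompts, with src_acts collected while steering is applied upstream,
    so the global loss sees each map's effect on downstream layers."""
    loss = sum(sliced_w1(steer(src_acts[k], k), tgt_acts[k])
               for k in range(len(src_acts)))
    sparsity = (steer.scale - 1).abs().sum() + steer.shift.abs().sum()
    return loss + l1 * sparsity   # L1 term can zero out per-neuron interventions
```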
[63] FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, Jey Han Lau
Main category: cs.CL
TL;DR: FLUKE is a framework for evaluating model robustness through systematic linguistic variations, revealing task-dependent impacts and significant brittleness in LLMs to natural modifications like syntax changes and negation.
Details
Motivation: To systematically assess model robustness across different linguistic levels and understand how models behave under controlled variations, moving beyond traditional corruption-style tests.
Method: FLUKE introduces controlled linguistic variations (orthography to dialect/style) using LLMs with human validation, and evaluates both fine-tuned models and LLMs across six diverse NLP tasks.
Result: Found that: (1) impact of variations is highly task-dependent; (2) LLMs show significant brittleness, with reasoning models sometimes less robust; (3) models are more brittle to natural modifications than corruption tests; (4) generation ability doesn’t correlate with robustness on downstream tasks.
Conclusion: Systematic robustness testing is crucial for understanding model behaviors, as models exhibit unexpected brittleness to linguistic variations that don’t align with traditional assumptions.
Abstract: We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels – from orthography to dialect and style – and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE’s utility by evaluating both fine-tuned models and LLMs across six diverse NLP tasks (four classification and two generation tasks), and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs still exhibit significant brittleness to certain linguistic variations, with reasoning LLMs surprisingly showing less robustness on some tasks compared to base models; (3) models are overall more brittle to natural, fluent modifications such as syntax or style changes (and especially to negation), compared to corruption-style tests such as letter flipping; (4) the ability of a model to use a linguistic feature in generation does not correlate to its robustness to this feature on downstream tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.
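A minimal sketch of the measurement such tests imply: perturb inputs, re-run the model, and report the accuracy delta. The two toy perturbations below stand in for FLUKE's LLM-generated, human-validated variations (and unlike real negation tests, this naive version does not adjust the expected label):

```python
import random

def flip_letters(text, rng=random.Random(0)):
    """Corruption-style test: swap two adjacent letters in one word."""
    words = text.split()
    i = rng.randrange(len(words))
    w = words[i]
    if len(w) > 3:
        j = rng.randrange(len(w) - 1)
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def negate(text):
    """Natural-modification test (toy): naive negation insertion."""
    return text.replace(" is ", " is not ", 1)

def robustness_drop(predict, dataset, perturb):
    """Accuracy on perturbed inputs minus accuracy on clean inputs."""
    clean = sum(predict(x) == y for x, y in dataset) / len(dataset)
    perturbed = sum(predict(perturb(x)) == y for x, y in dataset) / len(dataset)
    return perturbed - clean   # typically <= 0; large drops signal brittleness
```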
[64] Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs
Jie Ma, Ning Qu, Zhitao Gao, Rui Xing, Jun Liu, Hongbin Pei, Jiang Xie, Linyun Song, Pinghui Wang, Jing Tao, Zhou Su
Main category: cs.CL
TL;DR: DP is a trustworthy reasoning framework that leverages knowledge graph priors to reduce LLM hallucinations through progressive knowledge distillation and reasoning-introspection strategies.
Details
Motivation: To address LLM hallucinations caused by insufficient knowledge by better exploiting knowledge graph structural information and constraints that existing methods overlook.
Method: Uses progressive knowledge distillation with supervised fine-tuning and Kahneman-Tversky optimization to integrate structural priors, plus a reasoning-introspection strategy for constraint-based verification.
Result: Achieves state-of-the-art performance with 13% Hit@1 improvement on ComplexWebQuestions and generates highly trustworthy responses.
Conclusion: DP effectively utilizes knowledge graph priors to enhance LLM faithfulness and reliability, demonstrating strong performance and practical flexibility.
Abstract: Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLMs’ reasoning, while the latter can improve the reliability of response generation. Motivated by these, we propose a trustworthy reasoning framework, termed Deliberation over Priors (DP), which sufficiently utilizes the priors contained in KGs. Specifically, DP adopts a progressive knowledge distillation strategy that integrates structural priors into LLMs through a combination of supervised fine-tuning and Kahneman-Tversky optimization, thereby improving the faithfulness of relation path generation. Furthermore, our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors, ensuring the reliability of response generation. Extensive experiments on three benchmark datasets demonstrate that DP achieves new state-of-the-art performance, especially a Hit@1 improvement of 13% on the ComplexWebQuestions dataset, and generates highly trustworthy responses. We also conduct various analyses to verify its flexibility and practicality. The code is available at https://github.com/reml-group/Deliberation-on-Priors.
[65] RAGRouter: Learning to Route Queries to Multiple Retrieval-Augmented Language Models
Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Guihai Chen
Main category: cs.CL
TL;DR: RAGRouter is a novel routing framework that intelligently selects the most suitable LLM for each query in RAG scenarios by incorporating document embeddings and RAG capability embeddings with contrastive learning.
Details
Motivation: Existing routing methods rely on static parametric knowledge representations and perform suboptimally in RAG scenarios where external documents dynamically affect LLMs' ability to answer queries.
Method: Proposes RAGRouter, which uses document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Includes an extended score-threshold-based mechanism for performance-efficiency trade-offs.
Result: Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show RAGRouter outperforms the best individual LLM and existing routing methods. Achieves strong performance-efficiency trade-offs under low-latency constraints.
Conclusion: RAGRouter effectively addresses the retrieval-augmented LLM routing problem by incorporating document influence into routing decisions, demonstrating superior performance across various scenarios and LLM types.
Abstract: Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs’ ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings, covering open and closed-source LLMs, show that RAGRouter outperforms the best individual LLM and existing routing methods. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints. The code and data are available at https://github.com/OwwO99/RAGRouter.
[66] PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics
Atharva Naik, Prakam, Darsh Agrawal, Yash Mathur, Manav Kapadnis, Yuwei An, Clayton Marr, Carolyn Rose, David Mortensen
Main category: cs.CL
TL;DR: A novel benchmark for evaluating LLMs’ inductive reasoning capabilities using Programming by Examples tasks inspired by historical linguistics, with automated problem generation and controllable difficulty.
Details
Motivation: Few benchmarks examine reasoning as a standalone capability independent of domain specifics; existing benchmarks focus on mathematics, coding, or data wrangling rather than pure reasoning.
Method: Programmatically generates string rewrite cascade problems with controllable difficulty, creating two benchmarks: PBEBench-Lite for efficient model stratification and PBEBench for complex programs similar to historical linguistics.
Result: Substantial performance gap between models using test-time compute/LCoT reasoning vs. those without; solve rate drops below 5% for hard instances (cascade lengths 20-30), falling short of historical linguistics requirements.
Conclusion: Current models struggle with complex inductive reasoning tasks despite advanced scaling techniques; the benchmark provides scalable evaluation while avoiding contamination.
Abstract: Although many benchmarks evaluate the reasoning abilities of Large Language Models (LLMs) within domains such as mathematics, coding, or data wrangling, few abstract away from domain specifics to examine reasoning as a capability in and of itself. We contribute a novel type of benchmark evaluating the inductive reasoning capabilities of LLMs that is inspired by the forward reconstruction task from historical linguistics but is formulated in an extremely simple, general way (in the form of Programming by Examples). The task involves generating a cascade of simple string rewrite programs to transform a given list of input strings into a list of desired output strings. We present a fully automated pipeline that programmatically generates problems of this type with controllable difficulty, enabling scalable evaluation of reasoning models while avoiding contamination. Using this approach, we construct two benchmarks: PBEBench-Lite, which efficiently stratifies models of varying capabilities, and PBEBench, which requires models to induce programs similar in complexity to those constructed by historical linguists. Our experiments reveal a substantial performance gap between models that leverage test-time compute or LCoT (long chain-of-thought) reasoning and those that do not. Moreover, although recent models show promise, the solve rate for both of them drops below 5% for hard instances of the PBEBench dataset (ground truth cascade lengths of 20 and 30, respectively), falling well short of realistic historical linguistics requirements even with computationally expensive, popular scaling techniques from the PBE and reasoning literature. Additionally, we also study the effectiveness of different scaling strategies and the impact of various hyperparameters on the difficulty of the generated data using gpt-oss-120b, the best-performing open-source model.
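The task format itself is easy to pin down in code; a minimal sketch of applying a cascade of ordered string rewrite rules, which is exactly the kind of program a solver must induce from the input/output pairs (plain replace-all rules are a simplification of the benchmark's rule language):

```python
def apply_cascade(cascade, strings):
    """Apply an ordered list of rewrite rules to every string.

    cascade: list of (pattern, replacement) pairs, applied in order, each
    replacing all occurrences."""
    results = []
    for s in strings:
        for pattern, replacement in cascade:
            s = s.replace(pattern, replacement)
        results.append(s)
    return results

# a solver must induce a cascade like this from example pairs alone,
# e.g. palatalization followed by vowel reduction
cascade = [("ki", "tʃi"), ("a", "ə")]
print(apply_cascade(cascade, ["kita", "maki"]))  # ['tʃitə', 'mətʃi']
```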
[67] Scaling Physical Reasoning with the PHYSICS Dataset
Shenghe Zheng, Qianjia Cheng, Junchi Yao, Mengsong Wu, Haonan He, Ning Ding, Yu Cheng, Shuyue Hu, Lei Bai, Dongzhan Zhou, Ganqu Cui, Peng Ye
Main category: cs.CL
TL;DR: Introduces PHYSICS, a comprehensive dataset of 16,568 physics problems across five domains and various difficulty levels, with a new Rule+Model evaluation framework to address biases in current physics reasoning evaluation.
Details
Motivation: Physics reasoning has received limited attention in LLM development despite being reasoning-intensive and essential to real-world understanding, creating a gap in the field.Method: Curated 16,568 physics problems from over 100 textbooks using a quality-controlled pipeline, covering five domains (Mechanics, Electromagnetism, Thermodynamics, Optics, Modern Physics) from high school to graduate level. Provides reasoning paths for training data and introduces a Rule+Model evaluation framework to address unit, simplification, and precision biases.
Result: Evaluations on state-of-the-art models reveal significant limitations in handling physics-related tasks, demonstrating the need for specialized physics reasoning capabilities.
Conclusion: The PHYSICS dataset and tailored evaluation methodology will advance LLM development in physics reasoning, addressing current gaps in model capabilities for this essential domain.
Abstract: Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to facilitate this issue. Specifically, PHYSICS is curated with exercises from over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model’s physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics. The code and data can be found at: https://github.com/Zhengsh123/PHYSICS.
[68] FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun, Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry, Chen Xu, Haonan Li, Fajri Koto, Ivan Koychev, Tanmoy Chakraborty, Yuxia Wang, Salem Lahlou, Veselin Stoyanov, Sophia Ananiadou, Preslav Nakov
Main category: cs.CL
TL;DR: FinChain is the first benchmark for verifiable Chain-of-Thought evaluation in finance, addressing the gap in existing datasets that overlook intermediate reasoning steps.
Details
Motivation: Current financial benchmarks like FinQA and ConvFinQA focus on final numerical answers but neglect intermediate reasoning needed for transparency and verification in financial analysis.Method: Created FinChain benchmark spanning 58 topics across 12 financial domains with parameterized symbolic templates and executable Python traces for machine-verifiable reasoning. Proposed ChainEval metric for joint evaluation of final-answer correctness and step-level reasoning consistency.
Result: Evaluation of 26 leading LLMs revealed frontier proprietary systems have limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models significantly reduce this gap.
Conclusion: FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI systems.
Abstract: Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment metric that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Evaluating 26 leading LLMs reveals that even frontier proprietary systems exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
[69] Operationalizing Automated Essay Scoring: A Human-Aware Approach
Yenisel Plasencia-Calaña
Main category: cs.CL
TL;DR: This paper compares ML-based and LLM-based Automated Essay Scoring systems, examining accuracy, bias, robustness, and explainability for human-centric operationalization.
Details
Motivation: To address aspects beyond accuracy in AES systems and explore human-aware operationalization by comparing different approaches.Method: Comparison of various machine learning-based approaches with Large Language Models approaches across key dimensions.
Result: ML-based AES models outperform LLMs in accuracy but struggle with explainability, while LLMs provide richer explanations. Both approaches struggle with bias and robustness to edge scores.
Conclusion: The analysis identifies challenges and trade-offs between methods, contributing to more reliable and trustworthy AES methods through human-centric operationalization.
Abstract: This paper explores the human-centric operationalization of Automated Essay Scoring (AES) systems, addressing aspects beyond accuracy. We compare various machine learning-based approaches with Large Language Models (LLMs) approaches, identifying their strengths, similarities and differences. The study investigates key dimensions such as bias, robustness, and explainability, considered important for human-aware operationalization of AES systems. Our study shows that ML-based AES models outperform LLMs in accuracy but struggle with explainability, whereas LLMs provide richer explanations. We also found that both approaches struggle with bias and robustness to edge scores. By analyzing these dimensions, the paper aims to identify challenges and trade-offs between different methods, contributing to more reliable and trustworthy AES methods.
[70] Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Ankan Mullick, Saransh Sharma, Abhik Jana, Pawan Goyal
Main category: cs.CL
TL;DR: Text-only LLM Mistral-7B outperforms multimodal models on multimodal intent detection due to strong textual bias in datasets. After debiasing, performance drops significantly, especially for smaller multimodal models.
Details
Motivation: To investigate the effectiveness of LLMs and non-LLMs in multimodal intent detection and address modality bias in datasets.Method: Comparative analysis of text-only and multimodal models on MIntRec datasets, human evaluation of modality bias, and proposed debiasing framework to remove biased samples.
Result: Mistral-7B outperforms multimodal models by 9% on MIntRec-1 and 4% on MIntRec2.0. After debiasing, 70% of MIntRec-1 and 50% of MIntRec2.0 samples are removed, causing 50-60% accuracy drop in smaller multimodal models.
Conclusion: Multimodal intent datasets suffer from modality bias, requiring unbiased datasets for proper evaluation of multimodal models.
Abstract: The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50 - 60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
[71] Thinking Augmented Pre-training
Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
Main category: cs.CL
TL;DR: TPT improves LLM training data efficiency by augmenting text with automatically generated thinking trajectories, making complex tokens more learnable through step-by-step reasoning.
Details
Motivation: The compute for pre-training LLMs is growing rapidly while high-quality data remains limited, creating a need to maximize data utility. Complex tokens are difficult to learn due to their deep underlying rationales.Method: Thinking augmented Pre-Training (TPT) augments text data with automatically generated thinking trajectories, increasing training data volume and making high-quality tokens more learnable through step-by-step reasoning decomposition.
Result: TPT substantially improves LLM performance across various model sizes and families, enhancing data efficiency by 3x. For a 3B parameter model, it improves post-training performance by over 10% on challenging reasoning benchmarks.
Conclusion: TPT is an effective universal methodology that significantly improves data efficiency in LLM pre-training through thinking trajectory augmentation, demonstrating substantial performance gains across diverse training configurations.
Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10%$ on several challenging reasoning benchmarks.
[72] InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models
Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Hongxia Yang
Main category: cs.CL
TL;DR: An end-to-end FP8 training recipe for LLMs that achieves near-lossless performance with significant efficiency gains (22% faster training, 14% lower memory usage, 19% higher throughput) compared to BF16 baseline.
Details
Motivation: The immense computational cost of training LLMs is a major barrier to innovation, and while FP8 training offers theoretical efficiency gains, its adoption has been hindered by the lack of comprehensive open-source training recipes.Method: Fine-grained, hybrid-granularity quantization strategy for maintaining numerical fidelity while maximizing computational efficiency, integrated for both continual pre-training and supervised fine-tuning.
Result: The FP8 training recipe is remarkably stable and essentially lossless, achieving performance on par with BF16 baseline across reasoning benchmarks, with 22% reduction in training time, 14% decrease in peak memory usage, and 19% increase in throughput.
Conclusion: FP8 is established as a practical and robust alternative to BF16 for large-scale model training, with code release to democratize access to efficient training methods.
Abstract: The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continue pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
[73] CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding
Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
Main category: cs.CL
TL;DR: Clinical Contrastive Decoding (CCD) is a training-free framework that reduces medical hallucinations in radiology MLLMs by integrating clinical signals from expert models through dual-stage contrastive decoding.
Details
Motivation: Multimodal LLMs in radiology often generate clinically unsupported descriptions (medical hallucinations) due to over-sensitivity to clinical sections, posing serious risks in medical applications requiring accuracy.Method: CCD integrates structured clinical signals from task-specific radiology expert models using a dual-stage contrastive mechanism to refine token-level logits during generation, without modifying the base MLLM.
Result: CCD consistently improves radiology report generation performance across three datasets and multiple models, achieving up to 17% improvement in RadGraph-F1 on MIMIC-CXR dataset with state-of-the-art RRG models.
Conclusion: CCD provides a lightweight, generalizable solution for mitigating medical hallucinations by effectively bridging expert models and MLLMs in radiology.
Abstract: Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.
[74] Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing
Yang Tang, Ruijie Liu, Yifan Wang, Shiyu Li, Xi Chen
Main category: cs.CL
TL;DR: Dynamic Boosted Annealing (DBA) is an efficient fine-tuning method that uses zero-learning-rate training on general data to obtain global gradients, then applies gradient boosting and dynamic step correction during domain training, eliminating the need for general data in annealing and reducing GPU hours by 91%.
Details
Motivation: Vanilla fine-tuning methods require intricate data mixture and repeated experiments for optimal generalization, which is inefficient and time-consuming.Method: DBA obtains global gradients through zero-learning-rate training on general data, then uses these for gradient boosting and dynamic training step correction during domain training, combined with annealing learning to create a pipeline that only needs domain data.
Result: DBA achieves 5.8% average improvement in joint performance over vanilla fine-tuning across multiple tasks and base models, while reducing GPU hours by 91.0%.
Conclusion: DBA provides an efficient and universal fine-tuning solution that eliminates the need for repeated experiments and complex data mixtures while improving performance and significantly reducing computational costs.
Abstract: Large language models (LLMs) fine-tuning shows excellent implications. However, vanilla fine-tuning methods often require intricate data mixture and repeated experiments for optimal generalization. To address these challenges and streamline the training process, we propose an efficient and universal solution, Dynamic Boosted Annealing (DBA). We obtain a global gradient through zero-learning-rate training on general data, which is subsequently employed for gradient boosting and dynamic training step correction during domain training. In conjunction with annealing learning, we end up establishing a fine-tuning pipeline that relies solely on domain data without collapse. By evaluating both general and domain-specific performance across multiple tasks on several popular base models, DBA achieves an average improvement of 5.8% in joint performance over vanilla fine-tuning. Furthermore, since general data is no longer involved in annealing, repeated experiments led by data mixture are also eliminated. According to our tests, the DBA method can reduce GPU hours by 91.0% compared to the vanilla method.
[75] VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao
Main category: cs.CL
TL;DR: VitaBench is a challenging benchmark for LLM-based agents that evaluates them on complex real-world interactive tasks across food delivery, in-store consumption, and online travel services, featuring 66 tools and requiring multi-dimensional reasoning.
Details
Motivation: Existing benchmarks fail to capture the complexity of real-world agent scenarios involving extensive information processing, diverse resource utilization, and dynamic user interactions.Method: Created VitaBench with 66 tools across daily applications, using a framework that eliminates domain-specific policies to enable flexible scenario composition, yielding 100 cross-scenario and 300 single-scenario tasks derived from real user requests.
Result: Even the most advanced models achieve only 30% success rate on cross-scenario tasks and less than 50% on single-scenario tasks, highlighting the benchmark’s difficulty.
Conclusion: VitaBench serves as a valuable resource for advancing AI agent development in practical real-world applications, with code, dataset, and leaderboard publicly available.
Abstract: As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only 30% success rate on cross-scenario tasks, and less than 50% success rate on others. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications. The code, dataset, and leaderboard are available at https://vitabench.github.io/
[76] Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models
Tolúlopé Ògúnrèmí, Christopher D. Manning, Dan Jurafsky, Karen Livescu
Main category: cs.CL
TL;DR: Analysis of modality adapters in spoken language models reveals two strategies: Whisper-based models create English-based interlingua for semantic representation, while others use phonetic representation with English words.
Details
Motivation: To understand how modality adapters transform speech encoder outputs into representations that decoder language models can process in spoken language models.Method: Examined MA output representations in three SLMs (SALMONN, Qwen2-Audio, Phi-4-Multimodal-Instruct) by finding nearest decoder LM tokens to MA representations.
Result: Found two strategies: Whisper encoder models create English-based interlingua for semantic representation, while non-Whisper models use phonetic representation with English words.
Conclusion: The representation strategy depends on whether the speech encoder is trained only for speech recognition or also for translation, with Whisper-based models handling unseen languages better through semantic interlingua.
Abstract: Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don’t, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
[77] Hybrid Reinforcement: When Reward Is Sparse, It’s Better to Be Dense
Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Sharon Li, Jason E Weston, Ping Yu
Main category: cs.CL
TL;DR: HERO is a reinforcement learning framework that combines binary verifier signals with continuous reward model scores using stratified normalization and variance-aware weighting to improve reasoning in LLMs.
Details
Motivation: Binary verifier feedback is brittle and under-credits partially correct or alternative answers, limiting learning. Reward models offer richer continuous feedback but need integration with reliable verifier signals.Method: HERO integrates verifier signals with reward-model scores through stratified normalization (bounding RM scores within verifier groups) and variance-aware weighting (emphasizing challenging prompts).
Result: HERO consistently outperforms RM-only and verifier-only baselines across diverse mathematical reasoning benchmarks, with strong gains on both verifiable and hard-to-verify tasks.
Conclusion: Hybrid reward design retains verifier stability while leveraging reward model nuance to advance reasoning capabilities in LLMs.
Abstract: Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle–many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
[78] WUGNECTIVES: Novel Entity Inferences of Language Models from Discourse Connectives
Daniel Brubaker, William Sheffield, Junyi Jessy Li, Kanishka Misra
Main category: cs.CL
TL;DR: This paper investigates whether discourse connectives can help language models learn about the world, introducing WUGNECTIVES dataset to test LMs’ inferences about novel entities.
Details
Motivation: To flip the conventional premise and study if discourse connectives can inform language models about world knowledge, rather than just using world knowledge to predict connectives.Method: Created WUGNECTIVES dataset with 8,880 stimuli to evaluate LMs’ inferences about novel entities in connective-linked contexts, testing 17 different LMs at various scales and training regimens.
Result: Tuning LMs for reasoning behavior improved performance on most connectives, but all models struggled significantly with concessive connectives. Large variation in performance across connective types.
Conclusion: Findings enable more nuanced investigation of language cues’ functional role in LMs, with systematic challenges in handling concessive meanings.
Abstract: The role of world knowledge has been particularly crucial to predict the discourse connective that marks the discourse relation between two arguments, with language models (LMs) being generally successful at this task. We flip this premise in our work, and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present WUGNECTIVES, a dataset of 8,880 stimuli that evaluates LMs’ inferences about novel entities in contexts where connectives link the entities to particular attributes. On investigating 17 different LMs at various scales, and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was a large variation in LMs’ overall performance across connective type, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs. We release WUGNECTIVES at https://github.com/sheffwb/wugnectives.
[79] NarraBench: A Comprehensive Framework for Narrative Benchmarking
Sil Hamilton, Matthew Wilkens, Andrew Piper
Main category: cs.CL
TL;DR: NarraBench is a taxonomy and survey of 78 narrative-understanding benchmarks, revealing that only 27% of narrative tasks are well covered, with significant gaps in areas like events, style, perspective, and revelation.
Details
Motivation: To address the lack of comprehensive evaluation for narrative understanding in NLP, particularly for subjective and perspectival aspects where there's no single correct answer.Method: Developed a theory-informed taxonomy of narrative-understanding tasks and conducted a survey of 78 existing benchmarks in the area.
Result: Found that only 27% of narrative tasks are well captured by current benchmarks, with significant gaps in narrative events, style, perspective, and revelation. Identified need for benchmarks assessing subjective aspects.
Conclusion: The taxonomy, survey, and methodology provide valuable tools for NLP researchers to better evaluate LLM narrative understanding capabilities, highlighting critical areas needing new benchmark development.
Abstract: We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas – including narrative events, style, perspective, and revelation – are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.
[80] Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance
Jingyi Chen, Zhimeng Guo, Jiyun Chun, Pichao Wang, Andrew Perrault, Micha Elsner
Main category: cs.CL
TL;DR: LISTEN benchmark reveals that large audio language models (LALMs) primarily rely on lexical cues rather than acoustic information for emotion understanding, showing limited ability to process acoustic cues when lexical content is neutral or conflicting.
Details
Motivation: To determine whether LALMs genuinely process acoustic information or rely primarily on lexical content for emotion understanding from speech, given the need for sensitivity to both lexical and acoustic cues.Method: Developed LISTEN benchmark to disentangle lexical reliance from acoustic sensitivity through controlled evaluations of six state-of-the-art LALMs under various conditions including neutral lexical cues, cue alignment, cue conflict, and paralinguistic settings.
Result: Models consistently showed lexical dominance: predicted ’neutral’ when lexical cues were neutral/absent, limited gains with cue alignment, failed to classify distinct emotions under cue conflict, and approached chance performance in paralinguistic settings.
Conclusion: Current LALMs largely ’transcribe’ rather than ’listen,’ heavily relying on lexical semantics while underutilizing acoustic cues. LISTEN provides a principled framework for assessing emotion understanding in multimodal models.
Abstract: Understanding emotion from speech requires sensitivity to both lexical and acoustic cues. However, it remains unclear whether large audio language models (LALMs) genuinely process acoustic information or rely primarily on lexical content. We present LISTEN (Lexical vs. Acoustic Speech Test for Emotion in Narratives), a controlled benchmark designed to disentangle lexical reliance from acoustic sensitivity in emotion understanding. Across evaluations of six state-of-the-art LALMs, we observe a consistent lexical dominance. Models predict “neutral” when lexical cues are neutral or absent, show limited gains under cue alignment, and fail to classify distinct emotions under cue conflict. In paralinguistic settings, performance approaches chance. These results indicate that current LALMs largely “transcribe” rather than “listen,” relying heavily on lexical semantics while underutilizing acoustic cues. LISTEN offers a principled framework for assessing emotion understanding in multimodal models.
[81] Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation
Siheng Xiong, Ali Payani, Faramarz Fekri
Main category: cs.CL
TL;DR: MPPA framework addresses CoT derailment in small LMs by generating multiple candidate plans at variable intervals and aggregating them, combined with online Step-DPO for efficient stepwise supervision using TSMC.
Details
Motivation: Existing single-pass CoT generation leads to reasoning trajectory drift (CoT derailment), especially in smaller LMs with long chains due to limited capacity. Analysis shows most errors come from incorrect planning steps.Method: Multi-Path Plan Aggregation (MPPA) generates multiple candidate plans at variable token intervals and aggregates them. Uses base LM as primary policy with lightweight LoRA for aggregation. Online Step-DPO provides stepwise supervision using Twisted Sequential Monte Carlo.
Result: Outperforms DeepSeek-R1 distillation and outcome-reward RL baselines across math, science, and logical reasoning benchmarks using only 10% SFT data and 5% preference pairs.
Conclusion: MPPA with online Step-DPO effectively addresses CoT derailment through plan exploration and aggregation, enabling more stable and accurate reasoning in small LMs with long chains.
Abstract: Inference-time scaling enhances the reasoning ability of a language model (LM) by extending its chain-of-thought (CoT). However, existing approaches typically generate the entire reasoning chain in a single forward pass, which often leads to CoT derailment, i.e., the reasoning trajectory drifting off course due to compounding errors. This problem is particularly severe for smaller LMs with long CoTs due to their limited capacity. To address this, we analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps. Our analysis reveals that most reasoning errors stem from incorrect planning. Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation. Following a variable interval schedule based on the token position, MPPA generates multiple candidate plans and aggregates them into a refined planning step. To maintain efficiency, we adopt a minimal design in which the base LM serves as the primary policy, while a lightweight LoRA module implements the plan aggregation policy. We further observe that outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs. This yields more efficient training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks demonstrate that, with only 10% SFT data and 5% of preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline across multiple base models and tasks.
[82] Towards Inference-time Scaling for Continuous Space Reasoning
Minghan Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Main category: cs.CL
TL;DR: This paper investigates adapting inference-time scaling techniques (multiple sample generation with PRM/ORM re-ranking) from discrete text reasoning to continuous space reasoning using COCONUT LM. While feasible to generate diverse reasoning paths, the approach faces unique challenges in continuous space that limit performance gains.
Details
Motivation: To explore whether established inference-time scaling techniques that work well for text-based reasoning can be successfully adapted to reasoning in continuous space, using COCONUT continuous reasoning LM as the backbone.Method: Used dropout-based sampling to generate diverse reasoning paths in continuous space, conducted Pass@N analysis on generated samples, and probed geometric properties and trajectory dynamics to understand limitations of PRM/ORM discrimination in continuous space.
Result: The approach demonstrates feasibility of generating diverse reasoning paths and shows potential for performance gains similar to discrete space. However, working recipes from discrete space only yield marginal improvements in continuous space due to inability to effectively discriminate between correct and incorrect reasoning.
Conclusion: Current limitations stem from absence of key inductive biases in continuous thought representations. Training frameworks for continuous reasoning LMs need to not only optimize for accuracy but also explicitly incorporate inductive biases that enable discrimination of correct and incorrect thoughts during inference.
Abstract: Inference-time scaling through multiple sample generation in combination with Process- or Outcome-Reward Model (PRM or ORM) re-ranking has proven effective for text-based reasoning in large language models. This paper investigates whether such established techniques can be successfully adapted to reasoning in the continuous space, using COCONUT (Hao et al. 2024) continuous space reasoning LM as the backbone. We demonstrate the feasibility of generating diverse reasoning paths through dropout-based sampling. Our Pass@N analysis on the generated samples reveals the potential that could enable a significant gain in performance akin to observed gain in the discrete space. However, we highlight unique challenges faced for materializing this gain in the continuous thought space. In particular, working recipes for data generation and training PRM and ORM models in the discrete space unlocks only marginal improvements in the continuous space. Through probing various aspects including geometric properties and trajectory dynamics we identify the underlying reasons that prevent effective discrimination between correct and incorrect reasoning (essential for the functioning of PRM and ORM). Our findings reveal that current limitations stem from the absence of key inductive biases in continuous thought representations. We argue that the training frameworks for continuous reasoning LMs require not only to optimize for accuracy but also to explicitly incorporate inductive biases that could be utilized during inference-time for discrimination of correct and incorrect thoughts.\footnote{Our code and data will be publicly available.}
[83] What Layers When: Learning to Skip Compute in LLMs with Residual Gates
Filipe Laitenberger, Dawid Kopiczko, Cees G. M. Snoek, Yuki M. Asano
Main category: cs.CL
TL;DR: GateSkip is a residual-stream gating mechanism that enables token-wise layer skipping in decoder-only language models, achieving up to 15% compute savings while maintaining over 90% baseline accuracy.
Details
Motivation: To reduce computational costs in large language models by enabling selective skipping of less important tokens during inference, addressing the instability issues of early-exit and router-based methods that require extensive retraining.Method: Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch’s output before re-entering the residual stream. During inference, tokens are ranked by gate values and low-importance ones are skipped using a per-layer budget.
Result: On long-form reasoning tasks, GateSkip saves up to 15% compute while retaining over 90% of baseline accuracy. For larger models, the tradeoff improves significantly. On instruction-tuned models, it achieves accuracy gains at full compute and matches baseline quality with near 50% compute savings.
Conclusion: GateSkip provides a stable fine-tuning approach for pretrained models, offers insights into transformer information flow, and combines well with other optimization techniques like quantization, pruning, and self-speculative decoding.
Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch’s output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. For increasingly larger models, this tradeoff improves drastically. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
[84] Element2Vec: Build Chemical Element Representation from Text for Property Prediction
Yuanhao Li, Keyuan Lai, Tianqi Wang, Qihao Liu, Jiawei Ma, Yuan-Chao Hu
Main category: cs.CL
TL;DR: Element2Vec uses language models to generate embeddings from Wikipedia text for chemical elements, creating both general-purpose and attribute-specific vectors to predict element properties, with a test-time training method to improve accuracy.
Details
Motivation: Traditional methods for predicting chemical element properties fail to model complex relationships and cannot represent all characteristics as scalars. Existing AI approaches suffer from hallucinations and lack interpretability.Method: Uses language models to generate embeddings from Wikipedia text - both general-purpose (Global) and attribute-highlighted vectors (Local). Implements test-time training with self-attention to mitigate prediction errors from vanilla regression.
Result: The method addresses challenges of text distribution discrepancy between common and scientific texts, and limited data availability (only 118 known elements with sparse property data).
Conclusion: This work aims to advance AI-driven discovery in materials science by providing better representations of chemical elements from natural language text.
Abstract: Accurate property data for chemical elements is crucial for materials design and manufacturing, but many of them are difficult to measure directly due to equipment constraints. While traditional methods use the properties of other elements or related properties for prediction via numerical analyses, they often fail to model complex relationships. After all, not all characteristics can be represented as scalars. Recent efforts have been made to explore advanced AI tools such as language models for property estimation, but they still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vecto effectively represent chemical elements from natural languages to support research in the natural sciences. Given the text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Despite the complicated relationship across elements, the computational challenges also exist because of 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) the extremely limited data, i.e., with only 118 known elements, data for specific properties is often highly sparse and incomplete. Thus, we also design a test-time training method based on self-attention to mitigate the prediction error caused by Vanilla regression clearly. We hope this work could pave the way for advancing AI-driven discovery in materials science.
[85] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon
Main category: cs.CL
TL;DR: Fine-tuned AI models can generate literary text that experts prefer over human writing in both style fidelity and quality, while in-context prompting performs poorly. This has significant implications for copyright law’s fair-use considerations.
Details
Motivation: To determine if AI models can generate high-quality literary text that faithfully emulates authors' styles, addressing copyright concerns about AI-generated derivative content.Method: Preregistered study comparing MFA-trained expert writers with ChatGPT, Claude, and Gemini using both in-context prompting and fine-tuning on authors’ complete works. Evaluated by 159 expert and lay readers in blind pairwise comparisons.
Result: In-context prompting was strongly disfavored by experts for style fidelity and quality, but fine-tuning completely reversed these findings - experts preferred AI-generated text. Fine-tuned outputs were rarely detected as AI-generated (3% vs 97% for in-context).
Conclusion: Author-specific fine-tuning enables AI writing that readers prefer to expert human writing, dramatically reducing costs and providing evidence relevant to copyright’s fair-use analysis regarding market effects.
Abstract: The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI’s ability to generate derivative content. Yet it’s unclear if these models can generate high quality literary text while emulating authors’ styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude & Gemini in writing up to 450 word excerpts emulating 50 award-winning authors’ diverse styles. In blind pairwise evaluations by 159 representative expert & lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^-8) & writing quality (OR=0.13, p<10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors’ complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^-13) & writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors & styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning & inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright’s fourth fair-use factor, the “effect upon the potential market or value” of the source works.
[86] On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?
Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Yu Wang, Pinyan Lu
Main category: cs.CL
TL;DR: This paper investigates LLM resilience against character-level perturbations using UCC-Inj, a method that inserts invisible Unicode control characters to disrupt tokenization and discourage misuse, finding surprising robustness despite significant noise.
Details
Motivation: To study LLM resilience against structured character-level perturbations and develop methods to discourage LLM misuse in sensitive applications like online exam systems.Method: Introduces UCC-Inj, a practical method that inserts invisible Unicode control characters after each input character to fragment tokenization and reduce signal-to-noise ratio, then evaluates LLM performance across various configurations.
Result: Despite strong obfuscation that fragments tokenization and significantly reduces signal-to-noise ratio, many LLMs maintain notable performance, demonstrating unexpected robustness to character-level noise.
Conclusion: The findings reveal concerning low-level robustness of LLMs that could enable misuse, highlighting risks for deploying LLMs across diverse applications and the need for better safeguards.
Abstract: This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically through the insertion of noisy characters after each input character. We introduce UCC-Inj, a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. Surprisingly, despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance. Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both the handling of character-level tokenization and implicit versus explicit denoising mechanism hypotheses of character-level noises. We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.
[87] Intent Clustering with Shared Pseudo-Labels
I-Fan Lin, Faegheh Hasibi, Suzan Verberne
Main category: cs.CL
TL;DR: A training-free, label-free intent clustering method using lightweight LLMs that generates pseudo-labels for texts and performs multi-label classification, achieving comparable or better results than recent baselines.
Details
Motivation: To address limitations of current approaches that rely on costly commercial LLMs with limited transparency and require knowing the number of clusters in advance, which is often impractical in real-world settings.Method: Instead of direct similarity matching, the method first generates pseudo-labels for each text using LLMs, then performs multi-label classification in this pseudo-label space, leveraging the hypothesis that texts in the same cluster share more labels and will have closer embeddings.
Result: Evaluation on four benchmark sets shows the approach achieves comparable or better results than recent baselines while being simple and computationally efficient.
Conclusion: The method is applicable in low-resource scenarios and demonstrates stability across multiple models and datasets, offering a practical alternative to commercial LLM-based clustering approaches.
Abstract: In this paper, we propose an intuitive, training-free and label-free method for intent clustering that makes minimal assumptions using lightweight and open-source LLMs. Many current approaches rely on commercial LLMs, which are costly, and offer limited transparency. Additionally, their methods often explicitly depend on knowing the number of clusters in advance, which is often not the case in realistic settings. To address these challenges, instead of asking the LLM to match similar text directly, we first ask it to generate pseudo-labels for each text, and then perform multi-label classification in this pseudo-label set for each text. This approach is based on the hypothesis that texts belonging to the same cluster will share more labels, and will therefore be closer when encoded into embeddings. These pseudo-labels are more human-readable than direct similarity matches. Our evaluation on four benchmark sets shows that our approach achieves results comparable to and better than recent baselines, while remaining simple and computationally efficient. Our findings indicate that our method can be applied in low-resource scenarios and is stable across multiple models and datasets.
cs.CV
[88] GAZE:Governance-Aware pre-annotation for Zero-shot World Model Environments
Leela Krishna, Mengyang Zhao, Saicharithreddy Pasula, Harshit Rajgarhia, Abhishek Mukherji
Main category: cs.CV
TL;DR: GAZE pipeline automates conversion of raw 360-degree video into structured multimodal datasets for world-model training, using AI pre-annotation and human validation to achieve efficiency gains and privacy safeguards.
Details
Motivation: Manual annotation of large-scale multimodal datasets for robust world models is slow and expensive, creating a bottleneck in training data preparation.Method: Three-step pipeline: (i) normalize 360-degree video into standard views and shard for parallel processing, (ii) apply AI models for dense multimodal pre-annotation (scene understanding, object tracking, audio transcription, PII/NSFW/minor detection), (iii) consolidate signals into structured output for human validation with auto-skipping of low-salience segments.
Result: Achieved ~19 minutes saved per review hour, reduced human review volume by >80%, generated high-fidelity privacy-aware datasets with increased label density and consistency for cross-modal dynamics and action-conditioned prediction.
Conclusion: GAZE provides a scalable blueprint for generating high-quality world model training data without sacrificing throughput or governance, enabling efficient production of consumable datasets.
Abstract: Training robust world models requires large-scale, precisely labeled multimodal datasets, a process historically bottlenecked by slow and expensive manual annotation. We present a production-tested GAZE pipeline that automates the conversion of raw, long-form video into rich, task-ready supervision for world-model training. Our system (i) normalizes proprietary 360-degree formats into standard views and shards them for parallel processing; (ii) applies a suite of AI models (scene understanding, object tracking, audio transcription, PII/NSFW/minor detection) for dense, multimodal pre-annotation; and (iii) consolidates signals into a structured output specification for rapid human validation. The GAZE workflow demonstrably yields efficiency gains (~19 minutes saved per review hour) and reduces human review volume by >80% through conservative auto-skipping of low-salience segments. By increasing label density and consistency while integrating privacy safeguards and chain-of-custody metadata, our method generates high-fidelity, privacy-aware datasets directly consumable for learning cross-modal dynamics and action-conditioned prediction. We detail our orchestration, model choices, and data dictionary to provide a scalable blueprint for generating high-quality world model training data without sacrificing throughput or governance.
[89] PC-UNet: An Enforcing Poisson Statistics U-Net for Positron Emission Tomography Denoising
Yang Shi, Jingchao Wang, Liangsi Lu, Mingxuan Huang, Ruixin He, Yifeng Xie, Hanqian Liu, Minzhe Guo, Yangyang Liang, Weipeng Zhang, Zimeng Li, Xuhang Chen
Main category: cs.CV
TL;DR: PC-UNet with PVMC-Loss improves PET image denoising by incorporating physical data constraints, addressing Poisson noise from low-dose imaging while maintaining image fidelity.
Details
Motivation: PET imaging faces limitations due to high radiation doses; lowering doses increases Poisson noise that current denoising methods cannot handle without causing distortions and artifacts.
Method: Proposed Poisson Consistent U-Net (PC-UNet) with Poisson Variance and Mean Consistency Loss (PVMC-Loss) that incorporates physical data constraints and provides statistically unbiased variance and gradient adaptation.
Result: Tests on PET datasets show PC-UNet improves physical consistency and image fidelity, demonstrating effective integration of physical information.
Conclusion: PC-UNet with PVMC-Loss successfully addresses Poisson noise in low-dose PET imaging while maintaining image quality through physical data constraints.
Abstract: Positron Emission Tomography (PET) is crucial in medicine, but its clinical use is limited because the high doses needed for an adequate signal-to-noise ratio increase radiation exposure. Lowering doses increases Poisson noise, which current denoising methods fail to handle, causing distortions and artifacts. We propose a Poisson Consistent U-Net (PC-UNet) model with a new Poisson Variance and Mean Consistency Loss (PVMC-Loss) that incorporates physical data constraints to improve image fidelity. PVMC-Loss is statistically unbiased in variance and gradient adaptation, acting as a Generalized Method of Moments implementation, offering robustness to minor data mismatches. Tests on PET datasets show PC-UNet improves physical consistency and image fidelity, proving its ability to integrate physical information effectively.
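The abstract does not give the exact form of PVMC-Loss, but the underlying Poisson identity Var[y] = E[y] suggests a penalty of the following shape. This is a sketch under that assumption, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pvmc_style_loss(pred, noisy, kernel=7):
    """Sketch of a Poisson variance-mean consistency penalty for (N,1,H,W)
    images. For Poisson counts, Var[y] = E[y], so the local variance of the
    residual (noisy - pred) should match the predicted local intensity,
    and the residual itself should be locally zero-mean."""
    pad = kernel // 2
    box = torch.ones(1, 1, kernel, kernel, device=pred.device) / kernel ** 2
    resid = noisy - pred
    mean_r = F.conv2d(resid, box, padding=pad)              # local residual mean
    var_r = F.conv2d(resid ** 2, box, padding=pad) - mean_r ** 2
    mean_p = F.conv2d(pred, box, padding=pad).clamp(min=0)  # local intensity
    return mean_r.pow(2).mean() + (var_r - mean_p).pow(2).mean()

# Typical usage would pair the statistical penalty with a data-fidelity term:
# loss = F.mse_loss(pred, noisy) + 0.1 * pvmc_style_loss(pred, noisy)
```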
[90] DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models
Mor Ventura, Michael Toker, Or Patashnik, Yonatan Belinkov, Roi Reichart
Main category: cs.CV
TL;DR: DeLeaker is a lightweight, optimization-free inference-time method that mitigates semantic leakage in T2I models by dynamically reweighting attention maps to suppress cross-entity interactions while preserving entity identity.
Details
Motivation: T2I models suffer from semantic leakage where semantically related features unintentionally transfer between distinct entities, and existing mitigation strategies are often optimization-based or require external inputs.
Method: DeLeaker intervenes directly on the model’s attention maps during diffusion, dynamically reweighting them to suppress excessive cross-entity interactions while strengthening each entity’s identity. The paper also introduces the SLIM dataset for systematic evaluation.
Result: DeLeaker consistently outperforms all baselines, even those with external information, achieving effective leakage mitigation without compromising fidelity or quality.
Conclusion: Attention control is valuable for semantic precision in T2I models, and DeLeaker paves the way for more semantically accurate image generation.
Abstract: Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model’s attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
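A stripped-down version of the reweighting step might look as follows. The entity assignments are given explicitly here, whereas DeLeaker derives them from the model's own attention maps during sampling, so treat this as an assumed interface rather than the paper's method.

```python
import torch

def reweight_cross_entity_attention(attn, pixel_entity, token_entity,
                                    suppress=0.5, boost=1.5):
    """Sketch of cross-entity attention reweighting.
    attn: (num_pixels, num_text_tokens) cross-attention weights.
    pixel_entity / token_entity: entity id per pixel / text token (-1 = none).
    Cross-entity entries are damped, same-entity entries amplified, and each
    row is renormalized so it remains a valid attention distribution."""
    same = pixel_entity.unsqueeze(1).eq(token_entity.unsqueeze(0))
    tagged = (pixel_entity.unsqueeze(1) >= 0) & (token_entity.unsqueeze(0) >= 0)
    scale = torch.ones_like(attn)
    scale[tagged & same] = boost       # strengthen each entity's own identity
    scale[tagged & ~same] = suppress   # suppress feature leakage across entities
    attn = attn * scale
    return attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
```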
[91] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
Mingxuan Liu, Honglin He, Elisa Ricci, Wayne Wu, Bolei Zhou
Main category: cs.CV
TL;DR: UrbanVerse is a data-driven system that converts city-tour videos into physics-aware simulation scenes for training urban embodied AI agents, achieving improved navigation performance and strong sim-to-real transfer.
Details
Motivation: Existing simulation environments for urban AI agents lack scalability and fail to capture real-world complexity, limiting training effectiveness for real-world deployment.
Method: UrbanVerse consists of UrbanVerse-100K (100k+ annotated urban 3D assets) and UrbanVerse-Gen (automatic pipeline that extracts scene layouts from videos and instantiates 3D simulations using retrieved assets).
Result: The system preserves real-world semantics and layouts with human-evaluated realism comparable to manually crafted scenes. Navigation policies show scaling power laws and improve success by +6.3% in simulation and +30.1% in sim-to-real transfer, completing a 300m real-world mission with only two interventions.
Conclusion: UrbanVerse enables scalable, high-fidelity urban simulation for embodied AI training, demonstrating strong generalization and real-world deployment capabilities.
Abstract: Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared with prior methods, accomplishing a 300 m real-world mission with only two interventions.
[92] NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks
Junliang Ye, Shenghao Xie, Ruowen Zhao, Zhengyi Wang, Hongyu Yan, Wenqiang Zu, Lei Ma, Jun Zhu
Main category: cs.CV
TL;DR: Nano3D is a training-free framework for precise 3D object editing that integrates FlowEdit with TRELLIS and introduces region-aware merging strategies (Voxel/Slat-Merge) to preserve structural fidelity without requiring masks.
Details
Motivation: Current 3D editing methods are inefficient, inconsistent, and often fail to preserve unedited regions. Most rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality.
Method: Integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and introduces region-aware merging strategies (Voxel/Slat-Merge) that adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas.
Result: Nano3D achieves superior 3D consistency and visual quality compared with existing methods. The authors also constructed Nano3D-Edit-100k, the first large-scale 3D editing dataset with over 100,000 high-quality 3D editing pairs.
Conclusion: This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models.
Abstract: 3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose Nano3D, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing dataset, Nano3D-Edit-100k, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models. Project Page: https://jamesyjl.github.io/Nano3D
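The merging idea can be pictured with a small sketch, with one important caveat: Nano3D is mask-free and determines edit regions adaptively, whereas here the region is passed in explicitly as a soft mask purely for illustration.

```python
import torch
import torch.nn.functional as F

def voxel_merge_sketch(original, edited, edit_region, feather=1):
    """Illustrative region-aware merge of two voxel feature grids (assumed
    form, not the released Voxel/Slat-Merge). Edited features are kept
    inside the edit region; the original grid is preserved elsewhere so
    unedited structure stays intact.
    original, edited: (C, D, H, W) grids; edit_region: (D, H, W) in [0, 1]."""
    m = edit_region.unsqueeze(0).unsqueeze(0)  # -> (1, 1, D, H, W)
    if feather > 0:
        # Soften the boundary so edited and unedited regions blend smoothly.
        m = F.avg_pool3d(m, 2 * feather + 1, stride=1, padding=feather)
    m = m.squeeze(0)                           # -> (1, D, H, W), broadcasts over C
    return m * edited + (1 - m) * original
```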
[93] ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents
Tingyu Lin, Marco Peer, Florian Kleber, Robert Sablatnig
Main category: cs.CV
TL;DR: ClapperText is a benchmark dataset for handwritten and printed text recognition in degraded, low-resource settings, derived from WWII-era archival video clapperboards containing production metadata.
Details
Motivation: To address challenges in historical document analysis where structured content appears in degraded, non-standard forms, including motion blur, handwriting variation, and cluttered backgrounds.
Method: Created from 127 WWII archival video segments with clapperboards, providing 9,813 annotated frames and 94,573 word-level text instances with transcription, semantic category, text type, and occlusion status annotations.
Result: Benchmarked six recognition and seven detection models showing substantial performance gains with fine-tuning despite small training set (18 videos), demonstrating suitability for few-shot learning.
Conclusion: ClapperText provides a realistic, culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts.
Abstract: This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, 67% of which are handwritten and 1,566 are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText’s suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at https://github.com/linty5/ClapperText.
[94] Constantly Improving Image Models Need Constantly Improving Benchmarks
Jiaxin Ge, Grace Luo, Heekyung Lee, Nishant Malpani, Long Lian, XuDong Wang, Aleksander Holynski, Trevor Darrell, Sewon Min, David M. Chan
Main category: cs.CV
TL;DR: ECHO is a framework for creating image generation benchmarks from real-world social media posts, revealing novel capabilities and improving model evaluation.
Details
Motivation: Existing benchmarks lag behind rapid advances in image generation and fail to capture emerging real-world use cases, creating a gap between community perceptions and formal evaluation.
Method: Construct benchmarks directly from social media posts showcasing novel prompts and user judgments, collecting over 31,000 prompts from GPT-4o Image Gen posts.
Result: ECHO discovers creative tasks absent from existing benchmarks, better distinguishes state-of-the-art models, and surfaces community feedback for metric design.
Conclusion: ECHO provides a framework for more responsive and relevant evaluation of image generation models by leveraging real-world evidence of model use.
Abstract: Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure). Our website is at https://echo-bench.github.io.
[95] DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification
Tingyu Lin, Armin Dadras, Florian Kleber, Robert Sablatnig
Main category: cs.CV
TL;DR: DGME-T enhances Video Swin Transformer with directional grid motion encoding to improve camera movement classification on both modern and archival film footage, achieving significant accuracy gains.
Details
Motivation: Camera movement classification models trained on modern footage degrade when applied to archival film due to noise, missing frames, and low contrast that obscure motion cues.
Method: Introduces DGME-T, a lightweight extension to Video Swin Transformer that injects directional grid motion encoding derived from optical flow via a learnable normalized late-fusion layer.
Result: DGME-T raises top-1 accuracy from 81.78% to 86.14% and macro F1 from 82.08% to 87.81% on modern clips, while improving WWII footage from 83.43% to 84.62% accuracy and 81.72% to 82.63% macro F1.
Conclusion: Structured motion priors and transformer representations are complementary, and even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis.
Abstract: Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone’s top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at https://github.com/linty5/DGME-T.
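The encoding itself is easy to picture: pool the optical-flow field into a coarse grid of magnitude-weighted direction histograms. The sketch below is one plausible reading of "directional grid motion encoding", with the grid size and bin count chosen arbitrarily rather than taken from the paper.

```python
import numpy as np

def directional_grid_motion_encoding(flow, grid=4, bins=8):
    """Sketch of a directional grid motion encoding (assumed form).
    flow: (H, W, 2) optical-flow field of per-pixel (dx, dy) displacements.
    Returns a (grid*grid*bins,) descriptor: each cell holds a
    magnitude-weighted histogram over quantized flow directions."""
    H, W, _ = flow.shape
    angle = np.arctan2(flow[..., 1], flow[..., 0])           # in [-pi, pi]
    mag = np.linalg.norm(flow, axis=-1)
    bin_idx = ((angle + np.pi) / (2 * np.pi) * bins).astype(int) % bins
    feat = np.zeros((grid, grid, bins), dtype=np.float32)
    for gy in range(grid):
        for gx in range(grid):
            ys = slice(gy * H // grid, (gy + 1) * H // grid)
            xs = slice(gx * W // grid, (gx + 1) * W // grid)
            np.add.at(feat[gy, gx], bin_idx[ys, xs].ravel(), mag[ys, xs].ravel())
    return feat.ravel() / (feat.sum() + 1e-8)  # normalized before late fusion
```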
[96] LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models
Mert Sonmezer, Matthew Zheng, Pinar Yanardag
Main category: cs.CV
TL;DR: A submodular framework is proposed to select diverse and relevant LoRA adapters from large databases, addressing challenges in navigating and utilizing the vast number of available models.
Details
Motivation: Users struggle to navigate and select suitable LoRA adapters from over 100K models on platforms like Civit.ai due to volume, diversity, and lack of structured organization.
Method: Framing the task as a combinatorial optimization problem and proposing a novel submodular framework for selecting relevant and diverse LoRA models.
Result: Quantitative and qualitative experiments show the method generates diverse outputs across a wide range of domains.
Conclusion: The proposed submodular framework effectively addresses the challenge of selecting optimal LoRA adapters from large databases, enabling better utilization of personalized diffusion models.
Abstract: Low-rank Adaptation (LoRA) models have revolutionized the personalization of pre-trained diffusion models by enabling fine-tuning through low-rank, factorized weight matrices specifically optimized for attention layers. These models facilitate the generation of highly customized content across a variety of objects, individuals, and artistic styles without the need for extensive retraining. Despite the availability of over 100K LoRA adapters on platforms like Civit.ai, users often face challenges in navigating, selecting, and effectively utilizing the most suitable adapters due to their sheer volume, diversity, and lack of structured organization. This paper addresses the problem of selecting the most relevant and diverse LoRA models from this vast database by framing the task as a combinatorial optimization problem and proposing a novel submodular framework. Our quantitative and qualitative experiments demonstrate that our method generates diverse outputs across a wide range of domains.
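Submodular subset selection of this kind is typically optimized greedily. The sketch below uses a facility-location objective, a common submodular surrogate for "relevant and diverse" retrieval; the paper's exact objective may differ. For monotone submodular functions, greedy selection carries the classic (1 - 1/e) approximation guarantee.

```python
import numpy as np

def greedy_facility_location(sim, k):
    """Greedy maximization of f(S) = sum_i max_{j in S} sim[i, j]
    (a standard submodular surrogate, assumed here rather than taken from
    the paper). sim: (n, n) pairwise LoRA similarity matrix."""
    n = sim.shape[0]
    selected, covered = [], np.zeros(n)
    for _ in range(k):
        # Marginal gain of each candidate: how much it improves the best
        # coverage of every item beyond what the current set provides.
        gains = np.maximum(sim, covered[:, None]).sum(axis=0) - covered.sum()
        if selected:
            gains[selected] = -np.inf  # never re-pick a selected adapter
        j = int(np.argmax(gains))
        selected.append(j)
        covered = np.maximum(covered, sim[:, j])
    return selected
```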
[97] MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
Mattia Segu, Marta Tintore Gazulla, Yongqin Xian, Luc Van Gool, Federico Tombari
Main category: cs.CV
TL;DR: MOBIUS is a family of foundation models for universal instance segmentation designed for efficient deployment on resource-constrained platforms, achieving significant computational reductions while maintaining state-of-the-art performance.
Details
Motivation: Current foundation models for instance-level perception have high computational costs that limit adoption on resource-constrained platforms like mobile devices, despite their excellent in-domain and zero-shot performance.
Method: Proposes three key innovations: (i) bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) streamlined, unified training strategy.
Result: Reduces pixel and transformer decoder FLOPs by up to 55% and 75% respectively while maintaining state-of-the-art performance, achieving these results in just one-third of the training iterations compared to efficient baselines.
Conclusion: MOBIUS establishes a new benchmark for efficient segmentation across both high-performance computing platforms and mobile devices, enabling Pareto-optimal downscaling without compromising performance.
Abstract: Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
[98] Composition-Grounded Instruction Synthesis for Visual Reasoning
Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He
Main category: cs.CV
TL;DR: COGS is a data-efficient framework that equips MLLMs with advanced reasoning abilities for artificial image domains by decomposing seed questions into primitive factors and systematically recomposing them to generate synthetic training data.
Details
Motivation: Pretrained MLLMs lack reasoning capabilities for domains with scarce annotations like charts, documents, and webpages, despite the abundance of such artificial image domains in practice.
Method: Decompose seed questions into primitive perception and reasoning factors, then systematically recompose them with new images to generate synthetic question-answer pairs with subquestions and intermediate answers for reinforcement learning.
Result: COGS substantially improves performance on unseen chart reasoning questions, especially on reasoning-heavy and compositional questions, and enables better cross-dataset transfer through factor-level mixture training.
Conclusion: The framework induces generalizable reasoning capabilities rather than dataset-specific overfitting and extends beyond charts to other domains like webpages.
Abstract: Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
[99] Generalized Dynamics Generation towards Scannable Physical World Model
Yichen Li, Zhiyi Li, Brandon Feng, Dinghuai Zhang, Antonio Torralba
Main category: cs.CV
TL;DR: GDGen is a framework that unifies rigid body, articulated body, and soft body dynamics using a potential energy perspective, treating the world as one holistic entity and inferring physical properties from motion observations.
Details
Motivation: To develop generalist embodied agents in digital twin worlds with realistic interactive dynamics by creating a unified system that integrates diverse physical behaviors in scannable environments.
Method: Takes a potential energy perspective, extends classic elastodynamics with directional stiffness, uses a specialized network for material properties, and employs a neural field for geometry-agnostic deformation representation.
Result: GDGen robustly unifies diverse simulation paradigms and offers a versatile foundation for interactive virtual environments and training robotic agents in complex, dynamically rich scenarios.
Conclusion: The framework successfully integrates multiple physical dynamics systems into a unified, geometry-agnostic approach, enabling realistic interactive environments for embodied agent development.
Abstract: Digital twin worlds with realistic interactive dynamics present a new opportunity to develop generalist embodied agents in scannable environments with complex physical behaviors. To this end, we present GDGen (Generalized Representation for Generalized Dynamics Generation), a framework that takes a potential energy perspective to seamlessly integrate rigid body, articulated body, and soft body dynamics into a unified, geometry-agnostic system. GDGen operates from the governing principle that the potential energy for any stable physical system should be low. This fresh perspective allows us to treat the world as one holistic entity and infer underlying physical properties from simple motion observations. We extend classic elastodynamics by introducing directional stiffness to capture a broad spectrum of physical behaviors, covering soft elastic, articulated, and rigid body systems. We propose a specialized network to model the extended material property and employ a neural field to represent deformation in a geometry-agnostic manner. Extensive experiments demonstrate that GDGen robustly unifies diverse simulation paradigms, offering a versatile foundation for creating interactive virtual environments and training robotic agents in complex, dynamically rich scenarios.
[100] Comprehensive language-image pre-training for 3D medical image understanding
Tassilo Wald, Ibrahim Ethem Hamamci, Yuan Gao, Sam Bond-Taylor, Harshita Sharma, Maximilian Ilse, Cynthia Lo, Olesya Melnichenko, Noel C. F. Codella, Maria Teodora Wetscherek, Klaus H. Maier-Hein, Panagiotis Korfiatis, Valentina Salvatelli, Javier Alvarez-Valle, Fernando Pérez-García
Main category: cs.CV
TL;DR: The paper introduces COLIPRI, a vision-language pre-training method for 3D medical images that addresses data scarcity by incorporating report generation objectives and combining vision-language with vision-only pre-training.
Details
Motivation: Current 3D vision-language encoders in medical imaging are limited by data availability, which restricts their capabilities for tasks like abnormality retrieval and prediction.
Method: Developed COLIPRI encoder family by injecting inductive biases through report generation objectives and pairing vision-language pre-training with vision-only pre-training, leveraging both image-only and paired image-text 3D datasets.
Result: COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, while remaining competitive for semantic segmentation.
Conclusion: The proposed approach effectively addresses data scarcity in 3D medical imaging by combining multiple pre-training strategies and inductive biases, leading to superior performance across various medical imaging tasks.
Abstract: Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.
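In code, the combination of objectives reduces to a contrastive alignment term plus a report-generation term. The sketch below shows that shape with an assumed weighting; it is not COLIPRI's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired volume/report embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def joint_objective(img_emb, txt_emb, report_logits, report_tokens, alpha=1.0):
    """Sketch of a vision-language objective with a report-generation
    inductive bias (alpha is an assumed weighting). report_logits: (B, L, V)
    next-token predictions; report_tokens: (B, L) gold report token ids."""
    gen = F.cross_entropy(report_logits.flatten(0, 1), report_tokens.flatten())
    return contrastive_loss(img_emb, txt_emb) + alpha * gen
```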
[101] Directional Reasoning Injection for Fine-Tuning MLLMs
Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu
Main category: cs.CV
TL;DR: DRIFT is a lightweight method that transfers reasoning knowledge from text-only LLMs to MLLMs through gradient-space injection, improving multimodal reasoning without resource-intensive training.
Details
Motivation: MLLMs lag behind text-only LLMs in reasoning ability, and existing methods like supervised fine-tuning or reinforcement learning are resource-intensive. Model merging shows inconsistent results across different model families.
Method: DRIFT precomputes a reasoning prior as parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning, preserving multimodal alignment while transferring reasoning knowledge.
Result: DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning on benchmarks like MathVista and MathVerse, matching or surpassing training-heavy methods at much lower cost.
Conclusion: DRIFT provides an efficient and effective approach to enhance MLLM reasoning capabilities without destabilizing multimodal alignment, offering a practical alternative to resource-intensive training methods.
Abstract: Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a “free lunch”: its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
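The two-step structure described in the abstract translates almost directly into code. The update rule below, a plain gradient offset, is an assumed simplification of what DRIFT actually does, and `strength` is a hypothetical hyperparameter.

```python
import torch

@torch.no_grad()
def reasoning_prior(reasoning_model, multimodal_model):
    """Parameter-space difference between the reasoning-enhanced LLM and
    the multimodal variant, precomputed once before fine-tuning."""
    return {n: p_r - p_m
            for (n, p_r), (_, p_m) in zip(reasoning_model.named_parameters(),
                                          multimodal_model.named_parameters())}

def apply_drift_bias(model, prior, strength=0.1):
    """Call between loss.backward() and optimizer.step(). Subtracting the
    prior from each gradient steers the descent direction (-grad) toward
    the reasoning variant while the usual multimodal loss is optimized."""
    for name, p in model.named_parameters():
        if p.grad is not None and name in prior:
            p.grad.add_(prior[name], alpha=-strength)
```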
[102] A solution to generalized learning from small training sets found in everyday infant experiences
Frangil Ramirez, Elizabeth Clerkin, David J. Crandall, Linda B. Smith
Main category: cs.CV
TL;DR: Infant visual experiences have a “lumpy” similarity structure that helps them learn object categories from limited data, and mimicking this structure improves machine learning generalization.
Details
Motivation: To understand how infants learn object categories from limited visual experiences, even though robust learning typically requires large datasets.
Method: Analyzed egocentric images from 14 infants (7-11 months) and conducted computational experiments mimicking the lumpy similarity structure found in infant visual input.
Result: Infant visual input shows clusters of highly similar images interspersed with more variable ones across early-learned categories. Mimicking this structure improves machine learning generalization from small datasets.
Conclusion: The natural lumpy structure of infant visual experiences supports early category learning and offers principles for efficient learning across various problems and learners.
Abstract: Young children readily recognize and generalize visual objects labeled by common nouns, suggesting that these basic-level object categories may be given. Yet if they are, how they arise remains unclear. We propose that the answer lies in the statistics of infants’ daily-life visual experiences. Whereas large and diverse datasets typically support robust learning and generalization in human and machine learning, infants achieve this generalization from limited experiences. We suggest that the resolution of this apparent contradiction lies in the visual diversity of daily life: repeated experiences with single object instances. Analyzing egocentric images from 14 infants (aged 7 to 11 months) we show that their everyday visual input exhibits a lumpy similarity structure, with clusters of highly similar images interspersed with rarer, more variable ones, across eight early-learned categories. Computational experiments show that mimicking this structure in machines improves generalization from small datasets in machine learning. The natural lumpiness of infant experience may thus support early category learning and generalization and, more broadly, offer principles for efficient learning across a variety of problems and kinds of learners.
[103] SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images
Jiaxin Guo, Tongfan Guan, Wenzhen Dong, Wenzhao Zheng, Wenting Wang, Yue Wang, Yeung Yam, Yun-Hui Liu
Main category: cs.CV
TL;DR: SaLon3R is a novel framework for structure-aware, long-term 3D Gaussian Splatting reconstruction that eliminates redundancy through anchor primitives and resolves geometric inconsistencies using a 3D Point Transformer.
Details
Motivation: Existing 3DGS methods predict per-pixel Gaussians and combine all views, leading to substantial redundancies and geometric inconsistencies in long-duration video sequences.
Method: Uses compact anchor primitives with differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame inconsistencies.
Result: Achieves reconstruction of over 50 views at over 10 FPS with 50-90% redundancy removal, demonstrating state-of-the-art performance on novel view synthesis and depth estimation.
Conclusion: The approach effectively resolves artifacts and prunes redundant 3DGS in a single feed-forward pass without known camera parameters or test-time optimization, showing superior efficiency and generalization for long-term 3D reconstruction.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled generalizable, on-the-fly reconstruction of sequential input views. However, existing methods often predict per-pixel Gaussians and combine Gaussians from all views as the scene representation, leading to substantial redundancies and geometric inconsistencies in long-duration video sequences. To address this, we propose SaLon3R, a novel framework for Structure-aware, Long-term 3DGS Reconstruction. To the best of our knowledge, SaLon3R is the first online generalizable GS method capable of reconstructing over 50 views at over 10 FPS, with 50% to 90% redundancy removal. Our method introduces compact anchor primitives to eliminate redundancy through differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame geometric and photometric inconsistencies. Specifically, we first leverage a 3D reconstruction backbone to predict dense per-pixel Gaussians and a saliency map encoding regional geometric complexity. Redundant Gaussians are compressed into compact anchors by prioritizing high-complexity regions. The 3D Point Transformer then learns spatial structural priors in 3D space from training data to refine anchor attributes and saliency, enabling regionally adaptive Gaussian decoding for geometric fidelity. Without known camera parameters or test-time optimization, our approach effectively resolves artifacts and prunes the redundant 3DGS in a single feed-forward pass. Experiments on multiple datasets demonstrate our state-of-the-art performance on both novel view synthesis and depth estimation, showing superior efficiency, robustness, and generalization ability for long-term generalizable 3D reconstruction. Project Page: https://wrld.github.io/SaLon3R/.
[104] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Bo Liu, Yiding Yang, Guang Chen, Longyin Wen, Alan Yuille, Chongyang Ma
Main category: cs.CV
TL;DR: TGT is a text-to-video generation framework that uses point trajectories paired with localized text descriptions to precisely control subject composition and motion in complex multi-object scenes.
Details
Motivation: Standard text-to-video methods have limited control over subject composition, especially in complex multi-object scenarios. Existing approaches using bounding boxes or segmentation masks struggle with precision and lack clear entity-trajectory correspondence.
Method: Proposes Text-Grounded Trajectories (TGT) framework with Location-Aware Cross-Attention (LACA) to integrate trajectory and text signals, dual-CFG scheme for separate local/global text guidance, and a data processing pipeline producing trajectories with localized descriptions from annotated video clips.
Result: TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared to prior approaches, enabling intuitive control of appearance and motion through point trajectories.
Conclusion: The TGT framework successfully addresses limitations in multi-object video generation by using text-grounded trajectories as intuitive motion handles, providing precise control over both appearance and motion in complex scenes.
Abstract: Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
[105] Deep generative priors for 3D brain analysis
Ana Lawry Aguila, Dina Zemlyanker, You Cheng, Sudeshna Das, Daniel C. Alexander, Oula Puonti, Annabel Sorby-Adams, W. Taylor Kimberly, Juan Eugenio Iglesias
Main category: cs.CV
TL;DR: This paper presents a novel framework that combines diffusion models with Bayesian inverse problems to solve medical imaging tasks, using diffusion priors trained on brain MRI data for tasks like super-resolution and inpainting.
Details
Motivation: To bridge the gap between data-driven diffusion models and domain knowledge in medical imaging, addressing limitations of classical mathematical priors in capturing complex brain anatomy while leveraging the strengths of Bayesian inverse problems.
Method: Uses score-based diffusion priors trained on diverse brain MRI data, paired with flexible forward models for various image processing tasks including super-resolution, bias field correction, and inpainting.
Result: Achieves state-of-the-art performance on heterogeneous clinical and research MRI data, producing consistent high-quality solutions without requiring paired training datasets, and can refine outputs from existing deep learning methods.
Conclusion: Demonstrates the potential of diffusion priors as versatile tools for brain MRI analysis, successfully combining data-driven models with domain knowledge for robust performance.
Abstract: Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.
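Measurement-guided sampling with a diffusion prior usually follows a posterior-sampling pattern like the one sketched below. The `predict_x0` and `reverse_step` interfaces are assumptions for illustration, and the paper's actual sampler may differ.

```python
import torch

def guided_reverse_step(x_t, t, model, forward_op, y, step_size=1.0):
    """One measurement-guided reverse-diffusion step (DPS-style sketch).
    forward_op maps a clean-image estimate into observation space (e.g.,
    blurring plus downsampling for super-resolution); y is the measurement."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = model.predict_x0(x_t, t)             # assumed model interface
    data_err = (forward_op(x0_hat) - y).pow(2).sum()
    grad = torch.autograd.grad(data_err, x_t)[0]
    # Unconditional reverse step from the prior, then a correction pulling
    # the sample toward consistency with the observed data.
    x_prev = model.reverse_step(x_t.detach(), t)  # assumed model interface
    return x_prev - step_size * grad
```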
[106] PAD: Phase-Amplitude Decoupling Fusion for Multi-Modal Land Cover Classification
Huiling Zheng, Xian Zhong, Bin Liu, Yi Xiao, Bihan Wen, Xiaofeng Li
Main category: cs.CV
TL;DR: Proposes PAD, a frequency-aware framework that decouples phase (shared) and amplitude (complementary) components in Fourier domain for SAR-RGB fusion, achieving state-of-the-art land cover classification.
Details
Motivation: Address modality heterogeneity and underexploited spectral complementarity in SAR-RGB fusion, where existing methods fail to decouple shared structural features from complementary radiometric attributes.
Method: Phase-Amplitude Decoupling (PAD) with Phase Spectrum Correction (PSC) for geometric consistency and Amplitude Spectrum Fusion (ASF) with frequency-adaptive MLPs to integrate SAR’s morphological sensitivity and RGB’s spectral richness.
Result: Extensive experiments on WHU-OPT-SAR and DDHR-SK datasets demonstrate state-of-the-art performance in land cover classification.
Conclusion: Establishes a new paradigm for physics-aware multi-modal fusion in remote sensing by explicitly leveraging frequency domain properties for improved feature representation.
Abstract: The fusion of Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification remains challenging due to modality heterogeneity and underexploited spectral complementarity. Existing approaches often fail to decouple shared structural features from modality-complementary radiometric attributes, resulting in feature conflicts and information loss. To address this, we propose Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase (modality-shared) and amplitude (modality-complementary) components in the Fourier domain. This design reinforces shared structures while preserving complementary characteristics, thereby enhancing fusion quality. Unlike previous methods that overlook the distinct physical properties encoded in frequency spectra, PAD explicitly introduces amplitude-phase decoupling for multi-modal fusion. Specifically, PAD comprises two key components: 1) Phase Spectrum Correction (PSC), which aligns cross-modal phase features via convolution-guided scaling to improve geometric consistency; and 2) Amplitude Spectrum Fusion (ASF), which dynamically integrates high- and low-frequency patterns using frequency-adaptive multilayer perceptrons, effectively exploiting SAR’s morphological sensitivity and RGB’s spectral richness. Extensive experiments on WHU-OPT-SAR and DDHR-SK demonstrate state-of-the-art performance. This work establishes a new paradigm for physics-aware multi-modal fusion in remote sensing. The code will be available at https://github.com/RanFeng2/PAD.
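The decoupling itself is a few lines of FFT arithmetic. The sketch below shows the split and recomposition only; the PSC/ASF fusion modules are the paper's contribution and are not reproduced here.

```python
import torch

def phase_amplitude_split(x):
    """Split a (B, C, H, W) real feature map into Fourier amplitude
    (modality-complementary) and phase (modality-shared) components."""
    spec = torch.fft.fft2(x)
    return spec.abs(), spec.angle()

def recompose(amplitude, phase):
    """Rebuild a spatial feature map from (possibly fused) components."""
    return torch.fft.ifft2(torch.polar(amplitude, phase)).real

# E.g., pairing SAR amplitude with RGB phase probes what each part encodes:
# hybrid = recompose(amp_sar, phase_rgb)
```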
[107] Fourier Transform Multiple Instance Learning for Whole Slide Image Classification
Anthony Bilic, Guangyu Sun, Ming Li, Md Sanzid Bin Hossain, Yu Tian, Wei Zhang, Laura Brattain, Dexter Hadley, Chen Chen
Main category: cs.CV
TL;DR: FFT-MIL enhances WSI classification by adding a frequency-domain branch to capture global context, improving performance across multiple MIL methods and datasets.
Details
Motivation: Existing MIL methods for WSI classification struggle with global dependencies due to large image sizes and local patch embeddings, limiting robust diagnostic predictions.
Method: Proposes FFT-MIL framework with frequency-domain branch using Fast Fourier Transform to extract low-frequency crops, processed through FFT-Block with convolutional layers and Min-Max normalization, then fused with spatial features.
Result: Integration improved macro F1 scores by 3.51% and AUC by 1.51% across six MIL methods on three datasets (BRACS, LUAD, IMP), showing consistent gains.
Conclusion: Frequency-domain learning effectively captures global dependencies in WSI classification, complementing spatial features and advancing MIL-based computational pathology.
Abstract: Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.
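The low-frequency crop is the one step concrete enough to sketch from the abstract alone; the crop size and the real/imaginary channel stacking below are assumptions.

```python
import torch

def low_frequency_crop(image, keep=64):
    """Extract a compact low-frequency view of a (C, H, W) image. After
    fftshift the low frequencies sit at the spectrum's center, so a small
    center crop summarizes global slide structure at a fraction of the size."""
    spec = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    _, H, W = spec.shape
    cy, cx = H // 2, W // 2
    h = keep // 2
    crop = spec[:, cy - h:cy + h, cx - h:cx + h]
    feat = torch.cat([crop.real, crop.imag], dim=0)
    # Min-Max normalization tames the high variance of frequency data,
    # as the abstract notes, before the FFT-Block's convolutions.
    return (feat - feat.min()) / (feat.max() - feat.min() + 1e-8)
```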
[108] XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang, Ximeng Sun, Jialian Wu, Alan Yuille, Emad Barsoum, Zicheng Liu
Main category: cs.CV
TL;DR: XModBench is a tri-modal benchmark for evaluating cross-modal consistency in omni-modal LLMs, revealing that current models struggle with spatial/temporal reasoning, exhibit modality disparities, and show directional imbalance.
Details
Motivation: Existing benchmarks focus on general cross-modal QA but do not assess whether OLLMs achieve true modality-invariant reasoning or suffer from modality-specific biases.
Method: Created XModBench with 60,828 multiple-choice questions spanning 5 task families, systematically covering all 6 modality compositions in question-answer pairs for fine-grained diagnosis.
Result: Even top model Gemini 2.5 Pro shows: (i) <60% accuracy on spatial/temporal reasoning, (ii) performance drops with audio vs text, (iii) lower consistency when vision serves as context vs text.
Conclusion: Current OLLMs are far from truly modality-invariant reasoning, and XModBench serves as a fundamental diagnostic tool for evaluating and improving cross-modal competence.
Abstract: Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM’s modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
[109] Train a Unified Multimodal Data Quality Classifier with Synthetic Data
Weizhi Wang, Rongmei Lin, Shiyang Li, Colin Lockard, Ritesh Sarkhel, Sanket Lokegaonkar, Jingbo Shang, Xifeng Yan, Nasser Zalmout, Xian Li
Main category: cs.CV
TL;DR: UniFilter is a unified multimodal data quality classifier that filters high-quality image-text caption and interleaved data for MLLM pre-training, using semi-synthetic data generation and improving downstream performance.
Details
Motivation: High-quality data filtering for image-text interleaved document data in MLLMs is under-explored, and collecting diverse labeled multimodal data is challenging.
Method: Train an efficient MLLM as a Unified Multimodal Data Quality Classifier using a semi-synthetic approach with raw images and generated text across four quality levels, then apply it to filter DataComp captions and OBELICS interleaved data.
Result: MLLMs pre-trained on UniFilter-curated data show significantly enhanced zero-shot reasoning, in-context learning capabilities, and stronger performance on various benchmarks after visual supervised fine-tuning.
Conclusion: UniFilter enables effective curation of high-quality multimodal data, improving MLLM performance, and the authors release training data, model checkpoints, and OBELICS-HQ dataset to the community.
Abstract: The Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, while the high-quality data filtering towards image-text interleaved document data is under-explored. We propose to train an efficient MLLM as a Unified Multimodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.
[110] Hyperparameter Optimization and Reproducibility in Deep Learning Model Training
Usman Afzaal, Ziyu Su, Usama Sajjad, Hao Lu, Mostafa Rezapour, Metin Nafi Gurcan, Muhammad Khalid Khan Niazi
Main category: cs.CV
TL;DR: This paper investigates reproducibility challenges in histopathology foundation model training, identifying optimal hyperparameter ranges and training configurations through systematic experiments on CLIP models trained on QUILT-1M dataset.
Details
Motivation: Reproducibility remains a critical challenge in foundation model training for histopathology due to software randomness, hardware non-determinism, and inconsistent hyperparameter reporting.
Method: Trained a CLIP model on the QUILT-1M dataset and systematically evaluated the impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon).
Result: Identified clear trends: RandomResizedCrop values of 0.7-0.8 performed best, distributed training without local loss improved stability, learning rates below 5.0e-5 degraded performance, and LC25000 (Colon) dataset provided the most reproducible benchmark.
Conclusion: Reproducibility in computational pathology depends on both transparent documentation and carefully chosen experimental configurations, with practical rules provided to guide future reproducible foundation model development for digital pathology.
Abstract: Reproducibility remains a critical challenge in foundation model training for histopathology, often hindered by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting. To investigate these issues, we trained a CLIP model on the QUILT-1M dataset and systematically evaluated the impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon). Despite variability across runs, we identified clear trends: RandomResizedCrop values of 0.7-0.8 outperformed more aggressive (0.6) or conservative (0.9) settings, distributed training without local loss improved stability, and learning rates below 5.0e-5 consistently degraded performance across all datasets. The LC25000 (Colon) dataset consistently provided the most reproducible benchmark. These findings highlight that reproducibility in computational pathology depends not only on transparent documentation but also on carefully chosen experimental configurations, and we provide practical rules to guide future efforts in developing reproducible foundation models for digital pathology.
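The reported trends translate directly into a training configuration. Below is a minimal sketch, assuming the RandomResizedCrop values refer to the lower bound of the crop's scale range and using standard PyTorch seeding for software-level determinism; all names and other values are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of a training configuration reflecting the reported trends.
# The crop scale range and learning rate floor are the paper's findings; all
# other names and values are illustrative assumptions, not the authors' code.
import random

import numpy as np
import torch
from torchvision import transforms

def seed_everything(seed: int = 42) -> None:
    """Pin the common software sources of randomness for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade kernel speed for determinism to reduce hardware non-determinism.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)

# The best-performing RandomResizedCrop range: lower scale bound of 0.7-0.8.
train_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.ToTensor(),
])

# Learning rates below 5.0e-5 consistently degraded downstream performance.
LEARNING_RATE = 5.0e-5
```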
[111] Salient Concept-Aware Generative Data Augmentation
Tianchen Zhao, Xuanbai Chen, Zhihua Li, Jun Fang, Dongsheng An, Xiang Xu, Zhuowen Tu, Yifan Xing
Main category: cs.CV
TL;DR: A personalized image generation framework that uses salient concept-aware image embeddings to balance fidelity and diversity in data augmentation, improving downstream model robustness.
Details
Motivation: Existing generative data augmentation methods struggle to balance fidelity and diversity, as they often preserve non-essential image attributes that conflict with text prompts intended to modify those elements.
Method: Proposes a framework using a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during synthesis, maintaining better alignment between image and text inputs.
Result: Outperforms state-of-the-art augmentation methods across eight fine-grained vision datasets with 0.73% and 6.5% accuracy improvements under conventional and long-tail settings respectively.
Conclusion: The framework effectively enhances training dataset diversity while preserving class-discriminative features, improving downstream model robustness in both conventional and long-tail scenarios.
Abstract: Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes such as environmental contexts, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during the synthesis process, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with average classification accuracy improvements of 0.73% and 6.5% under conventional and long-tail settings, respectively.
[112] CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records
Daniela Vega, Hannah V. Ceballos, Javier S. Vera, Santiago Rodriguez, Alejandra Perez, Angela Castillo, Maria Escobar, Dario Londoño, Luis A. Sarmiento, Camila I. Castro, Nadiezhda Rodriguez, Juan C. Briceño, Pablo Arbeláez
Main category: cs.CV
TL;DR: CARDIUM is the first public multimodal dataset for prenatal CHD detection, combining fetal ultrasound/echocardiographic images with maternal clinical records. A cross-attention transformer architecture improves CHD detection by 11-50% over single-modality approaches.
Details
Motivation: Prenatal CHD diagnosis faces challenges from imbalanced, low-quality datasets and a lack of multimodal integration, limiting AI model performance and clinical decision support.
Method: Proposed a robust multimodal transformer architecture with a cross-attention mechanism to fuse feature representations from both image (ultrasound/echocardiographic) and tabular (clinical records) data.
Result: Achieved 11% improvement over image-only and 50% improvement over tabular-only approaches, with F1 score of 79.8 ± 4.8% on the CARDIUM dataset.
Conclusion: The CARDIUM dataset and multimodal transformer enable significant improvements in prenatal CHD detection, with public release to encourage further research in this field.
Abstract: Prenatal diagnosis of Congenital Heart Diseases (CHDs) holds great potential for Artificial Intelligence (AI)-driven solutions. However, collecting high-quality diagnostic data remains difficult due to the rarity of these conditions, resulting in imbalanced and low-quality datasets that hinder model performance. Moreover, no public efforts have been made to integrate multiple sources of information, such as imaging and clinical data, further limiting the ability of AI models to support and enhance clinical decision-making. To overcome these challenges, we introduce the Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records (CARDIUM) dataset, the first publicly available multimodal dataset consolidating fetal ultrasound and echocardiographic images along with maternal clinical records for prenatal CHD detection. Furthermore, we propose a robust multimodal transformer architecture that incorporates a cross-attention mechanism to fuse feature representations from image and tabular data, improving CHD detection by 11% and 50% over image and tabular single-modality approaches, respectively, and achieving an F1 score of 79.8 $\pm$ 4.8% in the CARDIUM dataset. We will publicly release our dataset and code to encourage further research on this unexplored field. Our dataset and code are available at https://github.com/BCVUniandes/Cardium, and at the project website https://bcv-uniandes.github.io/CardiumPage/
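Cross-attention fusion of image and tabular features can be pictured in a few lines of PyTorch. A minimal sketch follows; the dimensions, the single tabular token, and the binary head are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal sketch of cross-attention fusion between image tokens and a tabular
# clinical record. Dimensions, the single tabular token, and the binary head are
# illustrative assumptions, not the authors' exact architecture.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, tab_dim: int = 16):
        super().__init__()
        self.tab_proj = nn.Linear(tab_dim, dim)  # lift the clinical record to token space
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, 1)            # binary CHD-detection logit

    def forward(self, img_tokens: torch.Tensor, tab_feats: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, dim) from an image encoder; tab_feats: (B, tab_dim)
        tab_token = self.tab_proj(tab_feats).unsqueeze(1)         # (B, 1, dim)
        fused, _ = self.attn(tab_token, img_tokens, img_tokens)   # tabular queries image
        return self.head(fused.squeeze(1))                        # (B, 1)

logits = CrossModalFusion()(torch.randn(2, 49, 256), torch.randn(2, 16))
```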
[113] The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads
Aysan Aghazadeh, Adriana Kovashka
Main category: cs.CV
TL;DR: The paper investigates demographic bias in AI-generated ads, examining how different ad topics show bias and how identical ads with varying gender/race portrayals have different persuasiveness levels. It also explores country-specific ad targeting techniques.
Details
Motivation: To understand the potential of text-to-image models for customized visual advertising while examining demographic biases and persuasiveness disparities in AI-generated ads.
Method: Analyzed demographic bias across different ad topics, tested persuasiveness of identical ads with varying gender/race portrayals using model judgments, and experimented with country-specific ad targeting techniques.
Result: Found demographic bias in ads for different topics and disparate persuasiveness levels for identical ads with different demographic portrayals. Developed a technique for country-specific ad targeting.
Conclusion: Text-to-image models show promise for customized advertising but exhibit demographic biases that need addressing. Country-specific targeting techniques can help create more effective localized ads.
Abstract: Text-to-image models are appealing for customizing visual advertisements and targeting specific populations. We investigate this potential by examining the demographic bias within ads for different ad topics, and the disparate level of persuasiveness (judged by models) of ads that are identical except for gender/race of the people portrayed. We also experiment with a technique to target ads for specific countries. The code is available at https://github.com/aysanaghazadeh/FaceOfPersuasion
[114] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion
Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Guanghong Jia, Jiwen Lu
Main category: cs.CV
TL;DR: DriveGen3D is a framework for generating controllable dynamic 3D driving scenes by combining accelerated video generation with 3D reconstruction, addressing limitations in computational demands and temporal coherence.
Details
Motivation: Current methods for driving scene synthesis have prohibitive computational costs for long-term generation, focus only on 2D video without 3D representation, or are limited to static single-scene reconstruction.
Method: Uses FastDrive-DiT (efficient video diffusion transformer for high-resolution video synthesis under text and BEV layout guidance) and FastRecon3D (feed-forward reconstruction module for building 3D Gaussian representations across time).
Result: Enables real-time generation of extended driving videos (424×800 at 12 FPS) with SSIM of 0.811 and PSNR of 22.84 on novel view synthesis while maintaining parameter efficiency.
Conclusion: DriveGen3D successfully bridges the gap between long-term video generation and dynamic 3D scene reconstruction, providing highly controllable and computationally efficient synthesis of driving scenes.
Abstract: We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird’s-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to $424\times800$ at 12 FPS) and corresponding dynamic 3D scenes, achieving SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, all while maintaining parameter efficiency.
[115] CuSfM: CUDA-Accelerated Structure-from-Motion
Jingrui Yu, Jun Liu, Kefei Ren, Joydeep Biswas, Rurui Ye, Keqiang Wu, Chirag Majithia, Di Zeng
Main category: cs.CV
TL;DR: cuSfM is a CUDA-accelerated offline Structure-from-Motion system that uses GPU parallelization to achieve high accuracy and speed in camera pose estimation, outperforming COLMAP while maintaining precision for offline applications.
Details
Motivation: To address the need for efficient and accurate camera pose estimation in autonomous navigation, robotic perception, and virtual simulation systems, where computational resources can be fully utilized for maximum accuracy in offline processing.
Method: Leverages GPU parallelization through CUDA acceleration to efficiently employ computationally intensive feature extractors, generating comprehensive and non-redundant data associations for precise camera pose estimation and globally consistent mapping.
Result: cuSfM achieves significantly improved accuracy and processing speed compared to COLMAP across various testing scenarios, while maintaining high precision and global consistency essential for offline SfM applications.
Conclusion: The system provides an effective solution for offline SfM with superior performance, and is released as an open-source Python wrapper (PyCuSfM) to facilitate research and applications in computer vision and robotics.
Abstract: Efficient and accurate camera pose estimation forms the foundational requirement for dense reconstruction in autonomous navigation, robotic perception, and virtual simulation systems. This paper addresses the challenge via cuSfM, a CUDA-accelerated offline Structure-from-Motion system that leverages GPU parallelization to efficiently employ computationally intensive yet highly accurate feature extractors, generating comprehensive and non-redundant data associations for precise camera pose estimation and globally consistent mapping. The system supports pose optimization, mapping, prior-map localization, and extrinsic refinement. It is designed for offline processing, where computational resources can be fully utilized to maximize accuracy. Experimental results demonstrate that cuSfM achieves significantly improved accuracy and processing speed compared to the widely used COLMAP method across various testing scenarios, while maintaining the high precision and global consistency essential for offline SfM applications. The system is released as an open-source Python wrapper implementation, PyCuSfM, available at https://github.com/nvidia-isaac/pyCuSFM, to facilitate research and applications in computer vision and robotics.
[116] Post-Processing Methods for Improving Accuracy in MRI Inpainting
Nishad Kulkarni, Krithika Iyer, Austin Tapp, Abhijeet Parida, Daniel Capellán-Martín, Zhifan Jiang, María J. Ledesma-Carbayo, Syed Muhammad Anwar, Marius George Linguraru
Main category: cs.CV
TL;DR: The paper proposes a pipeline combining model ensembling with post-processing strategies and a U-Net enhancement stage to improve brain MRI inpainting for tumor regions, enabling better application of automated analysis tools.
Details
Motivation: Standard MRI analysis tools fail with large lesions like tumors, so inpainting techniques are needed to synthesize healthy tissues in tumor areas to make these tools work reliably.
Method: Combines model ensembling with post-processing (median filtering, histogram matching, pixel averaging) and adds a lightweight U-Net enhancement stage for anatomical refinement.
Result: The proposed pipeline improves anatomical plausibility and visual fidelity of inpainted regions, yielding higher accuracy and more robust outcomes than individual baseline models.
Conclusion: By combining established models with targeted post-processing, the approach achieves improved and more accessible inpainting outcomes, supporting broader clinical deployment and sustainable research.
Abstract: Magnetic Resonance Imaging (MRI) is the primary imaging modality used in the diagnosis, assessment, and treatment planning for brain pathologies. However, most automated MRI analysis tools, such as segmentation and registration pipelines, are optimized for healthy anatomies and often fail when confronted with large lesions such as tumors. To overcome this, image inpainting techniques aim to locally synthesize healthy brain tissues in tumor regions, enabling the reliable application of general-purpose tools. In this work, we systematically evaluate state-of-the-art inpainting models and observe a saturation in their standalone performance. In response, we introduce a methodology combining model ensembling with efficient post-processing strategies such as median filtering, histogram matching, and pixel averaging. Further anatomical refinement is achieved via a lightweight U-Net enhancement stage. Comprehensive evaluation demonstrates that our proposed pipeline improves the anatomical plausibility and visual fidelity of inpainted regions, yielding higher accuracy and more robust outcomes than individual baseline models. By combining established models with targeted post-processing, we achieve improved and more accessible inpainting outcomes, supporting broader clinical deployment and sustainable, resource-conscious research. Our 2025 BraTS inpainting docker is available at https://hub.docker.com/layers/aparida12/brats2025/inpt.
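The described post-processing chain maps onto standard scientific-Python primitives. A minimal sketch, with the filter size, step ordering, and reference choice as assumptions:

```python
# A minimal sketch of the described post-processing chain: pixel averaging
# across an ensemble, median filtering, and histogram matching. Filter size,
# ordering, and the reference choice are illustrative assumptions.
import numpy as np
from scipy.ndimage import median_filter
from skimage.exposure import match_histograms

def postprocess_inpaintings(model_outputs: list, reference: np.ndarray) -> np.ndarray:
    """model_outputs: inpainted slices from several models; reference:
    surrounding healthy tissue used as an intensity reference."""
    averaged = np.mean(np.stack(model_outputs, axis=0), axis=0)  # 1) pixel averaging
    smoothed = median_filter(averaged, size=3)                   # 2) median filtering
    return match_histograms(smoothed, reference)                 # 3) histogram matching

outputs = [np.random.rand(64, 64) for _ in range(3)]
result = postprocess_inpaintings(outputs, np.random.rand(64, 64))
```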
[117] QCFace: Image Quality Control for boosting Face Representation & Recognition
Duc-Phuong Doan-Ngo, Thanh-Dang Diep, Thanh Nguyen-Duc, Thanh-Sach LE, Nam Thoai
Main category: cs.CV
TL;DR: QCFace introduces a hard margin strategy to decouple recognizability from identity representation in face recognition, overcoming mutual overlapping gradient problems and achieving state-of-the-art performance.
Details
Motivation: Current face recognition methods have two main drawbacks: (1) recognizability is only partially captured through soft margin constraints, leading to weaker quality representation and lower discrimination for low-quality faces; (2) mutual overlapping gradients between feature direction and magnitude cause instability and entangled representations where recognizability and identity are not cleanly separated.
Method: Proposes Quality Control Face (QCFace) - a hard margin strategy that overcomes mutual overlapping gradient problems. Uses a novel hard-margin-based loss function with a guidance factor for hypersphere planning, simultaneously optimizing for recognition ability and explicit recognizability representation.
Result: Extensive experiments confirm that QCFace provides robust and quantifiable recognizability encoding and achieves state-of-the-art performance in both verification and identification benchmarks compared to existing recognizability-based losses.
Conclusion: The hard margin strategy in QCFace successfully decouples recognizability from identity representation, enabling clear separation and improved performance in face recognition systems.
Abstract: Recognizability, a key perceptual factor in human face processing, strongly affects the performance of face recognition (FR) systems in both verification and identification tasks. Effectively using recognizability to enhance feature representation remains challenging. In deep FR, the loss function plays a crucial role in shaping how features are embedded. However, current methods have two main drawbacks: (i) recognizability is only partially captured through soft margin constraints, resulting in weaker quality representation and lower discrimination, especially for low-quality or ambiguous faces; (ii) mutual overlapping gradients between feature direction and magnitude introduce undesirable interactions during optimization, causing instability and confusion in hypersphere planning, which may result in poor generalization, and entangled representations where recognizability and identity are not cleanly separated. To address these issues, we introduce a hard margin strategy - Quality Control Face (QCFace), which overcomes the mutual overlapping gradient problem and enables the clear decoupling of recognizability from identity representation. Based on this strategy, a novel hard-margin-based loss function employs a guidance factor for hypersphere planning, simultaneously optimizing for recognition ability and explicit recognizability representation. Extensive experiments confirm that QCFace not only provides robust and quantifiable recognizability encoding but also achieves state-of-the-art performance in both verification and identification benchmarks compared to existing recognizability-based losses.
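QCFace's exact loss is not spelled out in the summary. For orientation only, the following is a generic additive hard-margin cosine loss (ArcFace-style), illustrating what a fixed ("hard") angular margin on the hypersphere looks like; it is a point of reference, not the authors' formulation.

```python
# A generic additive hard-margin cosine loss (ArcFace-style), shown only to
# illustrate the idea of a fixed angular margin on the hypersphere; this is
# not QCFace's actual loss.
import torch
import torch.nn.functional as F

def hard_margin_cosine_loss(features, class_weights, labels, margin=0.5, scale=64.0):
    # Normalize both sides so logits are cosines on the unit hypersphere.
    cos = F.normalize(features) @ F.normalize(class_weights).t()   # (B, C)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, cos.size(1)).bool()
    # Add a fixed angular margin to the target class only, then rescale.
    logits = torch.where(target, torch.cos(theta + margin), cos) * scale
    return F.cross_entropy(logits, labels)

loss = hard_margin_cosine_loss(torch.randn(8, 128), torch.randn(10, 128),
                               torch.randint(0, 10, (8,)))
```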
[118] Hyperbolic Structured Classification for Robust Single Positive Multi-label Learning
Yiming Lin, Shang Wang, Junkai Zhou, Qiufeng Wang, Xiao-Bo Jin, Kaizhu Huang
Main category: cs.CV
TL;DR: Proposes a hyperbolic classification framework for Single Positive Multi-Label Learning (SPMLL) using hyperbolic balls to model label relationships, achieving competitive performance with improved interpretability.
Details
Motivation: Address limitations in SPMLL where existing methods lack explicit geometric definitions for different relationship types and struggle to capture complex label relationships and hierarchical structures with only single positive annotations.
Method: Represents each label as a hyperbolic ball rather than a point or vector, enabling geometric ball interactions to capture inclusion (hierarchical), overlap (co-occurrence), and separation (independence) relationships. Introduces a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization.
Result: Extensive experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011) show competitive performance with superior interpretability. Statistical analysis reveals strong correlation between learned embeddings and real-world co-occurrence patterns.
Conclusion: Establishes hyperbolic geometry as a more robust paradigm for structured classification under incomplete supervision, effectively modeling complex label relationships in SPMLL scenarios.
Abstract: Single Positive Multi-Label Learning (SPMLL) addresses the challenging scenario where each training sample is annotated with only one positive label despite potentially belonging to multiple categories, making it difficult to capture complex label relationships and hierarchical structures. Existing methods implicitly model label relationships through distance-based similarity but lack explicit geometric definitions for different relationship types. To address these limitations, we propose the first hyperbolic classification framework for SPMLL that represents each label as a hyperbolic ball rather than a point or vector, enabling rich inter-label relationship modeling through geometric ball interactions. Our ball-based approach naturally captures multiple relationship types simultaneously: inclusion for hierarchical structures, overlap for co-occurrence patterns, and separation for semantic independence. Further, we introduce two key component innovations: a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization that guides balls toward meaningful configurations. To validate our approach, extensive experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011) demonstrate competitive performance with superior interpretability compared to existing methods. Furthermore, statistical analysis reveals strong correlation between learned embeddings and real-world co-occurrence patterns, establishing hyperbolic geometry as a more robust paradigm for structured classification under incomplete supervision.
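To make the ball-based geometry concrete, here is a minimal sketch of scoring an embedding against per-label balls in the Poincaré model; the distance formula is the standard hyperbolic one, while the fixed centers and radii stand in for the learnable parameters of the paper.

```python
# A minimal sketch of scoring an embedding against per-label balls in the
# Poincare ball model. The distance formula is the standard hyperbolic one;
# fixed centers and radii stand in for the learnable parameters of the paper.
import torch

def poincare_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # d(x, y) = arcosh(1 + 2 * ||x - y||^2 / ((1 - ||x||^2) * (1 - ||y||^2)))
    sq = (x - y).pow(2).sum(-1)
    denom = (1 - x.pow(2).sum(-1)) * (1 - y.pow(2).sum(-1))
    return torch.acosh(1 + 2 * sq / denom.clamp_min(1e-9))

def ball_scores(embedding: torch.Tensor, centers: torch.Tensor,
                radii: torch.Tensor) -> torch.Tensor:
    """Positive score: the embedding falls inside that label's ball."""
    return radii - poincare_distance(embedding.unsqueeze(0), centers)

emb = torch.rand(2) * 0.5            # a point inside the unit ball
centers = torch.rand(4, 2) * 0.5     # one ball per label (4 labels, 2-D)
scores = ball_scores(emb, centers, torch.full((4,), 1.0))
```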
[119] Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille
Main category: cs.CV
TL;DR: Spatial457 is a new synthetic dataset and benchmark for evaluating 6D spatial reasoning in large multimodal models, revealing significant performance degradation in 3D and 6D spatial tasks.
Details
Motivation: Existing benchmarks focus on 2D spatial understanding and lack comprehensive evaluation of 3D and 6D spatial reasoning capabilities in large multimodal models.
Method: Created the Spatial457 dataset covering 4 key spatial reasoning capabilities, with a cascading evaluation structure of 7 question types across 5 difficulty levels, from basic recognition to complex 6D spatial reasoning tasks.
Result: Models show performance decline as task complexity increases, especially in 3D reasoning and 6D spatial tasks. Introduced RPDR metric to quantify challenges and uncovered prediction biases across attributes.
Conclusion: Current LMMs have significant limitations in complex 3D and 6D spatial reasoning, and the Spatial457 benchmark provides a comprehensive framework to evaluate and improve these capabilities.
Abstract: Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed with 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on Spatial457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings. The code and data are released at https://github.com/XingruiWang/Spatial457.
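The summary names RPDR without defining it. One natural reading (an assumption, not taken from the paper) is the relative drop from a base difficulty level to a harder one:

```python
# The summary names RPDR but not its formula. One natural reading (an
# assumption, not taken from the paper) is the relative drop from a base
# difficulty level to a harder one.
def rpdr(base_accuracy: float, hard_accuracy: float) -> float:
    """Relative Performance Dropping Rate between two difficulty levels."""
    return (base_accuracy - hard_accuracy) / base_accuracy

print(rpdr(0.90, 0.54))  # 0.4: the model loses 40% of its base performance
```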
[120] Latent Diffusion Model without Variational Autoencoder
Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu
Main category: cs.CV
TL;DR: SVG introduces a novel latent diffusion model that replaces VAEs with self-supervised DINO features, creating semantically structured latent spaces for more efficient training and better generative quality.
Details
Motivation: VAE+diffusion models suffer from limited training efficiency, slow inference, and poor transferability due to VAE latent spaces lacking clear semantic separation and discriminative structure.
Method: SVG constructs a feature space using frozen DINO features for semantic discriminability, with a lightweight residual branch for fine-grained details. Diffusion models are trained directly on this structured latent space.
Result: SVG enables accelerated diffusion training, supports few-step sampling, improves generative quality, and preserves semantic and discriminative capabilities of self-supervised representations.
Conclusion: SVG provides a principled pathway toward task-general, high-quality visual representations by leveraging semantically structured latent spaces without VAEs.
Abstract: Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.
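The core recipe, diffusion trained directly on frozen self-supervised features, can be sketched as follows. The DINOv2 variant, the tiny MLP denoiser, and the linear noising schedule are simplifying assumptions, not the paper's exact setup.

```python
# A minimal sketch of training a diffusion model directly on frozen DINO
# features. The DINOv2 variant, the tiny MLP denoiser, and the linear noising
# schedule are simplifying assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()
for p in encoder.parameters():
    p.requires_grad_(False)  # the representation model stays frozen

denoiser = nn.Sequential(nn.Linear(384 + 1, 512), nn.SiLU(), nn.Linear(512, 384))

def diffusion_step(images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        z = encoder(images)                # (B, 384) semantically structured latents
    t = torch.rand(z.size(0), 1)           # continuous noise level in [0, 1)
    noise = torch.randn_like(z)
    z_t = (1 - t) * z + t * noise          # simple linear noising schedule
    pred = denoiser(torch.cat([z_t, t], dim=-1))
    return ((pred - noise) ** 2).mean()    # train the denoiser to predict noise

loss = diffusion_step(torch.randn(4, 3, 224, 224))
```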
[121] End-to-End Multi-Modal Diffusion Mamba
Chunhao Lu, Qiang Lu, Meichen Dong, Jake Luo
Main category: cs.CV
TL;DR: MDM (Multi-modal Diffusion Mamba) is a unified architecture that uses a Mamba-based diffusion model with a shared variational autoencoder to process multiple modalities, achieving superior performance in high-dimensional data generation and multi-modal tasks.
Details
Motivation: Current end-to-end multi-modal models use separate encoders and decoders for different modalities, which hinders joint representation learning. The authors aim to unify multi-modal processing through a single architecture.
Method: MDM employs a Mamba-based multi-step selection diffusion model that progressively generates and refines modality-specific information through a unified variational autoencoder for both encoding and decoding.
Result: MDM significantly outperforms existing end-to-end models (MonoFormer, LlamaGen, Chameleon) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral in image generation, captioning, visual QA, text comprehension, and reasoning tasks.
Conclusion: MDM effectively unifies multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
Abstract: Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (MonoFormer, LlamaGen, Chameleon, etc.) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM’s effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
[122] Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding
Main category: cs.CV
TL;DR: CoMe is a progressive layer pruning framework that addresses limitations in structured pruning of large language models through concatenation-based layer merging and hierarchical distillation, achieving state-of-the-art performance with minimal accuracy loss.
Details
Motivation: Large Language Models have massive computational and storage demands, and existing structured pruning methods suffer from performance degradation, incompetent weight layer aggregation, and a lack of effective post-training recovery mechanisms.
Method: Progressive layer pruning with a channel sensitivity metric based on activation intensity and weight norms, concatenation-based layer merging to fuse critical channels across adjacent layers, and a hierarchical distillation protocol that leverages layer correspondences for knowledge transfer.
Result: On seven benchmarks, CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b’s parameters, the pruned model retains 83% of its original average accuracy.
Conclusion: CoMe effectively addresses key limitations in structured pruning through its progressive framework, concatenation-based merging, and hierarchical distillation, enabling significant model size reduction while maintaining high performance.
Abstract: Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to overlook preserving the capabilities of the pruned layers. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b’s parameters, the pruned model retains 83% of its original average accuracy. Our code is available at https://github.com/MPI-Lab/CoMe.
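The channel sensitivity metric and concatenation-based merging can be illustrated compactly. A minimal sketch, with shapes and the top-k selection rule as assumptions:

```python
# A minimal sketch of the channel-sensitivity idea: score channels by activation
# intensity times weight norm, keep the most critical channels of two adjacent
# layers, and fuse them by concatenation. Shapes and top-k rule are assumptions.
import torch

def channel_sensitivity(weight: torch.Tensor, activations: torch.Tensor) -> torch.Tensor:
    # weight: (out, in); activations: (num_calibration_samples, out)
    act_intensity = activations.abs().mean(dim=0)  # per-channel activation intensity
    weight_norm = weight.norm(dim=1)               # per-channel weight norm
    return act_intensity * weight_norm

def merge_adjacent(w1, a1, w2, a2, keep: int) -> torch.Tensor:
    """Keep the top-`keep` channels of each adjacent layer, then concatenate."""
    idx1 = channel_sensitivity(w1, a1).topk(keep).indices
    idx2 = channel_sensitivity(w2, a2).topk(keep).indices
    return torch.cat([w1[idx1], w2[idx2]], dim=0)  # (2 * keep, in)

merged = merge_adjacent(torch.randn(128, 64), torch.randn(32, 128),
                        torch.randn(128, 64), torch.randn(32, 128), keep=64)
```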
[123] Proto-Former: Unified Facial Landmark Detection by Prototype Transformer
Shengkai Hu, Haozhe Qi, Jun Wan, Jiaxing Huang, Lefei Zhang, Hang Sun, Dacheng Tao
Main category: cs.CV
TL;DR: Proto-Former is a unified facial landmark detection framework that enables joint training across multiple datasets with different landmark definitions through adaptive prototype learning and expert selection mechanisms.
Details
Motivation: Existing facial landmark detection methods are limited to single-dataset training due to different landmark definitions across datasets, which hinders model generalization and prevents development of unified models.
Method: Proto-Former consists of an Adaptive Prototype-Aware Encoder (APAE) for feature extraction and prototype learning, a Progressive Prototype-Aware Decoder (PPAD) for prototype refinement and prompt generation, and a Prototype-Aware (PA) loss for stable expert selection and gradient conflict resolution.
Result: Extensive experiments on benchmark datasets show Proto-Former achieves superior performance compared to state-of-the-art methods, demonstrating effective multi-dataset training capability.
Conclusion: Proto-Former successfully addresses the limitations of single-dataset training in facial landmark detection by enabling unified multi-dataset training through prototype-based representation learning and stable expert selection mechanisms.
Abstract: Recent advances in deep learning have significantly improved facial landmark detection. However, existing facial landmark detection datasets often define different numbers of landmarks, and most mainstream methods can only be trained on a single dataset. This limits model generalization across datasets and hinders the development of a unified model. To address this issue, we propose Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework that explicitly enhances dataset-specific facial structural representations (i.e., prototypes). Proto-Former overcomes the limitations of single-dataset training by enabling joint training across multiple datasets within a unified architecture. Specifically, Proto-Former comprises two key components: an Adaptive Prototype-Aware Encoder (APAE) that performs adaptive feature extraction and learns prototype representations, and a Progressive Prototype-Aware Decoder (PPAD) that refines these prototypes to generate prompts that guide the model’s attention to key facial regions. Furthermore, we introduce a novel Prototype-Aware (PA) loss, which achieves optimal path finding by constraining the selection weights of prototype experts. This loss function effectively resolves the instability of prototype-expert addressing during multi-dataset training, alleviates gradient conflicts, and enables the extraction of more accurate facial structure features. Extensive experiments on widely used benchmark datasets demonstrate that our Proto-Former achieves superior performance compared to existing state-of-the-art methods. The code is publicly available at: https://github.com/Husk021118/Proto-Former.
[124] SHARE: Scene-Human Aligned Reconstruction
Joshua Li, Brendan Chharawala, Chang Shu, Xue Bin Peng, Pengcheng Xi
Main category: cs.CV
TL;DR: SHARE is a method that uses scene geometry to improve 3D human motion reconstruction from monocular RGB videos, achieving more accurate human placement in 3D space.
Details
Motivation: Current human motion reconstruction methods struggle with accurately placing humans in 3D space, which is important for realistic character interactions in gaming, AR/VR, and robotics applications.
Method: SHARE estimates a human mesh and segmentation mask for each frame and creates scene point maps at keyframes, then iteratively refines human positions by comparing the human mesh against human point maps extracted from the scene using the masks. It preserves relative root joint positions between keyframes and non-keyframes during optimization.
Result: Extensive experiments show that SHARE outperforms existing methods, enabling more accurate 3D human placement while reconstructing surrounding scenes on both curated datasets and in-the-wild web videos.
Conclusion: SHARE successfully leverages scene geometry’s spatial cues to ground human motion reconstruction more accurately, facilitating realistic character interactions in various applications.
Abstract: Animating realistic character interactions with the surrounding environment is important for autonomous agents in gaming, AR/VR, and robotics. However, current methods for human motion reconstruction struggle with accurately placing humans in 3D space. We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry’s inherent spatial cues to accurately ground human motion reconstruction. Each reconstruction relies solely on a monocular RGB video from a stationary camera. SHARE first estimates a human mesh and segmentation mask for every frame, alongside a scene point map at keyframes. It iteratively refines the human’s positions at these keyframes by comparing the human mesh against the human point map extracted from the scene using the mask. Crucially, we also ensure that non-keyframe human meshes remain consistent by preserving their relative root joint positions to keyframe root joints during optimization. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos. Extensive experiments demonstrate that SHARE outperforms existing methods.
[125] Cortical-SSM: A Deep State Space Model for EEG and ECoG Motor Imagery Decoding
Shuntaro Suzuki, Shunya Nagashima, Masayuki Hirata, Komei Sugiura
Main category: cs.CV
TL;DR: Cortical-SSM is a novel architecture using deep state space models to classify EEG/ECoG signals by capturing temporal, spatial, and frequency dependencies, outperforming baselines on motor imagery datasets.
Details
Motivation: EEG and ECoG signals for motor imagery classification have applications in communication assistance and rehabilitation but face challenges from physiological artifacts and limitations of Transformer models in capturing fine-grained dependencies.
Method: Cortical-SSM extends deep state space models to capture integrated dependencies across the temporal, spatial, and frequency domains of EEG/ECoG signals.
Result: Outperformed baseline methods on three benchmarks: two large-scale public MI EEG datasets with 50+ subjects and a clinical MI ECoG dataset from an ALS patient. Visual explanations show the model captures neurophysiologically relevant regions.
Conclusion: Cortical-SSM effectively handles EEG/ECoG signal classification by capturing multi-domain dependencies and provides interpretable results that align with neurophysiological understanding.
Abstract: Classification of electroencephalogram (EEG) and electrocorticogram (ECoG) signals obtained during motor imagery (MI) has substantial application potential, including for communication assistance and rehabilitation support for patients with motor impairments. These signals remain inherently susceptible to physiological artifacts (e.g., eye blinking, swallowing), which pose persistent challenges. Although Transformer-based approaches for classifying EEG and ECoG signals have been widely adopted, they often struggle to capture fine-grained dependencies within them. To overcome these limitations, we propose Cortical-SSM, a novel architecture that extends deep state space models to capture integrated dependencies of EEG and ECoG signals across temporal, spatial, and frequency domains. We validated our method across three benchmarks: 1) two large-scale public MI EEG datasets containing more than 50 subjects, and 2) a clinical MI ECoG dataset recorded from a patient with amyotrophic lateral sclerosis. Our method outperformed baseline methods on the three benchmarks. Furthermore, visual explanations derived from our model indicate that it effectively captures neurophysiologically relevant regions of both EEG and ECoG signals.
[126] Adaptive transfer learning for surgical tool presence detection in laparoscopic videos through gradual freezing fine-tuning
Ana Davila, Jacinto Colan, Yasuhisa Hasegawa
Main category: cs.CV
TL;DR: A novel staged adaptive fine-tuning approach with linear probing and gradual freezing improves surgical tool detection in minimally invasive surgery, achieving 96.4% mAP on Cholec80 dataset and demonstrating cross-domain applicability.
Details
Motivation: Limited annotated data in surgical settings makes it challenging to train robust deep learning models for automated surgical tool detection, which could significantly benefit minimally invasive surgery.
Method: A two-step approach: a linear probing stage to condition classification layers on a pre-trained CNN, and a gradual freezing stage to dynamically reduce the number of fine-tunable layers. Uses CNN architectures (ResNet-50, DenseNet-121) pre-trained on ImageNet.
Result: Achieved 96.4% mean average precision on Cholec80 dataset, outperforming existing approaches. Method also validated on CATARACTS dataset, confirming cross-domain applicability in ophthalmic surgery.
Conclusion: Gradual freezing fine-tuning is a promising technique for improving tool detection in diverse surgical procedures and may have broader applications in general image classification tasks.
Abstract: Minimally invasive surgery can benefit significantly from automated surgical tool detection, enabling advanced analysis and assistance. However, the limited availability of annotated data in surgical settings poses a challenge for training robust deep learning models. This paper introduces a novel staged adaptive fine-tuning approach consisting of two steps: a linear probing stage to condition additional classification layers on a pre-trained CNN-based architecture and a gradual freezing stage to dynamically reduce the fine-tunable layers, aiming to regulate adaptation to the surgical domain. This strategy reduces network complexity and improves efficiency, requiring only a single training loop and eliminating the need for multiple iterations. We validated our method on the Cholec80 dataset, employing CNN architectures (ResNet-50 and DenseNet-121) pre-trained on ImageNet for detecting surgical tools in cholecystectomy endoscopic videos. Our results demonstrate that our method improves detection performance compared to existing approaches and established fine-tuning techniques, achieving a mean average precision (mAP) of 96.4%. To assess its broader applicability, the generalizability of the fine-tuning strategy was further confirmed on the CATARACTS dataset, a distinct domain of minimally invasive ophthalmic surgery. These findings suggest that gradual freezing fine-tuning is a promising technique for improving tool presence detection in diverse surgical procedures and may have broader applications in general image classification tasks.
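The two-stage recipe is straightforward to express in PyTorch: linear probing first, then a schedule that progressively re-freezes backbone blocks. A minimal sketch, where the 7-class head reflects Cholec80's seven surgical tools and the freezing schedule is an assumption:

```python
# A minimal sketch of the two-stage recipe: (1) linear probing on a frozen
# backbone, then (2) gradual freezing that shrinks the set of fine-tunable
# layers as training proceeds. The freezing schedule is an assumption.
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, 7)  # 7 tool classes in Cholec80

# Stage 1: linear probing. Train only the new classification head.
for p in model.parameters():
    p.requires_grad_(False)
for p in model.fc.parameters():
    p.requires_grad_(True)
# ... train the head for a few epochs ...

# Stage 2: unfreeze, then progressively re-freeze blocks from the bottom up.
blocks = [model.layer1, model.layer2, model.layer3, model.layer4]
for p in model.parameters():
    p.requires_grad_(True)

def gradual_freeze(epoch: int, freeze_every: int = 2) -> None:
    """Freeze one more low-level block every `freeze_every` epochs."""
    for block in blocks[: epoch // freeze_every]:
        for p in block.parameters():
            p.requires_grad_(False)
```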
[127] FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
Haisheng Su, Junjie Zhang, Feixiang Song, Sanping Zhou, Wei Wu, Nanning Zheng, Junchi Yan
Main category: cs.CV
TL;DR: FreqPDE introduces frequency-aware positional depth embedding to enhance 3D object detection from multi-view 2D images by combining high-frequency edge details and low-frequency semantics, with cross-view scale-invariant depth prediction and hybrid depth supervision.
Details
Motivation: Current methods rely on depth prediction with LiDAR supervision, but suffer from depth discontinuity at object boundaries and poor detection of small objects due to sparse supervision and use of high-level features. Cross-view consistency and scale invariance are also overlooked.
Method: Three main modules: Frequency-aware Spatial Pyramid Encoder (FSPE) combines high-frequency edge and low-frequency semantic features; Cross-view Scale-invariant Depth Predictor (CSDP) estimates pixel-level depth with cross-view attention; Positional Depth Encoder (PDE) generates 3D depth-aware features. Uses hybrid depth supervision from both metric and distribution perspectives.
Result: Extensive experiments on nuScenes dataset demonstrate the effectiveness and superiority of the proposed method.
Conclusion: FreqPDE successfully addresses depth quality issues in 3D object detection by incorporating frequency-aware features, cross-view consistency, and scale invariance, achieving improved performance without requiring explicit LiDAR supervision during training.
Abstract: Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory such as depth discontinuity of object boundaries and indistinction of small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for 3D detection transformer decoder, which can be obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanism. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.
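The high/low-frequency decomposition behind FSPE can be approximated with a blur-based split, a common stand-in rather than the authors' exact operator:

```python
# A minimal sketch of splitting a feature map into low-frequency semantics and
# high-frequency edge cues (the intuition behind FSPE). The blur-based split is
# a common stand-in, not the authors' exact operator.
import torch
import torch.nn.functional as F

def frequency_split(feat: torch.Tensor, kernel_size: int = 5):
    # feat: (B, C, H, W). Low-pass with average pooling; high-pass is the residual.
    low = F.avg_pool2d(feat, kernel_size, stride=1, padding=kernel_size // 2)
    high = feat - low
    return low, high

low, high = frequency_split(torch.randn(2, 64, 32, 32))
```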
[128] PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction
Ting-Yu Yen, Yu-Sheng Chiu, Shih-Hsuan Hung, Peter Wonka, Hung-Kuo Chu
Main category: cs.CV
TL;DR: PFGS is a pose-aware 3D Gaussian Splatting framework that reconstructs complete 3D objects by fusing multi-pose image captures through intelligent registration strategies.
Details
Motivation: Existing 3DGS methods assume a single static pose, leading to incomplete reconstructions that miss occluded regions. PFGS addresses the practical need for complete object reconstruction from multi-pose captures.
Method: Iteratively fuses auxiliary pose images into a unified 3DGS representation using pose-aware fusion with global/local registration. Leverages background features for per-pose camera estimation and foundation models for cross-pose registration.
Result: PFGS consistently outperforms baselines in qualitative and quantitative evaluations, producing more complete reconstructions and higher-fidelity 3DGS models.
Conclusion: PFGS successfully overcomes limitations of existing methods by intelligently incorporating foundation models into registration, resolving background inconsistency issues and enabling complete object reconstruction from multi-pose captures.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality, real-time novel-view synthesis from multi-view images. However, most existing methods assume the object is captured in a single, static pose, resulting in incomplete reconstructions that miss occluded or self-occluded regions. We introduce PFGS, a pose-aware 3DGS framework that addresses the practical challenge of reconstructing complete objects from multi-pose image captures. Given images of an object in one main pose and several auxiliary poses, PFGS iteratively fuses each auxiliary set into a unified 3DGS representation of the main pose. Our pose-aware fusion strategy combines global and local registration to merge views effectively and refine the 3DGS model. While recent advances in 3D foundation models have improved registration robustness and efficiency, they remain limited by high memory demands and suboptimal accuracy. PFGS overcomes these challenges by incorporating them more intelligently into the registration process: it leverages background features for per-pose camera pose estimation and employs foundation models for cross-pose registration. This design captures the best of both approaches while resolving background inconsistency issues. Experimental results demonstrate that PFGS consistently outperforms strong baselines in both qualitative and quantitative evaluations, producing more complete reconstructions and higher-fidelity 3DGS models.
[129] LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding
Peng Ren, Hai Yang
Main category: cs.CV
TL;DR: LILAC enables real-time, long-sequence arbitrary motion stylization using a streaming VAE-Diffusion architecture with causal decoding, achieving smooth transitions without future frame dependency.
Details
Motivation: Existing streaming motion generation approaches suffer from computational overhead and temporal stability issues, while high-quality VAE-Diffusion methods are limited to offline processing, leaving a gap for real-time applications.
Method: Extends an offline VAE-Diffusion framework to the online setting using a latent-space streaming architecture with a sliding-window causal design and decoded motion feature injection for smooth transitions.
Result: Achieves real-time arbitrary motion stylization without future frame dependency, maintaining good balance between stylization quality and responsiveness on benchmark datasets.
Conclusion: LILAC successfully bridges the gap between offline quality and real-time performance for motion stylization, enabling continuous and responsive character control in applications.
Abstract: Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: https://pren1.github.io/lilac/
[130] MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment
Bingyu Li, Feiyu Wang, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Main category: cs.CV
TL;DR: The paper introduces MARIS, the first large-scale underwater open-vocabulary instance segmentation benchmark, and proposes a unified framework with geometric and semantic components to address visual degradation and semantic misalignment in underwater scenes.
Details
Motivation: Existing underwater instance segmentation approaches are limited to close-vocabulary prediction and cannot recognize novel marine categories. Transferring open-vocabulary segmentation from natural images to underwater scenes suffers from severe visual degradation and semantic misalignment.
Method: Proposes a unified framework with two components: Geometric Prior Enhancement Module (GPEM) that leverages part-level and structural cues for object consistency under degraded visual conditions, and Semantic Alignment Injection Mechanism (SAIM) that enriches language embeddings with domain-specific priors.
Result: The framework consistently outperforms existing open-vocabulary baselines in both In-Domain and Cross-Domain settings on the MARIS benchmark.
Conclusion: Establishes a strong foundation for future underwater perception research by addressing key challenges in underwater open-vocabulary instance segmentation.
Abstract: Most existing underwater instance segmentation approaches are constrained by close-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce \textbf{MARIS} (\underline{Mar}ine Open-Vocabulary \underline{I}nstance \underline{S}egmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by the lack of underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (\textbf{GPEM}) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (\textbf{SAIM}) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines in both In-Domain and Cross-Domain settings on MARIS, establishing a strong foundation for future underwater perception research.
[131] Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning
Chen Qian, Haoyu Zhang, Junnan Ma, Liuhong Zhu, Qingrui Cai, Yu Wang, Ruibo Song, Lv Li, Lin Mei, Xianwang Jiang, Qin Xu, Boyu Jiang, Ran Tao, Chunmiao Chen, Shufang Chen, Dongyun Liang, Qiu Guo, Jianzhong Lin, Taishan Kang, Mengtian Lu, Liyuan Fu, Ruibin Huang, Huijuan Wan, Xu Huang, Jianhua Wang, Di Guo, Hai Zhong, Jianjun Zhou, Xiaobo Qu
Main category: cs.CV
TL;DR: LoSP-Prompt is a reconstruction framework that enables high-resolution multi-shot DWI for body-wide tumor diagnostics by overcoming motion artifacts through physics-informed modeling and synthetic-data-driven prompt learning.
Details
Motivation: Clinical adoption of multi-shot DWI is limited by severe motion-induced phase artifacts from respiration and peristalsis, compounded by multi-organ, multi-slice, multi-direction and multi-b-value complexities.
Method: Models inter-shot phase variations as a high-order Locally Smooth Phase (LoSP) integrated into a low-rank Hankel matrix reconstruction. The algorithm’s rank parameter is automatically set via prompt learning trained exclusively on synthetic abdominal DWI data emulating physiological motion.
Result: Achieved twice the spatial resolution of clinical single-shot DWI, enhanced liver lesion conspicuity, generalized to seven anatomical regions with a single model, and outperformed state-of-the-art methods in image quality, artifact suppression, and noise reduction (4-5 points on 5-point scale).
Conclusion: The approach eliminates the need for navigator signals and for supervision with real data, providing an interpretable, robust solution for high-resolution multi-organ multi-shot DWI with transformative potential for precision oncology.
Abstract: Clinical adoption of multi-shot diffusion-weighted magnetic resonance imaging (multi-shot DWI) for body-wide tumor diagnostics is limited by severe motion-induced phase artifacts from respiration, peristalsis, and so on, compounded by multi-organ, multi-slice, multi-direction and multi-b-value complexities. Here, we introduce a reconstruction framework, LoSP-Prompt, that overcomes these challenges through physics-informed modeling and synthetic-data-driven prompt learning. We model inter-shot phase variations as a high-order Locally Smooth Phase (LoSP), integrated into a low-rank Hankel matrix reconstruction. Crucially, the algorithm’s rank parameter is automatically set via prompt learning trained exclusively on synthetic abdominal DWI data emulating physiological motion. Validated across 10,000+ clinical images (43 subjects, 4 scanner models, 5 centers), LoSP-Prompt: (1) Achieved twice the spatial resolution of clinical single-shot DWI, enhancing liver lesion conspicuity; (2) Generalized to seven diverse anatomical regions (liver, kidney, sacroiliac, pelvis, knee, spinal cord, brain) with a single model; (3) Outperformed state-of-the-art methods in image quality, artifact suppression, and noise reduction (11 radiologists’ evaluations on a 5-point scale, $p<0.05$), achieving 4-5 points (excellent) on kidney DWI, 4 points (good to excellent) on liver, sacroiliac and spinal cord DWI, and 3-4 points (good) on knee and brain tumor DWI. The approach eliminates the need for navigator signals and for supervision with real data, providing an interpretable, robust solution for high-resolution multi-organ multi-shot DWI. Its scanner-agnostic performance signifies transformative potential for precision oncology.
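The low-rank Hankel structure that LoSP-Prompt exploits can be illustrated in one dimension: a smooth phase makes the Hankel matrix of the signal (approximately) low-rank. A toy NumPy example, not the paper's multi-shot formulation:

```python
import numpy as np

def hankel_matrix(signal: np.ndarray, window: int) -> np.ndarray:
    """Stack sliding windows of a 1-D signal into a Hankel matrix."""
    rows = len(signal) - window + 1
    return np.stack([signal[i:i + window] for i in range(rows)])

# A linear phase gives a rank-1 Hankel matrix; LoSP generalizes this to
# high-order, locally smooth phases across shots.
x = np.exp(1j * 0.1 * np.arange(64))
H = hankel_matrix(x, window=16)
print(np.linalg.matrix_rank(H, tol=1e-6))  # -> 1
```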
[132] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
Main category: cs.CV
TL;DR: LoD is a framework that detects unknown jailbreak attacks in LVLMs by shifting from attack-specific to task-specific learning, using safety-oriented representation and unsupervised classification.
Details
Motivation: Existing jailbreak detection methods either lack generalization to unseen attacks or have limited accuracy and efficiency due to attack-specific learning or heuristic approaches.
Method: Proposes Learning to Detect (LoD) with Multi-modal Safety Concept Activation Vector for safety representation learning and Safety Pattern Auto-Encoder for unsupervised attack classification.
Result: Achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency compared to existing methods.
Conclusion: LoD provides an effective general framework for detecting unknown jailbreak attacks in LVLMs by focusing on task-specific learning rather than attack-specific learning.
Abstract: Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
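A minimal sketch of the unsupervised-classification half of such a pipeline: an auto-encoder fitted to benign safety representations, whose reconstruction error serves as the attack score. Dimensions and thresholding below are our assumptions, not LoD's released code.

```python
import torch
import torch.nn as nn

class SafetyPatternAE(nn.Module):
    """Auto-encoder over safety-oriented representations; high reconstruction
    error flags a suspected jailbreak (hyperparameters assumed)."""
    def __init__(self, dim: int = 256, bottleneck: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.dec = nn.Linear(bottleneck, dim)

    def score(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, D) representations of incoming prompts; train the AE on
        # benign data only, then threshold this score at inference time.
        return ((self.dec(self.enc(h)) - h) ** 2).mean(dim=-1)
```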
[133] Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety
Huan Chen, Ting Han, Siyu Chen, Zhihao Guo, Yiping Chen, Meiliu Wu
Main category: cs.CV
TL;DR: Semantic4Safety uses zero-shot semantic segmentation on street-view imagery to create 11 streetscape indicators and analyzes 30,000 accident records in Austin using XGBoost, SHAP, and causal inference methods to identify accident-type-specific risk factors.
Details
Motivation: Address two key challenges in using street-view imagery for traffic risk analysis: constructing meaningful accident-related indicators and quantifying their causal impacts across different accident types.
Method: Proposed Semantic4Safety framework that applies zero-shot semantic segmentation to derive 11 interpretable streetscape indicators, integrates road type context, uses XGBoost multi-class classifier with SHAP for interpretation, and applies GPS weighting and ATE estimation for causal inference.
Result: Found heterogeneous, accident-type-specific causal patterns: scene complexity, exposure, and roadway geometry features dominate predictive power; larger drivable area and emergency space reduce risk, while excessive visual openness increases risk.
Conclusion: Semantic4Safety bridges predictive modeling with causal inference to support targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.
Abstract: Street-view imagery (SVI) offers a fine-grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street-level indicators that capture accident-related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero-shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi-class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.
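For intuition on the causal-inference step, here is a binary-treatment simplification of propensity-weighted ATE estimation; the paper uses a generalized propensity score for continuous treatments, so this sketch is illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X: np.ndarray, t: np.ndarray, y: np.ndarray) -> float:
    """Inverse-propensity-weighted average treatment effect (binary t)."""
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    w1, w0 = t / ps, (1 - t) / (1 - ps)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
t = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # confounded treatment
y = 2.0 * t + X[:, 0] + rng.normal(size=500)          # true effect = 2.0
print(ipw_ate(X, t, y))  # ~2.0 once confounding is reweighted away
```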
[134] Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation
Feifei Zhang, Zhenhong Jia, Sensen Song, Fei Shi, Dayong Ren
Main category: cs.CV
TL;DR: PCMambaNet introduces a Predictive-Corrective paradigm that decouples modeling to accelerate learning, achieving state-of-the-art brain MRI segmentation accuracy in just 1-5 epochs by leveraging anatomical knowledge and focusing computational resources on diagnostically relevant regions.
Details
Motivation: End-to-end deep learning suffers from slow convergence and heavy reliance on large datasets, limiting efficiency in data-scarce domains like medical imaging where traditional methods require extensive training.
Method: PCMambaNet uses two modules: Predictive Prior Module (PPM) generates coarse approximations using bilateral symmetry to identify diagnostically relevant asymmetric regions, and Corrective Residual Network (CRN) learns residual errors to refine challenging regions and pathological boundaries.
Result: Extensive experiments on high-resolution brain MRI segmentation show PCMambaNet achieves state-of-the-art accuracy while converging within only 1-5 epochs, a performance unattainable by conventional end-to-end models.
Conclusion: The Predictive-Corrective paradigm effectively mitigates data inefficiency and overfitting by explicitly incorporating domain knowledge to simplify learning objectives, enabling dramatic acceleration in medical imaging tasks.
Abstract: Despite the remarkable success of the end-to-end paradigm in deep learning, it often suffers from slow convergence and heavy reliance on large-scale datasets, which fundamentally limits its efficiency and applicability in data-scarce domains such as medical imaging. In this work, we introduce the Predictive-Corrective (PC) paradigm, a framework that decouples the modeling task to fundamentally accelerate learning. Building upon this paradigm, we propose a novel network, termed PCMambaNet. PCMambaNet is composed of two synergistic modules. First, the Predictive Prior Module (PPM) generates a coarse approximation at low computational cost, thereby anchoring the search space. Specifically, the PPM leverages anatomical knowledge (bilateral symmetry) to predict a ‘focus map’ of diagnostically relevant asymmetric regions. Next, the Corrective Residual Network (CRN) learns to model the residual error, focusing the network’s full capacity on refining these challenging regions and delineating precise pathological boundaries. Extensive experiments on high-resolution brain MRI segmentation demonstrate that PCMambaNet achieves state-of-the-art accuracy while converging within only 1-5 epochs, a performance unattainable by conventional end-to-end models. This dramatic acceleration highlights that by explicitly incorporating domain knowledge to simplify the learning objective, PCMambaNet effectively mitigates data inefficiency and overfitting.
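A toy PyTorch rendering of the predictive-corrective split: a cheap symmetry-based prior anchors the output, and a small network learns only the residual. Module names and sizes are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PredictiveCorrective(nn.Module):
    """Prior from bilateral asymmetry + learned residual correction
    (a toy stand-in for PPM/CRN)."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.crn = nn.Sequential(
            nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, H, W) midline-aligned brain MRI slice (assumption)
        focus = (x - torch.flip(x, dims=[-1])).abs()   # asymmetry 'focus map'
        residual = self.crn(torch.cat([x, focus], 1))  # refine hard regions
        return focus + residual
```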
[135] Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang
Main category: cs.CV
TL;DR: Proposes EARL, an evidence-aware reinforcement learning framework for Video LLMs that dynamically selects relevant frames and performs localized re-sampling to improve long-form video reasoning by prioritizing evidence purity.
Details
Motivation: Existing Video LLMs suffer from information dilution due to static uniform frame sampling and lack rigorous reward mechanisms for evidence purity in pixel-space video reasoning agents.
Method: Evidence-aware reinforcement learning (EARL) framework that transforms the model into an active interrogator, dynamically selecting relevant frames and performing localized re-sampling around key frames for fine-grained temporal detail.
Result: Achieves state-of-the-art performance on five benchmarks: 59.8% on LongVideoBench, 69.0% on MVBench, and 64.9% on VideoMME with a 7B model.
Conclusion: The framework demonstrates the importance of prioritizing evidence purity and the effectiveness of dynamic frame selection with localized re-sampling for improved video reasoning.
Abstract: Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: “Select Less, Reason More.” Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
[136] MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention
Nengbo Zhang, Hann Woei Ho
Main category: cs.CV
TL;DR: MAVR-Net is a multi-view learning framework for Micro Aerial Vehicle action recognition that combines RGB frames, optical flow, and segmentation masks with cross-view attention to improve motion recognition accuracy.
Details
Motivation: Vision-based recognition models using only RGB data often fail to capture complex spatial-temporal characteristics of MAV motion, limiting their ability to distinguish different actions.
Method: Combines three complementary data types (RGB frames, optical flow, segmentation masks) using ResNet-based encoders, multi-scale feature pyramid, cross-view attention module, and multi-view alignment loss for semantic consistency.
Result: Achieved 97.8%, 96.5%, and 92.8% accuracy on Short MAV, Medium MAV, and Long MAV benchmark datasets, clearly outperforming existing approaches.
Conclusion: The multi-view approach with cross-view attention effectively improves MAV motion recognition robustness and accuracy by leveraging complementary data modalities.
Abstract: Recognizing the motion of Micro Aerial Vehicles (MAVs) is crucial for enabling cooperative perception and control in autonomous aerial swarms. Yet, vision-based recognition models relying only on RGB data often fail to capture the complex spatial temporal characteristics of MAV motion, which limits their ability to distinguish different actions. To overcome this problem, this paper presents MAVR-Net, a multi-view learning-based MAV action recognition framework. Unlike traditional single-view methods, the proposed approach combines three complementary types of data, including raw RGB frames, optical flow, and segmentation masks, to improve the robustness and accuracy of MAV motion recognition. Specifically, ResNet-based encoders are used to extract discriminative features from each view, and a multi-scale feature pyramid is adopted to preserve the spatiotemporal details of MAV motion patterns. To enhance the interaction between different views, a cross-view attention module is introduced to model the dependencies among various modalities and feature scales. In addition, a multi-view alignment loss is designed to ensure semantic consistency and strengthen cross-view feature representations. Experimental results on benchmark MAV action datasets show that our method clearly outperforms existing approaches, achieving 97.8%, 96.5%, and 92.8% accuracy on the Short MAV, Medium MAV, and Long MAV datasets, respectively.
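The cross-view attention module can be sketched with standard multi-head attention, letting one view's tokens query another's; the interface below is assumed, not MAVR-Net's released code.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """One view attends to another (e.g., RGB queries optical-flow tokens),
    with a residual connection and LayerNorm."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, flow_tokens: torch.Tensor):
        # rgb_tokens: (B, N, D); flow_tokens: (B, M, D)
        fused, _ = self.attn(rgb_tokens, flow_tokens, flow_tokens)
        return self.norm(rgb_tokens + fused)
```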
DPTrack: Directional Kernel-Guided Prompt Learning for Robust Nighttime Aerial Tracking
Zhiqiang Zhu, Xinbo Gao, Wen Lu, Jie Li, Zhaoyang Wang, Mingqian Ge
Main category: cs.CV
TL;DR: DPTrack is a prompt-based aerial tracker for nighttime scenarios that uses directional kernels enriched with fine-grained attribute cues to generate precise prompts, overcoming limitations of existing trackers that rely solely on spatial localization supervision.
Details
Motivation: Existing nighttime aerial trackers based on prompt learning perform poorly because they rely only on spatial localization supervision, which fails to provide fine-grained cues for target features and produces vague prompts.
Method: DPTrack hierarchically captures the object’s topological structure to enrich feature representation, encodes topology-aware features into directional kernels that encapsulate fine-grained attribute cues, and uses a kernel-guided prompt module with channel-category correspondence to propagate kernels across search regions and generate precise prompts with spatial gating.
Result: Extensive evaluations on established benchmarks demonstrate DPTrack’s superior performance compared to existing nighttime aerial trackers.
Conclusion: DPTrack effectively addresses the limitations of existing prompt-based nighttime aerial trackers by incorporating fine-grained attribute cues through directional kernels, enabling more accurate target feature localization and robust nighttime tracking performance.
Abstract: Existing nighttime aerial trackers based on prompt learning rely solely on spatial localization supervision, which fails to provide fine-grained cues that point to target features and inevitably produces vague prompts. This limitation impairs the tracker’s ability to accurately focus on the object features and results in trackers still performing poorly. To address this issue, we propose DPTrack, a prompt-based aerial tracker designed for nighttime scenarios by encoding the given object’s attribute features into the directional kernel enriched with fine-grained cues to generate precise prompts. Specifically, drawing inspiration from visual bionics, DPTrack first hierarchically captures the object’s topological structure, leveraging topological attributes to enrich the feature representation. Subsequently, an encoder condenses these topology-aware features into the directional kernel, which serves as the core guidance signal that explicitly encapsulates the object’s fine-grained attribute cues. Finally, a kernel-guided prompt module built on channel-category correspondence attributes propagates the kernel across the features of the search region to pinpoint the positions of target features and convert them into precise prompts, integrating spatial gating for robust nighttime tracking. Extensive evaluations on established benchmarks demonstrate DPTrack’s superior performance. Our code will be available at https://github.com/zzq-vipsl/DPTrack.
[138] Improving Micro-Expression Recognition with Phase-Aware Temporal Augmentation
Vu Tram Anh Khuong, Luu Tu Nguyen, Thanh Ha Le, Thi Duyen Ngo
Main category: cs.CV
TL;DR: A phase-aware temporal augmentation method that decomposes micro-expression sequences into onset-to-apex and apex-to-offset phases, generating separate dynamic images for each phase to improve micro-expression recognition performance.
Details
Motivation: Micro-expression recognition is limited by scarce annotated datasets and existing methods overlook temporal augmentation strategies that can better exploit motion characteristics.
Method: Proposes dual-phase dynamic image augmentation that decomposes expression sequences into onset-to-apex and apex-to-offset phases, generating separate dynamic images for each phase to enrich motion diversity.
Result: Achieves consistent performance improvements across six deep architectures on CASME-II and SAMM datasets, with up to 10% relative improvement when combined with spatial augmentations.
Conclusion: The proposed temporal augmentation is simple, model-agnostic, and effective for robust micro-expression recognition in low-resource settings.
Abstract: Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, typically lasting less than half a second. Recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. Although deep learning has enabled significant advances in micro-expression recognition (MER), its effectiveness is limited by the scarcity of annotated ME datasets. This data limitation not only hinders generalization but also restricts the diversity of motion patterns captured during training. Existing MER studies predominantly rely on simple spatial augmentations (e.g., flipping, rotation) and overlook temporal augmentation strategies that can better exploit motion characteristics. To address this gap, this paper proposes a phase-aware temporal augmentation method based on dynamic image. Rather than encoding the entire expression as a single onset-to-offset dynamic image (DI), our approach decomposes each expression sequence into two motion phases: onset-to-apex and apex-to-offset. A separate DI is generated for each phase, forming a Dual-phase DI augmentation strategy. These phase-specific representations enrich motion diversity and introduce complementary temporal cues that are crucial for recognizing subtle facial transitions. Extensive experiments on CASME-II and SAMM datasets using six deep architectures, including CNNs, Vision Transformer, and the lightweight LEARNet, demonstrate consistent performance improvements in recognition accuracy, unweighted F1-score, and unweighted average recall, which are crucial for addressing class imbalance in MER. When combined with spatial augmentations, our method achieves up to a 10% relative improvement. The proposed augmentation is simple, model-agnostic, and effective in low-resource settings, offering a promising direction for robust and generalizable MER.
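Dynamic images are commonly computed with approximate rank pooling (Bilen et al.'s linear weights); the dual-phase strategy then amounts to splitting the sequence at the apex frame. A NumPy sketch under those assumptions:

```python
import numpy as np

def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Approximate rank-pooled dynamic image over frames of shape (T, H, W[, C])."""
    T = frames.shape[0]
    w = 2 * np.arange(1, T + 1) - T - 1          # linear rank-pooling weights
    return np.tensordot(w, frames, axes=(0, 0))

def dual_phase_dis(seq: np.ndarray, apex_idx: int):
    """Encode onset-to-apex and apex-to-offset phases as separate DIs."""
    return dynamic_image(seq[:apex_idx + 1]), dynamic_image(seq[apex_idx:])
```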
[139] MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-from-Motion in Driving Scenes
Lingfeng Xuan, Chang Nie, Yiqing Xu, Zhe Liu, Yanzi Miao, Hesheng Wang
Main category: cs.CV
TL;DR: MRASfM is a specialized Structure from Motion framework for multi-camera driving scenes that improves pose estimation reliability, removes road surface outliers, and boosts efficiency through multi-camera unit optimization and scene aggregation.
Details
Motivation: Standard SfM struggles with driving scenes due to unreliable pose estimation, excessive road surface outliers, and low efficiency when applied to multi-camera systems on vehicles.
Method: Leverages fixed spatial relationships between cameras for reliable pose estimation, uses plane model to remove road surface outliers, treats multi-camera set as single unit in Bundle Adjustment, and employs coarse-to-fine scene association and assembly for multi-scene aggregation.
Result: Achieves 0.124 absolute pose error on nuScenes dataset, demonstrates robustness in challenging conditions, and shows state-of-the-art performance in large-scale validation.
Conclusion: MRASfM effectively addresses the specific challenges of applying SfM to driving scenes with multi-camera systems, providing reliable pose estimation, clean reconstruction, and improved efficiency.
Abstract: Structure from Motion (SfM) estimates camera poses and reconstructs point clouds, forming a foundation for various tasks. However, applying SfM to driving scenes captured by multi-camera systems presents significant difficulties, including unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. To address these limitations, we propose a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework specifically designed for driving scenes. MRASfM enhances the reliability of camera pose estimation by leveraging the fixed spatial relationships within the multi-camera system during the registration process. To improve the quality of road surface reconstruction, our framework employs a plane model to effectively remove erroneous points from the triangulated road surface. Moreover, treating the multi-camera set as a single unit in Bundle Adjustment (BA) helps reduce optimization variables to boost efficiency. In addition, MRASfM achieves multi-scene aggregation through scene association and assembly modules in a coarse-to-fine fashion. We deployed multi-camera systems on actual vehicles to validate the generalizability of MRASfM across various scenes and its robustness in challenging conditions through real-world applications. Furthermore, large-scale validation results on public datasets show the state-of-the-art performance of MRASfM, achieving 0.124 absolute pose error on the nuScenes dataset.
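The road-surface cleanup step can be pictured as a plane fit plus a distance filter; a least-squares sketch follows (the paper's exact plane model and thresholds may differ).

```python
import numpy as np

def remove_road_outliers(pts: np.ndarray, thresh: float = 0.05) -> np.ndarray:
    """Keep triangulated road points within `thresh` meters of the fitted plane."""
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)   # pts: (N, 3)
    normal = vt[-1]                            # smallest singular vector = plane normal
    dist = np.abs((pts - centroid) @ normal)   # point-to-plane distances
    return pts[dist < thresh]
```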
[140] Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
Syed Abdul Gaffar Shakhadri, Kruthika KR, Kartik Basavaraj Angadi
Main category: cs.CV
TL;DR: Shakti VLM is a family of vision-language models (1B and 4B parameters) that achieves competitive performance through architectural innovations and efficient training strategies rather than massive data volumes.
Details
Motivation: To address data efficiency challenges in multimodal learning by developing models that achieve strong performance without requiring extensive training data like recent VLMs.
Method: Uses architectural innovations including QK-Normalization for attention stability, hybrid normalization techniques, enhanced positional encoding, and a three-stage training strategy to optimize learning efficiency.
Result: Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning, demonstrating competitive performance with fewer tokens.
Conclusion: High performance in multimodal learning can be achieved through thoughtful model design and training strategies rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.
Abstract: We introduce Shakti VLM, a family of vision-language models in the capacity of 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.
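QK-Normalization is usually formulated as L2-normalizing queries and keys before the dot product so attention logits stay bounded; Shakti's exact variant may differ. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, scale: float = 10.0):
    """Scaled dot-product attention with L2-normalized queries and keys.
    `scale` stands in for the usual learnable temperature (assumption)."""
    q = F.normalize(q, dim=-1)                                      # (B, N, D)
    k = F.normalize(k, dim=-1)                                      # (B, M, D)
    attn = torch.softmax(scale * q @ k.transpose(-2, -1), dim=-1)   # bounded logits
    return attn @ v
```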
[141] MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
Jinghao Huang, Yaxiong Chen, Ganchao Liu
Main category: cs.CV
TL;DR: Proposes MSAM for drone video-text retrieval, addressing unique challenges of drone videos through multi-semantic adaptive learning and cross-modal feature fusion.
Details
Motivation: Drone videos have overhead perspectives, structural homogeneity, and diverse semantic expressions that challenge existing cross-modal methods designed for ground-level views.
Method: Multi-Semantic Adaptive Mining (MSAM) with dynamic frame analysis, adaptive semantic construction, distribution-driven learning, diversity terms, and cross-modal interactive feature fusion pooling.
Result: Extensive experiments on self-constructed datasets show MSAM outperforms existing methods in drone video-text retrieval.
Conclusion: MSAM effectively addresses drone video characteristics and provides superior retrieval performance, with code and datasets to be made publicly available.
Abstract: With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.
[142] A Novel Combined Optical Flow Approach for Comprehensive Micro-Expression Recognition
Vu Tram Anh Khuong, Thi Bich Phuong Man, Luu Tu Nguyen, Thanh Ha Le, Thi Duyen Ngo
Main category: cs.CV
TL;DR: This paper proposes a Combined Optical Flow (COF) method that integrates both onset-to-apex and apex-to-offset phases for micro-expression recognition, outperforming single optical flow-based approaches.
Details
Motivation: Most existing Micro-Expression Recognition methods focus only on the onset-to-apex phase and neglect the apex-to-offset phase, which contains important temporal dynamics for emotion recognition.
Method: The study introduces Combined Optical Flow (COF) that integrates both onset-to-apex and apex-to-offset phases to provide more comprehensive motion analysis for micro-expression feature representation.
Result: Experimental results on CASMEII and SAMM datasets demonstrate that COF outperforms single optical flow-based methods in micro-expression recognition performance.
Conclusion: The proposed COF method effectively captures micro-expression dynamics by considering both motion phases, leading to improved recognition accuracy.
Abstract: Facial micro-expressions are brief, involuntary facial movements that reveal hidden emotions. Most Micro-Expression Recognition (MER) methods that rely on optical flow typically focus on the onset-to-apex phase, neglecting the apex-to-offset phase, which holds key temporal dynamics. This study introduces a Combined Optical Flow (COF), integrating both phases to enhance feature representation. COF provides a more comprehensive motion analysis, improving MER performance. Experimental results on CASMEII and SAMM datasets show that COF outperforms single optical flow-based methods, demonstrating its effectiveness in capturing micro-expression dynamics.
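In the simplest reading, COF concatenates the flow fields of the two phases; a sketch using OpenCV's Farneback flow (the paper's flow estimator may differ). Inputs are assumed to be grayscale uint8 face crops.

```python
import cv2
import numpy as np

def combined_optical_flow(onset, apex, offset):
    """Concatenate onset->apex and apex->offset flow fields channel-wise."""
    f1 = cv2.calcOpticalFlowFarneback(onset, apex, None,
                                      0.5, 3, 15, 3, 5, 1.2, 0)
    f2 = cv2.calcOpticalFlowFarneback(apex, offset, None,
                                      0.5, 3, 15, 3, 5, 1.2, 0)
    return np.concatenate([f1, f2], axis=-1)  # (H, W, 4) motion feature
```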
[143] Iterative Motion Compensation for Canonical 3D Reconstruction from UAV Plant Images Captured in Windy Conditions
Andre Rochow, Jonas Marcic, Svetlana Seliunina, Sven Behnke
Main category: cs.CV
TL;DR: A pipeline for generating high-quality 3D reconstructions of agricultural plants using autonomous UAV image capture and iterative motion correction to handle wind and downwash effects.
Details
Motivation: 3D phenotyping is crucial for understanding plant growth, yield prediction, and disease control, but is challenged by environmental factors like wind and UAV downwash.
Method: Uses autonomous UAV image capture with ArUco markers, integrates state-of-the-art 3D reconstruction methods, and employs iterative deformation with optical flow to correct leaf motion between input images and intermediate reconstructions.
Result: The pipeline improves reconstruction quality of state-of-the-art methods, enables extraction of high-resolution 3D meshes, and includes a public dataset of multiple crops captured over time.
Conclusion: The proposed pipeline effectively handles motion challenges in plant 3D reconstruction and will be publicly released with source code and datasets to advance agricultural phenotyping research.
Abstract: 3D phenotyping of plants plays a crucial role for understanding plant growth, yield prediction, and disease control. We present a pipeline capable of generating high-quality 3D reconstructions of individual agricultural plants. To acquire data, a small commercially available UAV captures images of a selected plant. Apart from placing ArUco markers, the entire image acquisition process is fully autonomous, controlled by a self-developed Android application running on the drone’s controller. The reconstruction task is particularly challenging due to environmental wind and downwash of the UAV. Our proposed pipeline supports the integration of arbitrary state-of-the-art 3D reconstruction methods. To mitigate errors caused by leaf motion during image capture, we use an iterative method that gradually adjusts the input images through deformation. Motion is estimated using optical flow between the original input images and intermediate 3D reconstructions rendered from the corresponding viewpoints. This alignment gradually reduces scene motion, resulting in a canonical representation. After a few iterations, our pipeline improves the reconstruction of state-of-the-art methods and enables the extraction of high-resolution 3D meshes. We will publicly release the source code of our reconstruction pipeline. Additionally, we provide a dataset consisting of multiple plants from various crops, captured across different points in time.
[144] Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement
Xianmin Chen, Peiliang Huang, Longfei Han, Dingwen Zhang, Junwei Han
Main category: cs.CV
TL;DR: HiMA: A hierarchical architecture combining Transformer and Mamba modules for efficient low-light RAW image enhancement, with Local Distribution Adjustment and Multi-prior Fusion modules to address uneven illumination and enhance details.
Details
Motivation: To overcome limitations in existing deep learning approaches for low-light RAW image enhancement, particularly the challenge of achieving both strong enhancement quality and high efficiency simultaneously.
Method: Proposes HiMA with hierarchical mixing of Transformer (large-scale features) and Mamba (small-scale features), Local Distribution Adjustment for uneven illumination, and Multi-prior Fusion integrating spatial and frequency-domain priors.
Result: Outperforms state-of-the-art approaches on multiple public datasets with superior performance and fewer parameters.
Conclusion: HiMA provides an efficient and effective solution for low-light RAW image enhancement, successfully addressing the quality-efficiency trade-off through its hierarchical architecture and specialized modules.
Abstract: Low-light RAW image enhancement remains a challenging task. Although numerous deep learning based approaches have been proposed, they still suffer from inherent limitations. A key challenge is how to simultaneously achieve strong enhancement quality and high efficiency. In this paper, we rethink the architecture for efficient low-light image signal processing (ISP) and introduce a Hierarchical Mixing Architecture (HiMA). HiMA leverages the complementary strengths of Transformer and Mamba modules to handle features at large and small scales, respectively, thereby improving efficiency while avoiding the ambiguities observed in prior two-stage frameworks. To further address uneven illumination with strong local variations, we propose Local Distribution Adjustment (LoDA), which adaptively aligns feature distributions across different local regions. In addition, to fully exploit the denoised outputs from the first stage, we design a Multi-prior Fusion (MPF) module that integrates spatial and frequency-domain priors for detail enhancement. Extensive experiments on multiple public datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior performance with fewer parameters. Code will be released at https://github.com/Cynicarlos/HiMA.
[145] Exploring Conditions for Diffusion models in Robotic Control
Heeseong Shin, Byeongho Heo, Dongyoon Han, Seungryong Kim, Taekyung Kim
Main category: cs.CV
TL;DR: ORCA introduces learnable task and visual prompts to adapt pre-trained text-to-image diffusion models for robotic control, achieving state-of-the-art performance without fine-tuning the base model.
Details
Motivation: Pre-trained visual representations are often task-agnostic and frozen during policy learning. The domain gap between diffusion model training data and robotic environments makes naive textual conditioning ineffective for control tasks.
Method: Proposes ORCA with learnable task prompts that adapt to the control environment and visual prompts that capture frame-specific details, enabling task-adaptive representations from pre-trained diffusion models.
Result: Achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods by effectively bridging the domain gap.
Conclusion: Task-adaptive visual representations through carefully designed conditions (task and visual prompts) can effectively leverage pre-trained diffusion models for robotic control without model fine-tuning.
Abstract: While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions (a successful strategy in other vision domains) yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model’s training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.
[146] Latent Feature Alignment: Discovering Biased and Interpretable Subpopulations in Face Recognition Models
Ignacio Serna
Main category: cs.CV
TL;DR: LFA is an attribute-label-free algorithm that uses latent directions to identify biased subpopulations in face recognition models, outperforming standard clustering methods while discovering interpretable semantic attributes.
Details
Motivation: Face recognition models exhibit systematic biases affecting certain subpopulations, but conventional bias evaluation requires expensive labeled attributes limited to predefined categories.
Method: Latent Feature Alignment (LFA) uses latent directions to identify subpopulations, enabling semantically coherent grouping and discovery of interpretable directions corresponding to attributes like age, ethnicity, or attire.
Result: Across four state-of-the-art recognition models and two benchmarks, LFA consistently outperforms k-means and nearest-neighbor search in intra-group semantic coherence while uncovering interpretable latent directions aligned with demographic and contextual attributes.
Conclusion: LFA serves as a practical method for representation auditing of face recognition models, enabling identification and interpretation of biased subpopulations without predefined attribute annotations.
Abstract: Modern face recognition models achieve high overall accuracy but continue to exhibit systematic biases that disproportionately affect certain subpopulations. Conventional bias evaluation frameworks rely on labeled attributes to form subpopulations, which are expensive to obtain and limited to predefined categories. We introduce Latent Feature Alignment (LFA), an attribute-label-free algorithm that uses latent directions to identify subpopulations. This yields two main benefits over standard clustering: (i) semantically coherent grouping, where faces sharing common attributes are grouped together more reliably than by proximity-based methods, and (ii) discovery of interpretable directions, which correspond to semantic attributes such as age, ethnicity, or attire. Across four state-of-the-art recognition models (ArcFace, CosFace, ElasticFace, PartialFC) and two benchmarks (RFW, CelebA), LFA consistently outperforms k-means and nearest-neighbor search in intra-group semantic coherence, while uncovering interpretable latent directions aligned with demographic and contextual attributes. These results position LFA as a practical method for representation auditing of face recognition models, enabling practitioners to identify and interpret biased subpopulations without predefined attribute annotations.
[147] Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training
Aditya Vir
Main category: cs.CV
TL;DR: Custom CNN architectures for satellite land use classification achieve 97.23% accuracy on EuroSAT without pre-trained models, using a novel balanced multi-task attention mechanism that combines spatial and spectral feature extraction.
Details
Motivation: To develop specialized convolutional neural network architectures for satellite imagery classification that don't rely on pre-trained models, addressing specific failure modes in satellite land use classification.
Method: Three progressive architectural iterations: baseline CNN, CBAM-enhanced, and balanced multi-task attention mechanism combining Coordinate Attention for spatial features and Squeeze-Excitation blocks for spectral features with learnable fusion parameter. Uses progressive DropBlock regularization and class-balanced loss weighting.
Result: Achieved 97.23% test accuracy on the EuroSAT dataset, Cohen’s Kappa of 0.9692, all classes exceeding 94.46% accuracy. Performance within 1.34% of fine-tuned ResNet-50 while requiring no external data. Learnable fusion parameter converged to alpha ≈ 0.57.
Conclusion: Systematic architectural design is effective for domain-specific applications, with spatial and spectral modalities having near-equal importance for satellite imagery classification. The approach validates custom CNN architectures as viable alternatives to transfer learning.
Abstract: This work presents a systematic investigation of custom convolutional neural network architectures for satellite land use classification, achieving 97.23% test accuracy on the EuroSAT dataset without reliance on pre-trained models. Through three progressive architectural iterations (baseline: 94.30%, CBAM-enhanced: 95.98%, and balanced multi-task attention: 97.23%) we identify and address specific failure modes in satellite imagery classification. Our principal contribution is a novel balanced multi-task attention mechanism that combines Coordinate Attention for spatial feature extraction with Squeeze-Excitation blocks for spectral feature extraction, unified through a learnable fusion parameter. Experimental results demonstrate that this learnable parameter autonomously converges to alpha approximately 0.57, indicating near-equal importance of spatial and spectral modalities for satellite imagery. We employ progressive DropBlock regularization (5-20% by network depth) and class-balanced loss weighting to address overfitting and confusion pattern imbalance. The final 12-layer architecture achieves Cohen’s Kappa of 0.9692 with all classes exceeding 94.46% accuracy, demonstrating confidence calibration with a 24.25% gap between correct and incorrect predictions. Our approach achieves performance within 1.34% of fine-tuned ResNet-50 (98.57%) while requiring no external data, validating the efficacy of systematic architectural design for domain-specific applications. Complete code, trained models, and evaluation scripts are publicly available.
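The learnable fusion at the heart of the architecture reduces to a convex combination of the two attention branches; squashing the parameter through a sigmoid to keep it in (0, 1) is our assumption, not necessarily the paper's parameterization.

```python
import torch
import torch.nn as nn

class BalancedFusion(nn.Module):
    """alpha * spatial(x) + (1 - alpha) * spectral(x), with alpha learnable.
    The paper reports alpha converging to ~0.57."""
    def __init__(self, spatial_att: nn.Module, spectral_att: nn.Module):
        super().__init__()
        self.spatial, self.spectral = spatial_att, spectral_att
        self.raw_alpha = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.raw_alpha)  # constrain to (0, 1); assumption
        return a * self.spatial(x) + (1 - a) * self.spectral(x)
```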
[148] OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov
Main category: cs.CV
TL;DR: OmniVinci is an open-source omni-modal LLM that achieves superior performance in cross-modal understanding with 6x fewer training tokens than Qwen2.5-Omni, using innovations in model architecture and data curation.
Details
Motivation: To advance machine intelligence by developing multimodal perception capabilities similar to human sensing across vision, audio, and other modalities.
Method: Three key architectural innovations: OmniAlignNet for vision-audio alignment, Temporal Embedding Grouping for relative temporal alignment, and Constrained Rotary Time Embedding for absolute temporal encoding. Plus a curation pipeline generating 24M multimodal conversations.
Result: Outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), using only 0.2T tokens vs 1.2T tokens (6x reduction).
Conclusion: Modalities reinforce each other in perception and reasoning, demonstrating advantages in robotics, medical AI, and smart factory applications.
Abstract: Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni’s 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
[149] Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics
Yitong Li, Ralph Buchert, Benita Schmitz-Koep, Timo Grimmer, Björn Ommer, Dennis M. Hedderich, Igor Yakushev, Christian Wachinger
Main category: cs.CV
TL;DR: SiM2P is a 3D diffusion framework that converts MRI scans into simulated FDG-PET images, significantly improving diagnostic accuracy for dementia from 75.0% to 84.7% while making PET imaging more accessible.
Details
Motivation: FDG-PET is valuable for dementia diagnosis but less accessible and more expensive than MRI. The goal is to leverage widely available MRI to simulate diagnostic-quality PET images.
Method: 3D diffusion bridge-based framework that learns probabilistic mapping from MRI and patient information to simulate FDG-PET images. Can be deployed locally with only 20 site-specific cases and basic demographics.
Result: In blinded clinical study, SiM2P improved diagnostic accuracy from 75.0% to 84.7% (p<0.05) for differentiating Alzheimer’s, frontotemporal dementia, and healthy controls. Simulated PET received higher diagnostic certainty and better interrater agreement than MRI alone.
Conclusion: SiM2P makes PET diagnostic benefits more accessible, potentially improving early dementia detection and differential diagnosis in resource-limited settings.
Abstract: Positron emission tomography (PET) with 18F-Fluorodeoxyglucose (FDG) is an established tool in the diagnostic workup of patients with suspected dementing disorders. However, compared to the routinely available magnetic resonance imaging (MRI), FDG-PET remains significantly less accessible and substantially more expensive. Here, we present SiM2P, a 3D diffusion bridge-based framework that learns a probabilistic mapping from MRI and auxiliary patient information to simulate FDG-PET images of diagnostic quality. In a blinded clinical reader study, two neuroradiologists and two nuclear medicine physicians rated the original MRI and SiM2P-simulated PET images of patients with Alzheimer’s disease, behavioral-variant frontotemporal dementia, and cognitively healthy controls. SiM2P significantly improved the overall diagnostic accuracy of differentiating between three groups from 75.0% to 84.7% (p<0.05). Notably, the simulated PET images received higher diagnostic certainty ratings and achieved superior interrater agreement compared to the MRI images. Finally, we developed a practical workflow for local deployment of the SiM2P framework. It requires as few as 20 site-specific cases and only basic demographic information. This approach makes the established diagnostic benefits of FDG-PET imaging more accessible to patients with suspected dementing disorders, potentially improving early detection and differential diagnosis in resource-limited settings. Our code is available at https://github.com/Yiiitong/SiM2P.
[150] Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation
Xiaoming Zhu, Xu Huang, Qinghongbing Xie, Zhi Deng, Junsheng Yu, Yirui Guan, Zhongyuan Liu, Lin Zhu, Qijun Zhao, Ligang Liu, Long Zeng
Main category: cs.CV
TL;DR: A novel vision-guided 3D layout generation system that uses image generation models and robust image parsing to create coherent 3D scenes from prompts, outperforming existing methods in layout richness and quality.
Details
Motivation: Traditional optimization-based methods are constrained by manual rules, deep generative models struggle with richness and diversity, and LLM-based approaches lack robustness in capturing complex spatial relationships.
Method: 1) Build high-quality asset library with 2,037 scene assets and 147 3D layouts; 2) Use an image generation model to expand prompts into images, fine-tuned to align with the asset library; 3) Develop robust image parsing module to recover 3D layouts; 4) Optimize scene layout using scene graphs and visual semantics.
Result: Extensive user testing shows the algorithm significantly outperforms existing methods in layout richness and quality.
Conclusion: The proposed vision-guided approach effectively addresses limitations of existing methods and produces high-quality 3D scene layouts with logical coherence and visual alignment.
Abstract: Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships. To address these challenges, this paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. Subsequently, we employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We then develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information. Finally, we optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images. Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality. The code and dataset will be available at https://github.com/HiHiAllen/Imaginarium.
[151] Unmasking Facial DeepFakes: A Robust Multiview Detection Framework for Natural Images
Sami Belguesmia, Mohand Saïd Allili, Assia Hamadene
Main category: cs.CV
TL;DR: A multi-view architecture for DeepFake detection that analyzes facial features at global, middle, and local levels with specialized encoders, plus a face orientation encoder for pose robustness.
Details
Motivation: Existing DeepFake detection methods struggle with pose variations, occlusions, and artifacts in real-world conditions, requiring more robust detection approaches.
Method: Proposes a multi-view architecture with three specialized encoders: global view for boundary inconsistencies, middle view for texture/color alignment, and local view for distortions in expressive facial regions. Includes face orientation encoder for pose classification.
Result: Experimental results on challenging datasets show superior performance compared to conventional single-view approaches, achieving robust detection under various pose and lighting conditions.
Conclusion: The multi-view architecture with specialized encoders effectively enhances DeepFake detection by comprehensively analyzing facial features at multiple levels, overcoming limitations of single-view methods.
Abstract: DeepFake technology has advanced significantly in recent years, enabling the creation of highly realistic synthetic face images. Existing DeepFake detection methods often struggle with pose variations, occlusions, and artifacts that are difficult to detect in real-world conditions. To address these challenges, we propose a multi-view architecture that enhances DeepFake detection by analyzing facial features at multiple levels. Our approach integrates three specialized encoders, a global view encoder for detecting boundary inconsistencies, a middle view encoder for analyzing texture and color alignment, and a local view encoder for capturing distortions in expressive facial regions such as the eyes, nose, and mouth, where DeepFake artifacts frequently occur. Additionally, we incorporate a face orientation encoder, trained to classify face poses, ensuring robust detection across various viewing angles. By fusing features from these encoders, our model achieves superior performance in detecting manipulated images, even under challenging pose and lighting conditions. Experimental results on challenging datasets demonstrate the effectiveness of our method, outperforming conventional single-view approaches.
[152] Lightweight CycleGAN Models for Cross-Modality Image Transformation and Experimental Quality Assessment in Fluorescence Microscopy
Mohammad Soltaninezhad, Yashar Rouzbahani, Jhonatan Contreras, Rohan Chippalkatti, Daniel Kwaku Abankwa, Christian Eggeling, Thomas Bocklitz
Main category: cs.CV
TL;DR: A lightweight CycleGAN for fluorescence microscopy modality transfer that reduces parameters from 41.8M to ~9K while maintaining performance, and serves as a diagnostic tool for image quality assessment.
Details
Motivation: To address the computational cost and environmental impact of deep learning models in scientific applications, particularly for unpaired dataset modality transfer in fluorescence microscopy.
Method: Replaced traditional channel-doubling strategy in U-Net-based generator with fixed channel approach, creating a lightweight CycleGAN architecture for confocal to super-resolution STED/deconvolved STED transfer.
Result: Achieved drastic parameter reduction from 41.8 million to approximately nine thousand, with superior performance, faster training, and lower memory usage.
Conclusion: The model serves as both a modality transfer tool and a practical diagnostic tool for validating experimental accuracy and image fidelity in microscopy workflows.
Abstract: Lightweight deep learning models offer substantial reductions in computational cost and environmental impact, making them crucial for scientific applications. We present a lightweight CycleGAN for modality transfer in fluorescence microscopy (confocal to super-resolution STED/deconvolved STED), addressing the common challenge of unpaired datasets. By replacing the traditional channel-doubling strategy in the U-Net-based generator with a fixed channel approach, we drastically reduce trainable parameters from 41.8 million to approximately nine thousand, achieving superior performance with faster training and lower memory usage. We also introduce the GAN as a diagnostic tool for experimental and labeling quality. When trained on high-quality images, the GAN learns the characteristics of optimal imaging; deviations between its generated outputs and new experimental images can reveal issues such as photobleaching, artifacts, or inaccurate labeling. This establishes the model as a practical tool for validating experimental accuracy and image fidelity in microscopy workflows.
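The parameter savings from a fixed-channel generator are easy to verify: compare a channel-doubling convolution stack with a fixed-width one (both grossly simplified relative to the paper's U-Net).

```python
import torch.nn as nn

def conv_stack_params(channels_per_level) -> int:
    """Parameter count of a simple 3x3 conv stack starting from 1 channel."""
    layers, c_in = [], 1
    for c_out in channels_per_level:
        layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU()]
        c_in = c_out
    return sum(p.numel() for p in nn.Sequential(*layers).parameters())

print(conv_stack_params([64, 128, 256, 512]))  # channel doubling: ~1.5M params
print(conv_stack_params([8, 8, 8, 8]))         # fixed width: ~1.8K params
```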
[153] Standardization for improved Spatio-Temporal Image Fusion
Harkaitz Goyena, Peter M. Atkinson, Unai Pérez-Goya, M. Dolores Ugarte
Main category: cs.CV
TL;DR: Two standardization methods for STIF: traditional upscaling and ABSIS sharpening. ABSIS significantly improves USTFIP accuracy by up to 49.46% (spectral) and 78.40% (spatial).
Details
Motivation: To facilitate STIF methods by standardizing images with different spatial/spectral resolutions from different sensors.
Method: Two approaches: 1) Traditional upscaling of fine-resolution images, 2) ABSIS sharpening that blends fine-resolution series features with coarse-resolution distinctive attributes.
Result: Both methods significantly increase USTFIP accuracy, with ABSIS improving the spectral and spatial accuracy of the fused images by up to 49.46% and 78.40%, respectively.
Conclusion: Standardization approaches, particularly ABSIS sharpening, substantially enhance STIF method performance and accuracy.
Abstract: Spatio-Temporal Image Fusion (STIF) methods usually require sets of images with matching spatial and spectral resolutions captured by different sensors. To facilitate the application of STIF methods, we propose and compare two different standardization approaches. The first method is based on traditional upscaling of the fine-resolution images. The second method is a sharpening approach called Anomaly Based Satellite Image Standardization (ABSIS) that blends the overall features found in the fine-resolution image series with the distinctive attributes of a specific coarse-resolution image to produce images that more closely resemble the outcome of aggregating the fine-resolution images. Both methods produce a significant increase in accuracy of the Unpaired Spatio Temporal Fusion of Image Patches (USTFIP) STIF method, with the sharpening approach increasing the spectral and spatial accuracies of the fused images by up to 49.46% and 78.40%, respectively.
[154] FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification
Zhen Sun, Lei Tan, Yunhang Shen, Chengmao Cai, Xing Sun, Pingyang Dai, Liujuan Cao, Rongrong Ji
Main category: cs.CV
TL;DR: FlexiReID is a flexible multimodal person re-identification framework supporting 7 retrieval modes across 4 modalities (RGB, infrared, sketches, text) using adaptive mixture-of-experts and cross-modal query fusion.
Details
Motivation: Existing multimodal Re-ID methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment.
Method: Introduces adaptive mixture-of-experts mechanism to dynamically integrate diverse modality features and cross-modal query fusion module to enhance multimodal feature extraction.
Result: Achieves state-of-the-art performance and offers strong generalization in complex scenarios, demonstrated through extensive experiments.
Conclusion: FlexiReID provides a comprehensive solution for flexible multimodal person re-identification with superior performance and practical applicability.
Abstract: Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: RGB, infrared, sketches, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.
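A minimal sketch of the adaptive mixture-of-experts fusion described above, assuming one expert per modality and a softmax gate; the dimensions and gating design are illustrative, not FlexiReID's exact modules.

```python
import torch
import torch.nn as nn

class AdaptiveMoEFusion(nn.Module):
    def __init__(self, dim=256, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, feats):  # feats: (batch, n_experts, dim), one row per modality
        pooled = feats.mean(dim=1)                    # summary used by the gate
        weights = self.gate(pooled).softmax(dim=-1)   # (batch, n_experts)
        expert_out = torch.stack(
            [e(feats[:, i]) for i, e in enumerate(self.experts)], dim=1)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)

fused = AdaptiveMoEFusion()(torch.randn(2, 4, 256))
print(fused.shape)  # torch.Size([2, 256])
```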
[155] Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection
Andrei-Timotei Ardelean, Patrick Rückbeil, Tim Weyrich
Main category: cs.CV
TL;DR: QFCA is a real-time method for zero-shot anomaly localization in textures that achieves 10x speedup over existing methods with minimal accuracy loss through quantized feature correspondence analysis and PCA preprocessing.
Details
Motivation: Existing anomaly localization methods have high running times that make them impractical for real-world deployment in scenarios like assembly line monitoring.
Method: Proposes QFCA - a quantized version of feature correspondence analysis that works on histograms of quantized values, plus a PCA-based feature preprocessing step to enhance contrast between normal and anomalous features.
Result: Achieves 10x speedup with little to no loss in accuracy, and improves detection precision on complex textures through the PCA preprocessing.
Conclusion: QFCA compares favorably with existing methods and enables practical real-time deployment for texture anomaly localization.
Abstract: Zero-shot anomaly localization is a rising field in computer vision research, with important progress in recent years. This work focuses on the problem of detecting and localizing anomalies in textures, where anomalies can be defined as the regions that deviate from the overall statistics, violating the stationarity assumption. The main limitation of existing methods is their high running time, making them impractical for deployment in real-world scenarios, such as assembly line monitoring. We propose a real-time method, named QFCA, which implements a quantized version of the feature correspondence analysis (FCA) algorithm. By carefully adapting the patch statistics comparison to work on histograms of quantized values, we obtain a 10x speedup with little to no loss in accuracy. Moreover, we introduce a feature preprocessing step based on principal component analysis, which enhances the contrast between normal and anomalous features, improving the detection precision on complex textures. Our method is thoroughly evaluated against prior art, comparing favorably with existing methods. Project page: https://reality.tf.fau.de/pub/ardelean2025quantized.html
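The core quantization trick can be illustrated in a few lines of NumPy: per-pixel feature values are binned so that patch statistics reduce to integer histograms that are cheap to compare. The bin count and the L1 histogram distance are assumptions for illustration; FCA's actual statistics comparison differs in detail.

```python
import numpy as np

def quantize(features, n_bins=16):
    # Map continuous feature values to integer bin indices.
    lo, hi = features.min(), features.max()
    return np.clip(((features - lo) / (hi - lo + 1e-8) * n_bins).astype(int),
                   0, n_bins - 1)

def patch_histogram(q, n_bins=16):
    # Normalized histogram summarizing a patch's quantized statistics.
    return np.bincount(q.ravel(), minlength=n_bins) / q.size

rng = np.random.default_rng(0)
feat = rng.normal(size=(64, 64))       # stand-in for a 1-D feature map
q = quantize(feat)
ref = patch_histogram(q[:32, :32])     # reference patch statistics
test = patch_histogram(q[32:, 32:])
print("L1 histogram distance:", np.abs(ref - test).sum())
```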
[156] Lightweight Data-Free Denoising for Detail-Preserving Biomedical Image Restoration
Tomáš Chobola, Julia A. Schnabel, Tingying Peng
Main category: cs.CV
TL;DR: Noise2Detail (N2D) is an ultra-lightweight self-supervised denoising model that achieves fast inference and high-quality restoration without clean reference images, making it ideal for biomedical imaging where clean training data is scarce.
Details
Motivation: Current self-supervised denoising methods face computational and memory constraints, forcing trade-offs between speed and quality. Real-world applications, especially in biomedical imaging, need efficient solutions that don't require clean training data.
Method: Built on Noise2Noise framework, N2D uses a multistage denoising pipeline that disrupts noise spatial correlations to create smooth intermediate structures, then refines details directly from the noisy input without clean references or explicit noise modeling.
Result: Extensive testing shows N2D outperforms existing dataset-free techniques while using significantly fewer computational resources, achieving both fast denoising and high-quality image restoration.
Conclusion: N2D provides an efficient, low-cost, data-free denoising solution that overcomes challenges in biomedical imaging where clean training data is scarce and fast inference is needed for practical applications.
Abstract: Current self-supervised denoising techniques achieve impressive results, yet their real-world application is frequently constrained by substantial computational and memory demands, necessitating a compromise between inference speed and reconstruction quality. In this paper, we present an ultra-lightweight model that addresses this challenge, achieving both fast denoising and high-quality image restoration. Built upon the Noise2Noise training framework, which removes the reliance on clean reference images or explicit noise modeling, we introduce an innovative multistage denoising pipeline named Noise2Detail (N2D). During inference, this approach disrupts the spatial correlations of noise patterns to produce intermediate smooth structures, which are subsequently refined to recapture fine details directly from the noisy input. Extensive testing reveals that Noise2Detail surpasses existing dataset-free techniques in performance, while requiring only a fraction of the computational resources. This combination of efficiency, low computational cost, and a data-free approach makes it a valuable tool for biomedical imaging, overcoming the challenge of scarce clean training data (due to rare and complex imaging modalities) while enabling fast inference for practical use.
[157] Deep Learning Based Domain Adaptation Methods in Remote Sensing: A Comprehensive Survey
Shuchang Lyu, Qi Zhao, Zheng Zhou, Meng Li, You Zhou, Dingding Yao, Guangliang Cheng, Huiyu Zhou, Zhenwei Shi
Main category: cs.CV
TL;DR: This paper provides a comprehensive survey of deep learning-based domain adaptation methods in remote sensing, covering methodology taxonomy, datasets, performance analysis, and future research directions.
Details
Motivation: Domain adaptation is crucial for remote sensing applications due to data distribution differences caused by variations in sensors, geographical landscapes, and environmental conditions. Deep learning has emerged as a powerful tool for cross-domain knowledge transfer in this field.
Method: The survey organizes existing algorithms from multiple perspectives including task categorization, input mode, supervision paradigm, and algorithmic granularity. It introduces preliminary knowledge, provides structured taxonomy, and reviews widely used datasets.
Result: The paper presents a systematic overview of state-of-the-art methods and their performance, offering a more comprehensive understanding compared to previous surveys that focused on limited subfields.
Conclusion: This survey serves to inspire the research community, foster understanding, and guide future work in domain adaptation for remote sensing by addressing broader range of tasks and providing organized taxonomy.
Abstract: Domain adaptation is a crucial and increasingly important task in remote sensing, aiming to transfer knowledge from a source domain to a differently distributed target domain. It has broad applications across various real-world scenarios, including remote sensing element interpretation, ecological environment monitoring, and urban/rural planning. However, domain adaptation in remote sensing poses significant challenges due to differences in data, such as variations in ground sampling distance, imaging modes from various sensors, geographical landscapes, and environmental conditions. In recent years, deep learning has emerged as a powerful tool for feature representation and cross-domain knowledge transfer, leading to widespread adoption in remote sensing tasks. In this paper, we present a comprehensive survey of significant advancements in deep learning based domain adaptation for remote sensing. We first introduce the preliminary knowledge to clarify key concepts, mathematical notations, and the taxonomy of methodologies. We then organize existing algorithms from multiple perspectives, including task categorization, input mode, supervision paradigm, and algorithmic granularity, providing readers with a structured understanding of the field. Next, we review widely used datasets and summarize the performance of state-of-the-art methods to provide an overview of current progress. We also identify open challenges and potential directions to guide future research in domain adaptation for remote sensing. Compared to previous surveys, this work addresses a broader range of domain adaptation tasks in remote sensing, rather than concentrating on a few subfields. It also presents a systematic taxonomy, providing a more comprehensive and organized understanding of the field. As a whole, this survey can inspire the research community, foster understanding, and guide future work in the field.
[158] Valeo Near-Field: a novel dataset for pedestrian intent detection
Antonyo Musabini, Rachid Benmokhtar, Jagdish Bhanushali, Victor Galizzi, Bertrand Luvison, Xavier Perrotton
Main category: cs.CV
TL;DR: A novel multi-modal dataset for pedestrian intention detection featuring synchronized fisheye cameras, lidar, ultrasonic sensors, and motion capture data with detailed 3D body pose annotations and pedestrian positions.
Details
Motivation: To address real-world challenges in pedestrian detection and intention prediction, including sensor occlusions, dynamic environments, and hardware constraints in near-field vehicle scenarios.
Method: Created a comprehensive dataset with synchronized multi-modal data collection across diverse scenarios, including detailed 3D body joint annotations from motion capture and accurate 3D pedestrian positions from lidar data.
Result: Released a portion of the dataset with benchmark suite and baseline performance metrics using custom neural network architectures, providing evaluation metrics for accuracy, efficiency, and scalability on embedded systems.
Conclusion: The dataset serves as a foundation for advancing pedestrian detection, 3D pose estimation, and trajectory/intention prediction algorithms, with provided baseline metrics and future research directions to encourage adoption and enhancement.
Abstract: This paper presents a novel dataset aimed at detecting pedestrians’ intentions as they approach an ego-vehicle. The dataset comprises synchronized multi-modal data, including fisheye camera feeds, lidar laser scans, ultrasonic sensor readings, and motion capture-based 3D body poses, collected across diverse real-world scenarios. Key contributions include detailed annotations of 3D body joint positions synchronized with fisheye camera images, as well as accurate 3D pedestrian positions extracted from lidar data, facilitating robust benchmarking for perception algorithms. We release a portion of the dataset along with a comprehensive benchmark suite, featuring evaluation metrics for accuracy, efficiency, and scalability on embedded systems. By addressing real-world challenges such as sensor occlusions, dynamic environments, and hardware constraints, this dataset offers a unique resource for developing and evaluating state-of-the-art algorithms in pedestrian detection, 3D pose estimation and 4D trajectory and intention prediction. Additionally, we provide baseline performance metrics using custom neural network architectures and suggest future research directions to encourage the adoption and enhancement of the dataset. This work aims to serve as a foundation for researchers seeking to advance the capabilities of intelligent vehicles in near-field scenarios.
[159] Uncertainty-Aware Extreme Point Tracing for Weakly Supervised Ultrasound Image Segmentation
Lei Shi, Gang Li, Junxing Zhang
Main category: cs.CV
TL;DR: A weakly supervised medical image segmentation method using only four extreme points as annotation, leveraging SAM2 for initial pseudo labels and refining them with uncertainty-aware techniques to achieve performance comparable to fully supervised methods.
Details
Motivation: To reduce the high annotation cost of pixel-level labels in medical image segmentation while maintaining segmentation quality.
Method: Uses extreme points to create bounding boxes for SAM2 prompting, then refines pseudo labels with an enhanced FGEPM algorithm incorporating Monte Carlo dropout uncertainty, plus a dual-branch USC loss and a box alignment loss for spatial consistency.
Result: Achieves performance comparable to and even surpassing fully supervised methods on BUSI and UNS ultrasound datasets with significantly reduced annotation cost.
Conclusion: The proposed weakly supervised framework is effective and practical for ultrasound image segmentation, offering a viable alternative to costly fully supervised approaches.
Abstract: Automatic medical image segmentation is a fundamental step in computer-aided diagnosis, yet fully supervised approaches demand extensive pixel-level annotations that are costly and time-consuming. To alleviate this burden, we propose a weakly supervised segmentation framework that leverages only four extreme points as annotation. Specifically, bounding boxes derived from the extreme points are used as prompts for the Segment Anything Model 2 (SAM2) to generate reliable initial pseudo labels. These pseudo labels are progressively refined by an enhanced Feature-Guided Extreme Point Masking (FGEPM) algorithm, which incorporates Monte Carlo dropout-based uncertainty estimation to construct a unified gradient uncertainty cost map for boundary tracing. Furthermore, a dual-branch Uncertainty-aware Scale Consistency (USC) loss and a box alignment loss are introduced to ensure spatial consistency and precise boundary alignment during training. Extensive experiments on two public ultrasound datasets, BUSI and UNS, demonstrate that our method achieves performance comparable to, and even surpassing, fully supervised counterparts while significantly reducing annotation cost. These results validate the effectiveness and practicality of the proposed weakly supervised framework for ultrasound image segmentation.
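The annotation-to-prompt step is simple enough to sketch: the four extreme points bound the object, so their min/max coordinates give a box prompt for SAM2. The predictor call in the comment mirrors a segment-anything-style API and is an assumption about the exact interface.

```python
import numpy as np

def extreme_points_to_box(points):
    """points: (4, 2) array of (x, y) extreme points -> [x0, y0, x1, y1]."""
    pts = np.asarray(points)
    return np.array([pts[:, 0].min(), pts[:, 1].min(),
                     pts[:, 0].max(), pts[:, 1].max()])

box = extreme_points_to_box([(12, 40), (88, 43), (50, 10), (52, 75)])
print(box)  # [12 10 88 75]
# Hypothetical downstream use with a SAM2-style predictor:
# masks, scores, _ = sam2_predictor.predict(box=box, multimask_output=False)
```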
[160] Towards Label-Free Brain Tumor Segmentation: Unsupervised Learning with Multimodal MRI
Gerard Comas-Quiles, Carles Garcia-Cabrera, Julia Dietlmeier, Noel E. O’Connor, Ferran Marques
Main category: cs.CV
TL;DR: Proposes MViT-AE, an unsupervised multimodal vision transformer autoencoder for brain tumor detection using only healthy brain MRIs, achieving clinically meaningful tumor localization without manual labels.
Details
Motivation: Addresses limitations of supervised learning for brain tumor segmentation when annotated datasets are limited, costly, or inconsistent, providing a scalable alternative for neuroimaging workflows.
Method: Uses multimodal vision transformer autoencoder trained on healthy brain MRIs with early-late fusion strategy across MRI sequences, plus SAM-based post-processing for contour refinement.
Result: Achieves lesion-wise Dice scores of 0.437 (Whole Tumor), 0.316 (Tumor Core), 0.350 (Enhancing Tumor) on test set, and 89.4% anomaly detection rate on validation set.
Conclusion: Transformer-based unsupervised models show potential as scalable, label-efficient tools for neuro-oncological imaging despite challenges in detecting small or non-enhancing lesions.
Abstract: Unsupervised anomaly detection (UAD) presents a complementary alternative to supervised learning for brain tumor segmentation in magnetic resonance imaging (MRI), particularly when annotated datasets are limited, costly, or inconsistent. In this work, we propose a novel Multimodal Vision Transformer Autoencoder (MViT-AE) trained exclusively on healthy brain MRIs to detect and localize tumors via reconstruction-based error maps. This unsupervised paradigm enables segmentation without reliance on manual labels, addressing a key scalability bottleneck in neuroimaging workflows. Our method is evaluated on the BraTS-GoAT 2025 Lighthouse dataset, which includes various types of tumors such as gliomas, meningiomas, and pediatric brain tumors. To enhance performance, we introduce a multimodal early-late fusion strategy that leverages complementary information across multiple MRI sequences, and a post-processing pipeline that integrates the Segment Anything Model (SAM) to refine predicted tumor contours. Despite the known challenges of UAD, particularly in detecting small or non-enhancing lesions, our method achieves clinically meaningful tumor localization, with a lesion-wise Dice Similarity Coefficient of 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor) on the test set, and an anomaly detection rate of 89.4% on the validation set. These findings highlight the potential of transformer-based unsupervised models to serve as scalable, label-efficient tools for neuro-oncological imaging.
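A minimal sketch of reconstruction-based anomaly mapping, the mechanism underlying MViT-AE: an autoencoder trained only on healthy scans should reconstruct tumors poorly, so the per-voxel error highlights them. The autoencoder stand-in and the threshold below are placeholders.

```python
import torch

def anomaly_map(autoencoder, volume, threshold=0.5):
    """volume: (batch, channels, H, W) MRI slices, channels = modalities."""
    with torch.no_grad():
        recon = autoencoder(volume)
    err = (volume - recon).abs().mean(dim=1, keepdim=True)  # fuse modalities
    err = (err - err.min()) / (err.max() - err.min() + 1e-8)
    return err, (err > threshold)                           # map + binary mask

# Identity stand-in for the trained autoencoder, just to show the shapes:
err, mask = anomaly_map(torch.nn.Identity(), torch.rand(1, 4, 128, 128))
print(err.shape, mask.float().mean().item())
```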
[161] Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis
Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, Junjun He
Main category: cs.CV
TL;DR: UniMedVL is a unified multimodal medical AI model that simultaneously handles both image understanding and generation tasks within a single architecture, addressing the gap between separate medical image interpretation and generation systems.
Details
Motivation: Existing medical AI systems disrupt the unified diagnostic process by separating image understanding models (which interpret but cannot generate) from image generation models (which synthesize but cannot explain), creating gaps in multimodal capabilities.
Method: Proposed a multi-level framework using Observation-Knowledge-Analysis (OKA) paradigm: created UniMed-5M dataset with 5.6M multimodal samples, used Progressive Curriculum Learning for medical knowledge integration, and developed UniMedVL as the first unified multimodal model for simultaneous image understanding and generation.
Result: UniMedVL achieves superior performance on five medical image understanding benchmarks and matches specialized models in generation quality across eight medical imaging modalities. The unified architecture enables bidirectional knowledge sharing where generation tasks enhance visual understanding features.
Conclusion: Integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks, demonstrating that unified multimodal architectures can overcome limitations of specialized single-task models.
Abstract: Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.
[162] Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen
Main category: cs.CV
TL;DR: Ditto is a framework that generates large-scale, high-quality video editing training data by combining image editing diversity with video generation, using efficient models and AI-driven quality control to create Ditto-1M dataset, resulting in state-of-the-art video editing model Editto.
Details
Motivation: The scarcity of large-scale, high-quality training data is severely hampering progress in instruction-based video editing, which promises to democratize content creation.
Method: Ditto features a novel data generation pipeline that fuses image editor diversity with in-context video generation, uses efficient distilled model architecture with temporal enhancer for cost-quality trade-off, and employs intelligent agent for instruction crafting and quality control at scale.
Result: The framework generated Ditto-1M dataset with one million high-fidelity video editing examples using over 12,000 GPU-days. Training Editto model on this dataset with curriculum learning achieved superior instruction-following ability and state-of-the-art performance.
Conclusion: Ditto framework successfully addresses the data scarcity challenge in instruction-based video editing and establishes new state-of-the-art performance through scalable, high-quality data generation and efficient model training.
Abstract: Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
[163] SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior
Haoran Wang, Bo Zhao, Jinghui Wang, Hanzhang Wang, Huan Yang, Wei Ji, Hao Liu, Xinyan Xiao
Main category: cs.CV
TL;DR: SEGA introduces a stepwise evolution paradigm for content-aware layout generation, using a hierarchical reasoning framework with coarse-to-fine strategy and layout design principles to generate harmonious layouts with background images.
Details
Motivation: Existing single-step reasoning methods fail when faced with complex element layout planning due to lack of feedback-based self-correction mechanisms, leading to high failure rates.
Method: SEGA employs a hierarchical reasoning framework: coarse-level module estimates rough layout planning, then refining module performs fine-level reasoning. Layout design principles are incorporated as prior knowledge.
Result: Achieves state-of-the-art results on multiple benchmark datasets, demonstrating effectiveness of the approach.
Conclusion: The stepwise evolution paradigm with hierarchical reasoning and layout design principles significantly improves content-aware layout generation performance.
Abstract: In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. Existing methods usually deal with this task with a single-step reasoning framework. The lack of a feedback-based self-correction mechanism leads to their failure rates significantly increasing when faced with complex element layout planning. To address this challenge, we introduce SEGA, a novel Stepwise Evolution Paradigm for Content-Aware Layout Generation. Inspired by the systematic mode of human thinking, SEGA employs a hierarchical reasoning framework with a coarse-to-fine strategy: first, a coarse-level module roughly estimates the layout planning results; then, another refining module performs fine-level reasoning regarding the coarse planning results. Furthermore, we incorporate layout design principles as prior knowledge into the model to enhance its layout planning ability. In addition, we present GenPoster-100K, a new large-scale poster dataset with rich meta-information annotation. The experiments demonstrate the effectiveness of our approach by achieving state-of-the-art results on multiple benchmark datasets. Our project page is at: https://brucew91.github.io/SEGA.github.io/
[164] NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation
Yitong Sun, Yao Huang, Ruochen Zhang, Huanran Chen, Shouwei Ruan, Ranjie Duan, Xingxing Wei
Main category: cs.CV
TL;DR: NDM is a noise-driven framework that detects and mitigates implicit sexual content in text-to-image generation while preserving model quality, using noise-based detection and adaptive negative guidance.
Details
Motivation: Text-to-image models are vulnerable to generating inappropriate content from implicit sexual prompts that bypass existing detection methods, raising ethical concerns; mitigation must not compromise generative quality.
Method: Proposes noise-based detection using early-stage predicted noise separability and noise-enhanced adaptive negative guidance that optimizes initial noise by suppressing prominent regions’ attention.
Result: NDM demonstrates superior performance over state-of-the-art methods (SLD, UCE, RECE) on both natural and adversarial datasets, achieving high accuracy in detecting and mitigating implicit malicious content.
Conclusion: NDM effectively addresses the challenge of implicit sexual content in T2I generation through noise-driven detection and mitigation while maintaining the model’s original generative capabilities.
Abstract: Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. However, existing detection methods are primarily designed to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model’s generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which can detect and mitigate implicit malicious intention in T2I generation while preserving the model’s original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that can identify malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that optimizes the initial noise by suppressing the prominent region’s attention, thereby enhancing the effectiveness of adaptive negative guidance for sexual mitigation. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating its superior performance over existing SOTA methods, including SLD, UCE, and RECE. Code and resources are available at https://github.com/lorraine021/NDM.
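As a rough illustration of the noise-based detection idea (not the paper's implementation), early-step predicted noise can be pooled into a descriptor and scored by a lightweight probe; the pooling statistics and logistic head here are assumptions.

```python
import torch
import torch.nn as nn

def noise_descriptor(pred_noise):
    """pred_noise: (batch, C, H, W) epsilon predicted at an early timestep."""
    return torch.cat([pred_noise.mean(dim=(2, 3)),
                      pred_noise.std(dim=(2, 3))], dim=1)  # (batch, 2C)

probe = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())       # 2C = 8 for C = 4
eps = torch.randn(2, 4, 64, 64)                            # stand-in epsilon
print(probe(noise_descriptor(eps)))                        # P(malicious)
```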
[165] Semantic segmentation with coarse annotations
Jort de Jong, Mike Holenderski
Main category: cs.CV
TL;DR: A regularization method for semantic segmentation with coarse annotations that encourages segmented pixels to form SLIC-superpixels, improving boundary alignment in encoder-decoder networks.
Details
Motivation: Fine pixel-level annotations are expensive and difficult to obtain, while coarse annotations are more accessible but lead to poor boundary alignment in segmentation models.
Method: Proposes a regularization method for encoder-decoder architectures with superpixel upsampling that encourages segmented pixels to form SLIC-superpixels based on color and position, independent of segmentation annotations.
Result: Applied to FCN-16 architecture and evaluated on SUIM, Cityscapes, and PanNuke datasets, showing significant improvement in boundary recall compared to state-of-the-art models when trained on coarse annotations.
Conclusion: The proposed regularization method effectively improves boundary alignment in semantic segmentation models trained with coarse annotations by leveraging SLIC-superpixel constraints.
Abstract: Semantic segmentation is the task of classifying each pixel in an image. Training a segmentation model achieves best results using annotated images, where each pixel is annotated with the corresponding class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g., by roughly annotating pixels in an image, leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder-decoder architecture with superpixel-based upsampling. It encourages the segmented pixels in the decoded image to be SLIC-superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to the FCN-16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke datasets. It is shown that the boundary recall improves significantly compared to state-of-the-art models when trained on coarse annotations.
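A minimal sketch of a superpixel-consistency regularizer in the spirit of the paper: SLIC superpixels are computed from color and position alone, and the penalty discourages the predicted class probabilities from varying within a superpixel. The variance-based penalty is an illustrative choice, not necessarily the paper's exact loss.

```python
import numpy as np
from skimage.segmentation import slic

def slic_consistency_penalty(image, probs, n_segments=100):
    """image: (H, W, 3) float in [0,1]; probs: (H, W, C) softmax output."""
    segments = slic(image, n_segments=n_segments, start_label=0)
    penalty = 0.0
    for s in np.unique(segments):
        region = probs[segments == s]          # (n_pixels, C)
        penalty += region.var(axis=0).sum()    # low when region is uniform
    return penalty / (segments.max() + 1)

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
probs = rng.dirichlet(np.ones(5), size=(64, 64))
print(slic_consistency_penalty(img, probs))
```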
[166] Controlling the image generation process with parametric activation functions
Ilia Pavlov
Main category: cs.CV
TL;DR: A system for interactive understanding and control of generative models by replacing activation functions with parametric ones.
Details
Motivation: To develop interpretable tools that leverage direct interaction with internal mechanisms of image generative models, which have received little attention despite increasing model fidelity and ubiquity.
Method: Allow users to replace activation functions with parametric functions and set their parameters, providing an alternative approach to control network output. Demonstrated on StyleGAN2 (FFHQ) and BigGAN (ImageNet).
Result: Users can develop better understanding of models through interaction and experimentation with parametric activation functions.
Conclusion: The system enables interpretable control over generative models by allowing interactive modification of activation functions.
Abstract: As image generative models continue to increase not only in their fidelity but also in their ubiquity, the development of tools that leverage direct interaction with their internal mechanisms in an interpretable way has received little attention. In this work, we introduce a system that allows users to develop a better understanding of the model through interaction and experimentation. By giving users the ability to replace activation functions of a generative network with parametric ones, and a way to set the parameters of these functions, we introduce an alternative approach to control the network’s output. We demonstrate the use of our method on StyleGAN2 and BigGAN networks trained on FFHQ and ImageNet, respectively.
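A minimal PyTorch sketch of the core mechanism, swapping a network's activations for user-controllable parametric ones; the sine-modulated form and the recursive replacement utility are illustrative assumptions about how such an interface could look.

```python
import torch
import torch.nn as nn

class ParametricAct(nn.Module):
    def __init__(self, alpha=1.0, freq=0.0):
        super().__init__()
        self.alpha, self.freq = alpha, freq   # user-set knobs

    def forward(self, x):
        return self.alpha * torch.relu(x) + self.freq * torch.sin(x)

def replace_activations(model, alpha, freq):
    # Recursively swap ReLU-family activations for the parametric variant.
    for name, child in model.named_children():
        if isinstance(child, (nn.ReLU, nn.LeakyReLU)):
            setattr(model, name, ParametricAct(alpha, freq))
        else:
            replace_activations(child, alpha, freq)

net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8), nn.ReLU())
replace_activations(net, alpha=0.8, freq=0.3)
print(net)
```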
[167] QSilk: Micrograin Stabilization and Adaptive Quantile Clipping for Detail-Friendly Latent Diffusion
Denis Rychkovskiy
Main category: cs.CV
TL;DR: QSilk is a lightweight stabilization layer for latent diffusion models that improves high-frequency fidelity and suppresses activation spikes through micro clamping and adaptive quantile clipping.
Details
Motivation: To address issues with high-frequency fidelity and rare activation spikes in latent diffusion models, particularly at low step counts and ultra-high resolutions, without requiring training or fine-tuning.
Method: Combines per-sample micro clamping to gently limit extreme values and Adaptive Quantile Clip (AQClip) that adapts value corridors per region using local structure statistics or attention entropy guidance.
Result: Yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead, showing consistent improvements across SD/SDXL backbones and synergy with CFG/Rescale.
Conclusion: QSilk provides an effective stabilization solution that improves latent diffusion model performance without training requirements, enabling higher guidance without artifacts.
Abstract: We present QSilk, a lightweight, always-on stabilization layer for latent diffusion that improves high-frequency fidelity while suppressing rare activation spikes. QSilk combines (i) a per-sample micro clamp that gently limits extreme values without washing out texture, and (ii) Adaptive Quantile Clip (AQClip), which adapts the allowed value corridor per region. AQClip can operate in a proxy mode using local structure statistics or in an attention entropy guided mode (model confidence). Integrated into the CADE 2.5 rendering pipeline, QSilk yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead. It requires no training or fine-tuning and exposes minimal user controls. We report consistent qualitative improvements across SD/SDXL backbones and show synergy with CFG/Rescale, enabling slightly higher guidance without artifacts.
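A minimal sketch of the two ingredients, assuming simple forms for both: a gentle per-sample clamp at extreme quantiles, and a region-adaptive corridor derived from local mean/std statistics. Quantile levels, window size, and corridor width are illustrative, not the CADE 2.5 defaults.

```python
import torch
import torch.nn.functional as F

def micro_clamp(latent, q=0.999):
    """Softly limit extreme activations per sample."""
    flat = latent.flatten(1)
    hi = torch.quantile(flat, q, dim=1).view(-1, 1, 1, 1)
    lo = torch.quantile(flat, 1 - q, dim=1).view(-1, 1, 1, 1)
    return latent.clamp(min=lo, max=hi)

def adaptive_quantile_clip(latent, window=7, k=3.0):
    """Clip each value to mean +/- k*std of its local neighborhood."""
    mean = F.avg_pool2d(latent, window, stride=1, padding=window // 2)
    var = F.avg_pool2d(latent**2, window, stride=1, padding=window // 2) - mean**2
    std = var.clamp_min(0).sqrt()
    return latent.clamp(min=mean - k * std, max=mean + k * std)

x = torch.randn(2, 4, 32, 32) * 1.5
x[0, 0, 0, 0] = 50.0                      # a rare activation spike
print(adaptive_quantile_clip(micro_clamp(x)).abs().max())
```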
[168] Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model
Gaoxiang Huang, Songning Lai, Yutao Yue
Main category: cs.CV
TL;DR: LDCBM improves Concept Bottleneck Models by automatically grouping visual features into meaningful components without region annotation, achieving better concept alignment and performance.
Details
Motivation: Existing CBMs suffer from input-to-concept mapping bias and limited controllability, which restrict their practical value and undermine the reliability of concept-based methods.
Method: Introduces a lightweight Disentangled Concept Bottleneck Model with filter grouping loss and joint concept supervision to automatically group visual features into semantically meaningful components.
Result: Experiments on three diverse datasets show LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance.
Conclusion: By grounding concepts in visual evidence, LDCBM overcomes fundamental limitations of prior models and enhances the reliability of interpretable AI.
Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. However, existing CBMs often suffer from input-to-concept mapping bias and limited controllability, which restrict their practical value and undermine the reliability of concept-based methods. We propose a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components without region annotation. By introducing a filter grouping loss and joint concept supervision, our method improves the alignment between visual patterns and concepts, enabling more transparent and robust decision-making. Notably, experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance. By grounding concepts in visual evidence, our method overcomes a fundamental limitation of prior models and enhances the reliability of interpretable AI.
[169] ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection
Haowei Zhu, Tianxiang Pan, Rui Qin, Jun-Hai Yong, Bin Wang
Main category: cs.CV
TL;DR: ReCon is a novel data augmentation framework that enhances structure-controllable generative models for object detection by integrating region-guided rectification and region-aligned cross-attention during diffusion sampling.
Details
Motivation: Large-scale annotated datasets are costly and time-consuming to obtain, while current generative approaches for data augmentation suffer from content-position mismatches and semantic leakage and require complex post-processing or extensive fine-tuning.
Method: ReCon integrates region-guided rectification using feedback from pre-trained perception models to fix misgenerated regions during diffusion sampling, and proposes region-aligned cross-attention to ensure spatial-semantic alignment between image regions and textual cues.
Result: Extensive experiments show ReCon substantially improves generated data quality and trainability, achieving consistent performance gains across various datasets, backbone architectures, and data scales.
Conclusion: ReCon effectively overcomes limitations of current generative approaches for data augmentation in object detection by enhancing semantic consistency and image fidelity through integrated rectification and alignment mechanisms.
Abstract: The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content-position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions during sampling. We further propose region-aligned cross-attention to enforce spatial-semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improves the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales. Our code is available at https://github.com/haoweiz23/ReCon.
[170] ERNet: Efficient Non-Rigid Registration Network for Point Sequences
Guangzhao He, Yuxi Xiao, Zhen Xu, Xiaowei Zhou, Sida Peng
Main category: cs.CV
TL;DR: ERNet: An efficient feed-forward model for non-rigid point cloud registration that handles noisy/partial inputs using deformation graphs and achieves state-of-the-art performance with 4x speedup.
Details
Motivation: Address challenges in non-rigid point cloud registration: local minima due to non-convex objectives under noisy/partial inputs, and error accumulation over long sequences causing tracking failures.
Method: Two-stage pipeline: first estimates frame-wise coarse graph nodes for robust initialization, then refines their trajectories over time in sliding-window fashion using temporal information.
Result: Outperforms previous state-of-the-art on DeformingThings4D and D-FAUST datasets, achieves more than 4x speedup compared to previous best method.
Conclusion: ERNet provides accurate, robust and efficient sequential registration for non-rigidly deforming point clouds through scalable data-driven approach and effective temporal modeling.
Abstract: Registering an object shape to a sequence of point clouds undergoing non-rigid deformation is a long-standing challenge. The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. The key to our design is predicting a sequence of deformation graphs through a two-stage pipeline, which first estimates frame-wise coarse graph nodes for robust initialization, before refining their trajectories over time in a sliding-window fashion. Extensive experiments show that our proposed approach (i) outperforms previous state-of-the-art on both the DeformingThings4D and D-FAUST datasets, and (ii) achieves more than 4x speedup compared to the previous best, offering significant efficiency improvement.
[171] VISTA: A Test-Time Self-Improving Video Generation Agent
Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, Sercan Ö. Arık
Main category: cs.CV
TL;DR: VISTA is a multi-agent system that autonomously improves text-to-video generation through iterative prompt refinement, outperforming existing methods with up to 60% win rate.
Details
Motivation: Existing test-time optimization methods struggle with video generation's multi-faceted nature, and generated video quality remains critically dependent on precise user prompts.
Method: VISTA decomposes user ideas into temporal plans, identifies best videos through pairwise tournaments, critiques with specialized agents (visual, audio, contextual), and uses a reasoning agent to rewrite prompts iteratively.
Result: VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines, with human evaluators preferring VISTA outputs in 66.4% of comparisons.
Conclusion: The proposed multi-agent iterative refinement approach effectively enhances text-to-video synthesis quality and user intent alignment compared to existing methods.
Abstract: Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
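The pairwise-tournament selection step is easy to sketch; the judge below is a placeholder score comparison, whereas in VISTA the comparison would be made by a model-based critic, so its exact form here is an assumption.

```python
import random

def judge(a, b):
    """Placeholder pairwise judge: return the preferred candidate."""
    return a if a["score"] >= b["score"] else b

def tournament(candidates):
    pool = candidates[:]
    random.shuffle(pool)                  # randomize the bracket
    while len(pool) > 1:
        pool = [judge(pool[i], pool[i + 1]) if i + 1 < len(pool) else pool[i]
                for i in range(0, len(pool), 2)]
    return pool[0]

videos = [{"id": i, "score": random.random()} for i in range(5)]
print(tournament(videos)["id"])
```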
[172] Neuro-Symbolic Spatial Reasoning in Segmentation
Jiayi Lin, Jiabo Huang, Shaogang Gong
Main category: cs.CV
TL;DR: RelateSeg introduces neuro-symbolic spatial reasoning to open-vocabulary semantic segmentation by encoding spatial relations as first-order logic constraints in a neural network, achieving state-of-the-art performance.
Details
Motivation: Current vision-language model approaches for open-vocabulary semantic segmentation lack understanding of spatial relations between objects in scenes, limiting their performance on complex images with multiple objects.
Method: RelateSeg automatically extracts spatial relations (e.g., <cat, to-right-of, person>) and encodes them as first-order logic formulas using pseudo categories. Each pixel predicts both semantic and spatial pseudo categories simultaneously, with relational constraints enforced through fuzzy logic relaxation in an end-to-end deep network.
Result: Achieves state-of-the-art performance in average mIoU across four benchmark datasets, with particular advantages on images containing multiple categories. The method only introduces a single auxiliary loss function and no additional parameters.
Conclusion: Neuro-symbolic spatial reasoning is effective for open-vocabulary semantic segmentation, validating the proposed approach of combining first-order logic constraints with neural networks to improve spatial-relationally consistent segmentation.
Abstract: Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., <cat, to-right-of, person>, and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., “cat”) and a spatial pseudo category (e.g., “right of person”) simultaneously, enforcing relational constraints (e.g., a “cat” pixel must lie to the right of a “person”). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, at the cost of introducing only a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
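To see how a first-order constraint becomes differentiable, consider the implication "cat implies right-of-person": under a standard fuzzy relaxation, penalizing probability mass where the premise holds but the conclusion does not yields an auxiliary loss. This particular relaxation is one common choice and an assumption about the paper's exact formulation.

```python
import torch

def implication_penalty(p_sem, p_rel):
    """p_sem: P(pixel is 'cat'); p_rel: P(pixel is 'right of person').
    Penalize mass where the premise holds but the conclusion does not."""
    return (p_sem * (1.0 - p_rel)).mean()

p_cat = torch.sigmoid(torch.randn(4, 1, 32, 32))
p_right_of_person = torch.sigmoid(torch.randn(4, 1, 32, 32))
loss_aux = implication_penalty(p_cat, p_right_of_person)
print(loss_aux.item())
```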
[173] 3DPR: Single Image 3D Portrait Relight using Generative Priors
Pramod Rao, Abhimitra Meka, Xilong Zhou, Gereon Fox, Mallikarjun B R, Fangneng Zhan, Tim Weyrich, Bernd Bickel, Hanspeter Pfister, Wojciech Matusik, Thabo Beeler, Mohamed Elgharib, Marc Habermann, Christian Theobalt
Main category: cs.CV
TL;DR: 3DPR is an image-based relighting model that uses generative priors from multi-view OLAT images and a pre-trained generative head model to render novel, relit views of human heads from monocular portrait images.
Details
Motivation: Traditional graphics approaches for relighting human heads from monocular images are limited by model assumptions and approximations. The goal is to create more accurate and realistic relighting without explicit decomposition into geometry, material, and lighting.
Method: Uses encoder-based inversion to embed input portraits into a pre-trained generative head model’s latent space, then employs a triplane-based reflectance network trained on a new large-scale 4K OLAT dataset to synthesize high-fidelity OLAT images for image-based relighting.
Result: 3DPR outperforms previous methods in preserving identity and capturing lighting effects like specularities, self-shadows, and subsurface scattering, producing physically accurate environmental relighting results.
Conclusion: The proposed approach successfully leverages generative priors and a novel reflectance network to achieve high-quality relighting of human heads from single images, demonstrating superior performance compared to existing methods.
Abstract: Rendering novel, relit views of a human head, given a monocular portrait image as input, is an inherently underconstrained problem. The traditional graphics solution is to explicitly decompose the input image into geometry, material and lighting via differentiable rendering; but this is constrained by the multiple assumptions and approximations of the underlying models and parameterizations of these scene components. We propose 3DPR, an image-based relighting model that leverages generative priors learnt from multi-view One-Light-at-A-Time (OLAT) images captured in a light stage. We introduce a new diverse and large-scale multi-view 4K OLAT dataset of 139 subjects to learn a high-quality prior over the distribution of high-frequency face reflectance. We leverage the latent space of a pre-trained generative head model that provides a rich prior over face geometry learnt from in-the-wild image datasets. The input portrait is first embedded in the latent manifold of such a model through an encoder-based inversion process. Then a novel triplane-based reflectance network trained on our lightstage data is used to synthesize high-fidelity OLAT images to enable image-based relighting. Our reflectance network operates in the latent space of the generative head model, crucially enabling a relatively small number of lightstage images to train the reflectance model. Combining the generated OLATs according to a given HDRI environment maps yields physically accurate environmental relighting results. Through quantitative and qualitative evaluations, we demonstrate that 3DPR outperforms previous methods, particularly in preserving identity and in capturing lighting effects such as specularities, self-shadows, and subsurface scattering. Project Page: https://vcai.mpi-inf.mpg.de/projects/3dpr/
[174] Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin
Main category: cs.CV
TL;DR: Memory-SAM is a training-free, human-prompt-free pipeline that automatically generates prompts from prior cases to guide SAM2 for tongue segmentation, achieving superior performance over supervised methods without manual intervention.
Details
Motivation: Accurate tongue segmentation is crucial for TCM analysis, but supervised models require large annotated datasets and SAM-family models remain prompt-driven, creating barriers for practical deployment.
Method: Uses dense DINOv3 features and FAISS retrieval to automatically generate foreground/background point prompts from a small memory of prior cases, then guides SAM2 without manual clicks or model fine-tuning.
Result: Achieves mIoU 0.9863 on mixed test data (600 images), significantly outperforming FCN (0.8188) and detector-to-box SAM baseline (0.1839), with clear gains under real-world conditions.
Conclusion: Retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging without requiring manual intervention or model training.
Abstract: Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
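The retrieval step maps naturally onto FAISS: pooled image descriptors from the memory are indexed, and the nearest exemplar is fetched for a query. The feature dimension and cosine-similarity index below are assumptions; the subsequent distillation of the exemplar's mask into point prompts is elided.

```python
import numpy as np
import faiss

dim = 768                                  # e.g., a DINO-style feature size
rng = np.random.default_rng(0)
memory_feats = rng.random((50, dim)).astype("float32")  # stored prior cases
faiss.normalize_L2(memory_feats)

index = faiss.IndexFlatIP(dim)             # cosine similarity via inner product
index.add(memory_feats)

query = rng.random((1, dim)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 1)
print("nearest exemplar:", ids[0, 0], "similarity:", scores[0, 0])
# The exemplar's mask would next be mapped to the query via dense feature
# correspondences to produce foreground/background point prompts for SAM2.
```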
[175] BLIP3o-NEXT: Next Frontier of Native Image Generation
Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu
Main category: cs.CV
TL;DR: BLIP3o-NEXT is a fully open-source foundation model that unifies text-to-image generation and image editing in a single architecture, achieving state-of-the-art performance through an Autoregressive + Diffusion hybrid approach.
Details
Motivation: To advance the frontier of native image generation by creating a unified model for both text-to-image generation and image editing, while identifying key insights about architecture efficiency, reinforcement learning, data quality, and image editing challenges.
Method: Uses an Autoregressive + Diffusion architecture where an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, then uses these hidden states as conditioning signals for a diffusion model to generate high-fidelity images.
Result: Achieves superior performance over existing models on various text-to-image and image-editing benchmarks, demonstrating strong image generation and editing capabilities with improved coherence and realism.
Conclusion: The hybrid architecture successfully integrates autoregressive models’ reasoning strength with diffusion models’ fine-detail rendering ability, while identifying that data quality and scale remain decisive factors for model performance.
Abstract: We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
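A toy PyTorch sketch of the Autoregressive + Diffusion pattern (not the released model): an autoregressive backbone emits token logits, and its hidden states condition a denoiser through cross-attention. All sizes are invented.

```python
import torch
import torch.nn as nn

class ARBackbone(nn.Module):
    """Autoregressive token model (causal masking omitted for brevity)."""
    def __init__(self, vocab=8192, d=512, layers=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        block = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens):
        h = self.blocks(self.emb(tokens))     # (B, T, d) hidden states
        return self.head(h), h                # next-token logits + conditioning

class Denoiser(nn.Module):
    """Predicts noise on image latents, attending to AR hidden states."""
    def __init__(self, d=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, noisy_latents, cond):
        ctx, _ = self.attn(noisy_latents, cond, cond)  # cross-attend to AR states
        return self.mlp(noisy_latents + ctx)           # predicted noise

ar, dn = ARBackbone(), Denoiser()
logits, h = ar(torch.randint(0, 8192, (2, 16)))
print(logits.shape, dn(torch.randn(2, 64, 512), h).shape)
```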
[176] BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models
Kaushitha Silva, Mansitha Eashwara, Sanduni Ubayasiri, Ruwan Tennakoon, Damayanthi Herath
Main category: cs.CV
TL;DR: BiomedXPro is an evolutionary framework that uses LLMs to generate diverse, interpretable prompt pairs for biomedical diagnosis, outperforming existing methods and providing verifiable clinical alignment.
Details
Motivation: Current prompt optimization methods produce uninterpretable latent vectors or single prompts, lacking transparency and failing to capture the multi-faceted nature of clinical diagnosis, which limits trustworthiness in high-stakes medical settings.
Method: Uses evolutionary framework with LLM as both biomedical knowledge extractor and adaptive optimizer to automatically generate diverse ensemble of interpretable natural-language prompt pairs for disease diagnosis.
Result: Consistently outperforms state-of-the-art prompt-tuning methods across multiple biomedical benchmarks, especially in data-scarce few-shot settings. Shows strong semantic alignment between discovered prompts and statistically significant clinical features.
Conclusion: BiomedXPro provides verifiable basis for model predictions through diverse ensemble of interpretable prompts, representing critical step toward more trustworthy and clinically-aligned AI systems.
Abstract: The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model’s performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.
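The evolutionary loop might look like the sketch below; `llm_propose` is a hypothetical stand-in for the LLM-driven proposal step, and the fitness scorer is a toy.

```python
import random

def evolve(init_pairs, score_fn, llm_propose, generations=10, keep=8):
    """Evolve a population of (negative, positive) prompt pairs.
    score_fn(pair) -> zero-shot fitness; llm_propose(parents) -> new pairs."""
    population = list(init_pairs)
    for _ in range(generations):
        ranked = sorted(population, key=score_fn, reverse=True)
        survivors = ranked[:keep]                        # elitist selection
        population = survivors + llm_propose(survivors)  # LLM-driven offspring
    # Return an ensemble of prompt pairs rather than a single winner.
    return sorted(population, key=score_fn, reverse=True)[:keep]

# Toy demo: random fitness and a mutation that samples descriptive adjectives.
ADJ = ["irregular", "hyperintense", "well-circumscribed", "diffuse"]
def toy_llm(parents):
    return [(f"{random.choice(ADJ)} lesion absent",
             f"{random.choice(ADJ)} lesion present") for _ in parents]

ensemble = evolve([("no lesion", "lesion")], lambda p: random.random(), toy_llm)
print(ensemble[:3])
```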
[177] LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
Shr-Ruei Tsai, Wei-Cheng Chang, Jie-Ying Lee, Chih-Hai Su, Yu-Lun Liu
Main category: cs.CV
TL;DR: LightsOut is a diffusion-based outpainting framework that enhances Single Image Flare Removal (SIFR) by reconstructing off-frame light sources, improving performance of existing methods without retraining.
Details
Motivation: Lens flare degrades image quality and affects computer vision tasks. Current SIFR methods perform poorly when off-frame light sources are incomplete or absent.
Method: Uses a diffusion-based outpainting framework with multitask regression module and LoRA fine-tuned diffusion model to reconstruct off-frame light sources realistically and physically consistently.
Result: Comprehensive experiments show LightsOut consistently boosts performance of existing SIFR methods across challenging scenarios.
Conclusion: LightsOut serves as a universally applicable plug-and-play preprocessing solution for flare removal without requiring additional retraining of existing methods.
Abstract: Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution. Project page: https://ray-1026.github.io/lightsout/
[178] Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, Yu-Lun Liu
Main category: cs.CV
TL;DR: Skyfall-GS is a framework for creating large-scale 3D urban scenes by combining satellite imagery for geometry with diffusion models for textures, enabling real-time exploration without costly 3D annotations.
Details
Motivation: To address the challenge of synthesizing large-scale, explorable 3D urban scenes without relying on expensive 3D scans for training, by leveraging readily available satellite imagery and diffusion models.
Method: Synergizes satellite imagery for coarse geometry with open-domain diffusion models for high-quality close-up appearances, using a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures.
Result: Extensive experiments show that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches.
Conclusion: Skyfall-GS successfully creates city-block scale 3D scenes without costly 3D annotations, featuring real-time immersive exploration capabilities.
Abstract: Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task in providing immersive and embodied applications. The challenges lie in the lack of large-scale and high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to create large-scale 3D scenes by synergizing the readily available satellite imagery that supplies realistic coarse geometry and the open-domain diffusion model for creating high-quality close-up appearances. We propose \textbf{Skyfall-GS}, the first city-block scale 3D scene creation framework without costly 3D annotations, also featuring real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches. Project page: https://skyfall-gs.jayinnn.dev/
[179] CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs
Ao Wang, Hui Chen, Zijia Lin, Sicheng Zhao, Jungong Han, Guiguang Ding
Main category: cs.CV
TL;DR: CAIT is a joint compression method for Vision Transformers that combines token merging and channel pruning to achieve high accuracy, fast inference, and good transferability to downstream tasks.
Details
Motivation: Vision Transformers have high computation costs that limit deployment on resource-limited devices. Existing compression methods either sparsely drop tokens or brutally remove channels, leading to sub-optimal performance-speed balance and poor transferability to spatial tasks like semantic segmentation.
Method: Proposes CAIT with two strategies: 1) Asymmetric Token Merging (ATME) to integrate neighboring tokens while preserving spatial structure, and 2) Consistent Dynamic Channel Pruning (CDCP) to dynamically prune unimportant channels uniformly across multi-head self-attention modules.
Result: Extensive experiments on multiple benchmark datasets show state-of-the-art performance across various Vision Transformers, achieving a good balance between model performance and inference speed.
Conclusion: CAIT successfully addresses ViT compression challenges by jointly applying token merging and channel pruning, enabling efficient deployment while maintaining accuracy and transferability to downstream vision tasks.
Abstract: Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks recently. However, their heavy computation costs remain daunting for resource-limited devices. To address this, researchers have dedicated themselves to compressing redundant information in ViTs for acceleration. However, existing approaches generally sparsely drop redundant image tokens by token pruning or brutally remove channels by channel pruning, leading to a sub-optimal balance between model performance and inference speed. Moreover, they struggle when transferring compressed models to downstream vision tasks that require the spatial structure of images, such as semantic segmentation. To tackle these issues, we propose CAIT, a joint \underline{c}ompression method for ViTs that achieves a harmonious blend of high \underline{a}ccuracy, fast \underline{i}nference speed, and favorable \underline{t}ransferability to downstream tasks. Specifically, we introduce an asymmetric token merging (ATME) strategy to effectively integrate neighboring tokens. It can successfully compress redundant token information while preserving the spatial structure of images. On top of it, we further design a consistent dynamic channel pruning (CDCP) strategy to dynamically prune unimportant channels in ViTs. Thanks to CDCP, insignificant channels in multi-head self-attention modules of ViTs can be pruned uniformly, significantly enhancing the model compression. Extensive experiments on multiple benchmark datasets show that our proposed method can achieve state-of-the-art performance across various ViTs.
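One plausible reading of grid-preserving token merging (our illustration, not the paper's exact ATME): pool neighboring tokens along one axis at a time so the token grid, and hence spatial structure, survives.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, H, W, kh=2, kw=1):
    """x: (B, H*W, C) tokens on an HxW grid. Average-pool kh x kw neighbors
    and return the merged sequence plus the new grid size."""
    B, N, C = x.shape
    assert N == H * W
    grid = x.transpose(1, 2).reshape(B, C, H, W)       # tokens -> feature map
    merged = F.avg_pool2d(grid, kernel_size=(kh, kw))  # fuse neighbors
    return merged.flatten(2).transpose(1, 2), H // kh, W // kw

tokens = torch.randn(1, 14 * 14, 384)
t, H, W = merge_tokens(tokens, 14, 14, kh=2, kw=1)     # 14x14 -> 7x14
t, H, W = merge_tokens(t, H, W, kh=1, kw=2)            # 7x14  -> 7x7
print(t.shape, (H, W))                                 # (1, 49, 384) (7, 7)
```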
[180] MotionScript: Natural Language Descriptions for Expressive 3D Human Motions
Payam Jome Yazdian, Rachel Lagasse, Hamid Mohammadi, Eric Liu, Li Cheng, Angelica Lim
Main category: cs.CV
TL;DR: MotionScript is a framework that generates detailed natural language descriptions of 3D human motions, providing fine-grained structured descriptions for expressive actions and interactions beyond standard motion datasets.
Details
Motivation: Existing motion datasets use broad action labels or generic captions, lacking the ability to capture the full complexity of human movement including expressive actions and detailed interactions.
Method: MotionScript systematically translates 3D motion into structured natural language without requiring training data, serving as both a descriptive tool and training resource for text-to-motion models.
Result: By augmenting motion datasets with MotionScript captions, the framework demonstrates significant improvements in out-of-distribution motion generation, enabling LLMs to generate motions beyond existing data.
Conclusion: MotionScript provides an interpretable bridge between intuitive descriptions and motion synthesis, opening new applications in animation, virtual human simulation, and robotics.
Abstract: We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. Unlike existing motion datasets that rely on broad action labels or generic captions, MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement including expressive actions (e.g., emotions, stylistic walking) and interactions beyond standard motion capture datasets. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models, enabling the synthesis of highly realistic and diverse human motions from text. By augmenting motion datasets with MotionScript captions, we demonstrate significant improvements in out-of-distribution motion generation, allowing large language models (LLMs) to generate motions that extend beyond existing data. Additionally, MotionScript opens new applications in animation, virtual human simulation, and robotics, providing an interpretable bridge between intuitive descriptions and motion synthesis. To the best of our knowledge, this is the first attempt to systematically translate 3D motion into structured natural language without requiring training data.
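Rule-based motion-to-text can be illustrated with a tiny sketch; the joint layout, thresholds, and phrases below are invented for the example and are far simpler than MotionScript's taxonomy.

```python
import numpy as np

def describe(joints, fps=30):
    """joints: (T, J, 3) positions; joint 0 is assumed to be the pelvis."""
    pelvis = joints[:, 0]
    # Mean horizontal speed (x-z plane) in meters per second.
    speed = np.linalg.norm(np.diff(pelvis[:, [0, 2]], axis=0), axis=1).mean() * fps
    rising = (pelvis[-1, 1] - pelvis[0, 1]) > 0.3      # net vertical gain
    phrase = ("moves quickly" if speed > 1.5 else
              "walks" if speed > 0.3 else "stays in place")
    if rising:
        phrase += " while rising upward"
    return f"The person {phrase}."

motion = np.zeros((60, 22, 3))
motion[:, 0, 0] = np.linspace(0.0, 2.0, 60)            # pelvis drifts 2 m in 2 s
print(describe(motion))                                # -> "The person walks."
```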
[181] Diffusion Models are Efficient Data Generators for Human Mesh Recovery
Yongtao Ge, Wenjia Wang, Yongfan Chen, Fanzhou Wang, Lei Yang, Hao Chen, Chunhua Shen
Main category: cs.CV
TL;DR: HumanWild is a synthetic data generation pipeline using diffusion models to create diverse human images with 3D mesh annotations, addressing limitations of current mocap and CG datasets for 3D human pose and shape estimation.
Details
Motivation: Current 3D human pose and shape estimation methods rely on confined indoor mocap datasets or CG-rendered data, which lack adequate human identities and authentic in-the-wild backgrounds needed for real-world generalization.
Method: Proposes HumanWild pipeline using diffusion models: collects large human-centric dataset with annotations, trains multi-condition ControlNet using SMPL-X parametric model to generate human images with initial labels, creates flexible pipeline for complex scenes.
Result: Generated large-scale, in-the-wild human images with high-quality 3D mesh annotations, reducing manual collection needs. Dataset covers diverse viewpoints, environments, and human identities.
Conclusion: Synthetic data from generative models complements CG data for better real-world generalization. HumanWild enables scaling 3D human recovery to in-the-wild scenes through flexible, customizable data generation.
Abstract: Despite remarkable progress having been made on the problem of 3D human pose and shape estimation (HPS), current state-of-the-art methods rely heavily on either confined indoor mocap datasets or datasets generated by a rendering engine using computer graphics (CG). Both categories of datasets exhibit inadequacies in furnishing adequate human identities and authentic in-the-wild background scenes, which are crucial for accurately simulating real-world distributions. In this work, we show that synthetic data created by generative models is complementary to CG-rendered data for achieving remarkable generalization performance on diverse real-world scenes. We propose an effective data generation pipeline based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. Specifically, we first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions, the depth map, and surface normal images. To generate a wide variety of human images with initial labels, we train a customized, multi-condition ControlNet model. The key to this process is using a 3D parametric model, e.g., SMPL-X, to create various condition inputs easily. Our data generation pipeline is both flexible and customizable, making it adaptable to multiple real-world tasks, such as human interaction in complex scenes and humans captured by wide-angle lenses. By relying solely on generative models, we can produce large-scale, in-the-wild human images with high-quality annotations, significantly reducing the need for manual image collection and annotation. The generated dataset encompasses a wide range of viewpoints, environments, and human identities, ensuring its versatility across different scenarios. We hope that our work could pave the way for scaling up 3D human recovery to in-the-wild scenes.
[182] HumorDB: Can AI understand graphical humor?
Vedaant Jain, Felipe dos Santos Alves Feitosa, Gabriel Kreiman
Main category: cs.CV
TL;DR: This paper introduces HumorDB, a controlled dataset for evaluating AI’s visual humor understanding through binary classification, funniness rating, and pairwise comparison tasks, revealing gaps between AI and human performance.
Details
Motivation: To address the challenge of complex scene understanding by using visual humor as a test case that requires interpreting scene element interactions and cognitive knowledge.
Method: Created HumorDB dataset with diverse images (photos, cartoons, sketches, AI-generated) including contrastive pairs; evaluated humans, vision models, and vision-language models on three humor understanding tasks.
Result: AI systems lag behind human humor understanding; vision-language models outperform vision-only models but struggle with abstract sketches and subtle humor cues; models often fail to focus on correct humorous regions.
Conclusion: Effective visual humor understanding requires architectures that can detect subtle contextual features and bridge visual perception with abstract reasoning.
Abstract: Despite significant advancements in image segmentation and object detection, understanding complex scenes remains a significant challenge. Here, we focus on graphical humor as a paradigmatic example of image interpretation that requires elucidating the interaction of different scene elements in the context of prior cognitive knowledge. This paper introduces \textbf{HumorDB}, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. The dataset comprises diverse images spanning photos, cartoons, sketches, and AI-generated content, including minimally contrastive pairs where subtle edits differentiate between humorous and non-humorous versions. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding. While pretrained vision-language models perform better than vision-only models, they still struggle with abstract sketches and subtle humor cues. Analysis of attention maps shows that even when models correctly classify humorous images, they often fail to focus on the precise regions that make the image funny. Preliminary mechanistic interpretability studies and evaluation of model explanations provide initial insights into how different architectures process humor. Our results identify promising trends and current limitations, suggesting that an effective understanding of visual humor requires sophisticated architectures capable of detecting subtle contextual features and bridging the gap between visual perception and abstract reasoning. All the code and data are available here: \href{https://github.com/kreimanlab/HumorDB}{https://github.com/kreimanlab/HumorDB}
[183] CLOVER: Context-aware Long-term Object Viewpoint- and Environment- Invariant Representation Learning
Dongmyeong Lee, Amanda Adkins, Joydeep Biswas
Main category: cs.CV
TL;DR: CODa Re-ID is a new dataset for object re-identification with 1M+ observations of 557 objects across 8 classes under diverse conditions. CLOVER is a representation learning method that distinguishes static object instances without requiring segmentation, and MapCLOVER enables scalable descriptor summarization for object mapping.
Details
Motivation: Mobile robots need object-level understanding to re-identify previously seen instances across different viewpoints and appearance variations (weather, lighting). Existing methods are limited to specific classes, require segmentation, and don't adequately address outdoor scenes and illumination changes.
Method: Proposed CLOVER - a representation learning method for object observations that can distinguish between static object instances without requiring foreground segmentation. Also introduced MapCLOVER for scalably summarizing CLOVER descriptors for object maps and matching new observations.
Result: CLOVER achieves superior performance in static object re-identification under varying lighting conditions and viewpoint changes. The method can generalize to unseen instances and classes.
Conclusion: The proposed CLOVER and MapCLOVER methods effectively address object re-identification challenges in real-world environments, handling viewpoint and illumination variations without requiring segmentation, and demonstrating strong generalization capabilities.
Abstract: Mobile service robots can benefit from object-level understanding of their environments, including the ability to distinguish object instances and re-identify previously seen instances. Object re-identification is challenging across different viewpoints and in scenes with significant appearance variation arising from weather or lighting changes. Existing works on object re-identification either focus on specific classes or require foreground segmentation. Further, these methods, along with object re-identification datasets, have limited consideration of challenges such as outdoor scenes and illumination changes. To address this problem, we introduce CODa Re-ID: an in-the-wild object re-identification dataset containing 1,037,814 observations of 557 objects across 8 classes under diverse lighting conditions and viewpoints. Further, we propose CLOVER, a representation learning method for object observations that can distinguish between static object instances without requiring foreground segmentation. We also introduce MapCLOVER, a method for scalably summarizing CLOVER descriptors for use in object maps and matching new observations to summarized descriptors. Our results show that CLOVER achieves superior performance in static object re-identification under varying lighting conditions and viewpoint changes and can generalize to unseen instances and classes.
[184] Self-supervised Multi-future Occupancy Forecasting for Autonomous Driving
Bernard Lange, Masha Itkina, Jiachen Li, Mykel J. Kochenderfer
Main category: cs.CV
TL;DR: LOPR is a stochastic LiDAR occupancy grid map prediction framework that operates in latent space, integrates multiple sensor modalities, and outperforms prior deterministic approaches.
Details
Motivation: Existing deterministic L-OGM prediction methods produce unrealistic predictions, fail to capture environmental stochasticity, and don't effectively integrate additional AV sensor modalities.
Method: Performs stochastic L-OGM prediction in latent space using generative architecture, conditions on RGB cameras, maps, and planned trajectories, with single-step decoder for real-time or diffusion-based batch decoder for refinement.
Result: All variants outperform prior approaches qualitatively and quantitatively on nuScenes and Waymo Open datasets.
Conclusion: LOPR demonstrates superior performance in stochastic environment prediction by leveraging latent space modeling and multi-modal sensor integration.
Abstract: Environment prediction frameworks are critical for the safe navigation of autonomous vehicles (AVs) in dynamic settings. LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird’s-eye view for the scene representation, enabling self-supervised joint scene predictions while exhibiting resilience to partial observability and perception detection failures. Prior approaches have focused on deterministic L-OGM prediction architectures within the grid cell space. While these methods have seen some success, they frequently produce unrealistic predictions and fail to capture the stochastic nature of the environment. Additionally, they do not effectively integrate additional sensor modalities present in AVs. Our proposed framework, Latent Occupancy Prediction (LOPR), performs stochastic L-OGM prediction in the latent space of a generative architecture and allows for conditioning on RGB cameras, maps, and planned trajectories. We decode predictions using either a single-step decoder, which provides high-quality predictions in real-time, or a diffusion-based batch decoder, which can further refine the decoded frames to address temporal consistency issues and reduce compression losses. Our experiments on the nuScenes and Waymo Open datasets show that all variants of our approach qualitatively and quantitatively outperform prior approaches.
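A compressed sketch of prediction in a learned latent space: encode past grids, roll a sequence model forward with a sampled latent, decode one future. Module sizes are illustrative; the paper's generative architecture is substantially richer.

```python
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Flatten(), nn.Linear(32 * 16 * 16, d))
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.to_stats = nn.Linear(d, 2 * d)              # mean and log-variance
        self.dec = nn.Sequential(nn.Linear(d, 64 * 64), nn.Sigmoid())

    def forward(self, past):                             # past: (B, T, 1, 64, 64)
        B, T = past.shape[:2]
        z = self.enc(past.flatten(0, 1)).view(B, T, -1)  # per-frame latents
        h, _ = self.rnn(z)
        mu, logvar = self.to_stats(h[:, -1]).chunk(2, dim=-1)
        z_next = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample a future
        return self.dec(z_next).view(B, 1, 64, 64)

model = LatentPredictor()
pred = model(torch.rand(2, 5, 1, 64, 64))
print(pred.shape)   # one stochastic next-frame occupancy prediction
```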
[185] V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception
Lei Yang, Xinyu Zhang, Jun Li, Chen Wang, Jiaqi Ma, Zhiying Song, Tong Zhao, Ziying Song, Li Wang, Mo Zhou, Yang Shen, Kai Wu, Chen Lv
Main category: cs.CV
TL;DR: V2X-Radar is the first large-scale real-world multi-modal dataset for cooperative perception featuring 4D Radar, addressing the gap in existing datasets that focus primarily on cameras and LiDAR.
Details
Motivation: Current cooperative perception datasets lack 4D Radar data, which is crucial for robust perception in adverse weather conditions. This gap limits the development of comprehensive autonomous driving systems.
Method: Collected data using connected vehicle platforms and intelligent roadside units equipped with 4D Radar, LiDAR, and multi-view cameras across various weather conditions and scenarios.
Result: Created V2X-Radar dataset with 20K LiDAR frames, 40K camera images, 20K 4D Radar data, and 350K annotated boxes across five categories, organized into three sub-datasets for different research domains.
Conclusion: V2X-Radar bridges the 4D Radar gap in cooperative perception and provides comprehensive benchmarks for cooperative, roadside, and single-vehicle perception research.
Abstract: Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby enhancing the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged; however, these datasets primarily focus on cameras and LiDAR, neglecting 4D Radar, a sensor used in single-vehicle autonomous driving to provide robust perception in adverse weather conditions. In this paper, to bridge the gap created by the absence of 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large-scale, real-world multi-modal dataset featuring 4D Radar. V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data encompasses sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as various typical challenging scenarios. The dataset consists of 20K LiDAR frames, 40K camera images, and 20K 4D Radar data, including 350K annotated boxes across five categories. To support various research domains, we have established V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. Furthermore, we provide comprehensive benchmarks across these three sub-datasets. We will release all datasets and benchmark codebase at http://openmpd.com/column/V2X-Radar and https://github.com/yanglei18/V2X-Radar.
[186] PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Ao Wang, Hui Chen, Jiaxin Li, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Zijia Lin, Jungong Han, Guiguang Ding
Main category: cs.CV
TL;DR: PrefixKV is a method that optimizes KV cache size in large vision-language models by adaptively selecting important KV vectors per layer using binary search, improving inference efficiency while maintaining generation quality.
Details
Motivation: Large vision-language models suffer from high computational and memory overhead during inference due to extensive KV cache requirements. Existing methods overlook the varying importance of KV vectors across different layers, leading to performance degradation.
Method: Proposes PrefixKV which reframes KV cache size optimization as searching for optimal global prefix configuration. Uses adaptive layer-wise KV retention based on binary search to preserve maximum contextual information in each layer.
Result: Achieves state-of-the-art performance with superior inference efficiency and generation quality trade-offs. Extensive experiments demonstrate promising potential for practical applications.
Conclusion: PrefixKV effectively addresses KV cache optimization by considering layer-wise importance distributions, enabling efficient deployment of large vision-language models in practical scenarios while maintaining high generation quality.
Abstract: Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders the efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during the next token prediction. This results in the significant contextual information loss for certain layers, leading to notable performance decline. To address this, we present PrefixKV, where “Prefix” means the top-ranked KV based on importance rather than position in the original sequence. It reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating the generation. Extensive experiments demonstrate that our method achieves the state-of-the-art performance compared with others. It exhibits superior inference efficiency and generation quality trade-offs, showing promising potential for practical applications. Code is available at https://github.com/THU-MIG/PrefixKV.
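One way to read the adaptive layer-wise retention is as a binary search over a global coverage level: each layer keeps its top-ranked KV entries until a fraction tau of its total importance is covered, and tau is searched so the overall cache meets a budget. The importance scores below are synthetic placeholders.

```python
import numpy as np

def kept_per_layer(importance, tau):
    """importance: list of (T,) non-negative scores, one array per layer.
    Each layer keeps its top entries until fraction tau of its total
    importance is covered."""
    kept = []
    for s in importance:
        cum = np.cumsum(np.sort(s)[::-1]) / s.sum()
        kept.append(min(int(np.searchsorted(cum, tau)) + 1, len(s)))
    return kept

def search_tau(importance, budget, iters=30):
    lo, hi = 0.0, 1.0
    for _ in range(iters):                    # binary search on coverage tau
        mid = (lo + hi) / 2
        if sum(kept_per_layer(importance, mid)) <= budget:
            lo = mid                          # budget allows more coverage
        else:
            hi = mid
    return lo

rng = np.random.default_rng(0)
scores = [rng.random(1024) ** 3 for _ in range(32)]    # 32 layers, 1024 tokens
tau = search_tau(scores, budget=32 * 256)              # ~25% cache on average
print(round(tau, 3), kept_per_layer(scores, tau)[:4])  # per-layer sizes vary
```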
[187] Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang
Main category: cs.CV
TL;DR: VisPer-LM improves MLLMs by infusing visual perception knowledge into LLM representations during pretraining, addressing the language bias in current approaches and enhancing spatial reasoning capabilities.
Details
Motivation: Current MLLMs trained with natural language supervision tend to prioritize language comprehension over visual perception, which is critical for spatial reasoning tasks in embodied AI and robotics.
Method: Proposes VisPer-LM which couples optimization of predictive visual embedding with next token prediction during pretraining, infusing visual perception knowledge from expert vision encoders into LLM hidden representations.
Result: VisPer-LM outperforms single and multi-encoder baselines by average 2.5% across benchmarks, with 8.7% improvement on Depth task in CV-Bench. Shows improved visual representation quality through embedding optimization.
Conclusion: The approach successfully addresses the language bias in MLLMs and demonstrates that coupling visual embedding optimization with text prediction during pretraining significantly enhances visual perception capabilities.
Abstract: In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data, which are critical for tasks involving spatial reasoning in the domain of embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM’s (of an MLLM) hidden representations. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next (text) token prediction. Moreover, through extensive probing, we observe improved visual representation quality due to embedding optimization, underscoring the effectiveness of our probing setup. We demonstrate that our VisPer-LM outperforms the single and multi-encoder baselines, proving our approach’s superiority over explicitly feeding the corresponding features to the LLM. In particular, VisPer-LM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
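The coupled objective can be written as next-token cross-entropy plus an embedding-prediction term on the LLM states at visual positions; the projection head, cosine form, and weighting below are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def coupled_loss(lm_logits, target_tokens, hidden_at_visual, expert_embed,
                 proj, alpha=0.5):
    """lm_logits: (B, T, V); target_tokens: (B, T);
    hidden_at_visual: (B, Nv, D) LLM states at visual positions;
    expert_embed: (B, Nv, De) frozen expert targets; proj: nn.Linear(D, De)."""
    ce = F.cross_entropy(lm_logits.flatten(0, 1), target_tokens.flatten())
    pred = F.normalize(proj(hidden_at_visual), dim=-1)
    tgt = F.normalize(expert_embed, dim=-1)
    embed_loss = (1 - (pred * tgt).sum(-1)).mean()      # cosine distance
    return ce + alpha * embed_loss

proj = torch.nn.Linear(512, 768)
loss = coupled_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)),
                    torch.randn(2, 4, 512), torch.randn(2, 4, 768), proj)
print(loss.item())
```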
[188] Conformal Risk Control for Pulmonary Nodule Detection
Roel Hulsman, Valentin Comte, Lorenzo Bertolini, Tobias Wiesenthal, Antonio Puertas Gallardo, Mario Ceresa
Main category: cs.CV
TL;DR: This paper presents a case study on pulmonary nodule detection using conformal risk control (CRC) to quantify predictive uncertainty, showing that prediction sets with conformal guarantees help achieve reliable decisions in healthcare by trading off false positives while maintaining competitive sensitivity.
Details
Motivation: Quantitative tools in healthcare need reliable uncertainty quantification for transparent decision-making, especially in safety-critical domains like lung cancer screening where understanding predictive uncertainties is crucial.
Method: Enhanced an advanced pulmonary nodule detection model with conformal risk control (CRC) technique to provide prediction sets with formal statistical guarantees on model performance.
Result: The model achieved sensitivity competitive with individual radiologists with a slight increase in false positives, and demonstrated the risks of using off-the-shelf models when facing ontological uncertainty from radiologist disagreements.
Conclusion: Prediction sets with conformal guarantees are effective uncertainty measures in healthcare, allowing end-users to balance validity trade-offs while highlighting the dangers of ontological uncertainty in medical AI applications.
Abstract: Quantitative tools are increasingly appealing for decision support in healthcare, driven by the growing capabilities of advanced AI systems. However, understanding the predictive uncertainties surrounding a tool’s output is crucial for decision-makers to ensure reliable and transparent decisions. In this paper, we present a case study on pulmonary nodule detection for lung cancer screening, enhancing an advanced detection model with an uncertainty quantification technique called conformal risk control (CRC). We demonstrate that prediction sets with conformal guarantees are attractive measures of predictive uncertainty in the safety-critical healthcare domain, allowing end-users to achieve arbitrary validity by trading off false positives and providing formal statistical guarantees on model performance. Among ground-truth nodules annotated by at least three radiologists, our model achieves a sensitivity that is competitive with that generally achieved by individual radiologists, with a slight increase in false positives. Furthermore, we illustrate the risks of using off-the-shelf prediction models when faced with ontological uncertainty, such as when radiologists disagree on what constitutes the ground truth on pulmonary nodules.
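For readers unfamiliar with CRC, a generic calibration sketch in the standard conformal-risk-control style: sweep a confidence threshold and return the most aggressive value whose adjusted empirical risk stays below the target level. The data here is synthetic and the loss function is illustrative.

```python
import numpy as np

def calibrate_threshold(loss_at, lambdas, alpha, B=1.0):
    """loss_at(lam) -> (n,) per-case losses in [0, B], e.g. the fraction of a
    case's ground-truth nodules missed at confidence threshold lam. `lambdas`
    runs from aggressive (high threshold) to conservative; the first value
    whose adjusted empirical risk is below alpha is returned."""
    for lam in lambdas:
        losses = loss_at(lam)
        n = len(losses)
        if (n * losses.mean() + B) / (n + 1) <= alpha:   # CRC upper bound
            return lam
    return lambdas[-1]    # fall back to the most conservative setting

# Synthetic calibration set: 200 cases x 10 nodule confidences each.
rng = np.random.default_rng(1)
scores = rng.random((200, 10))
miss_fraction = lambda lam: (scores < lam).mean(axis=1)  # misses below lam
lam_hat = calibrate_threshold(miss_fraction, np.linspace(0.9, 0.0, 91), alpha=0.1)
print("calibrated confidence threshold:", round(float(lam_hat), 3))
```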
[189] Unleashing the Potential of Pre-Trained Diffusion Models for Generalizable Person Re-Identification
Jiachen Li, Xiaojin Gong
Main category: cs.CV
TL;DR: DCAC is a novel method that combines discriminative Re-ID models with diffusion models using correlation-aware conditioning to improve domain generalization in person re-identification.
Details
Motivation: Existing DG Re-ID methods often fail to mitigate shortcut learning and achieve suboptimal performance on unseen target domains, despite using discriminative or contrastive learning frameworks.
Method: Integrates Re-ID model with pre-trained diffusion model through correlation-aware conditioning scheme that uses ID classification probabilities and learnable ID-wise prompts to inject dark knowledge about ID correlations.
Result: Achieves state-of-the-art performance on both single-source and multi-source DG Re-ID tasks, with comprehensive ablation studies validating effectiveness and robustness.
Conclusion: DCAC effectively enhances generalization capability of Re-ID features through bidirectional interaction between Re-ID and diffusion models, providing a robust solution for domain-generalizable person re-identification.
Abstract: Domain-generalizable re-identification (DG Re-ID) aims to train a model on one or more source domains and evaluate its performance on unseen target domains, a task that has attracted growing attention due to its practical relevance. While numerous methods have been proposed, most rely on discriminative or contrastive learning frameworks to learn generalizable feature representations. However, these approaches often fail to mitigate shortcut learning, leading to suboptimal performance. In this work, we propose a novel method called diffusion model-assisted representation learning with a correlation-aware conditioning scheme (DCAC) to enhance DG Re-ID. Our method integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model through a correlation-aware conditioning scheme. By incorporating ID classification probabilities generated from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures ID correlations to guide the diffusion process. Simultaneously, feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization capability of Re-ID features. Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies further validate the effectiveness of the proposed approach, providing insights into its robustness. Codes will be available at https://github.com/RikoLi/DCAC.
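The correlation-aware condition can be read as ID classification probabilities mixing a bank of learnable ID-wise prompts, so soft beliefs over identities (the "dark knowledge") shape the diffusion condition. The 751-ID count below is only an example (Market-1501's size), not a detail from the paper.

```python
import torch
import torch.nn as nn

class CorrelationCondition(nn.Module):
    def __init__(self, num_ids=751, prompt_dim=256):
        super().__init__()
        # One learnable prompt vector per identity.
        self.id_prompts = nn.Parameter(torch.randn(num_ids, prompt_dim) * 0.02)

    def forward(self, id_logits):
        probs = id_logits.softmax(dim=-1)    # (B, num_ids) soft ID beliefs
        return probs @ self.id_prompts       # (B, prompt_dim) condition vector

cond_net = CorrelationCondition()
cond = cond_net(torch.randn(4, 751))         # feed this to the diffusion model
print(cond.shape)
```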
[190] KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection
Tuan-Vinh La, Minh-Hieu Nguyen, Minh-Son Dao
Main category: cs.CV
TL;DR: A novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations using bottom-up attention, CLIP, RoBERTa, and knowledge graph entities with Transformer-based classification.
Details
Motivation: Existing fake news detection methods have limitations: they focus only on global image context while ignoring local object details, and fail to incorporate external knowledge and entity relationships for deeper semantic understanding.
Method: Proposed framework integrates visual, textual, and knowledge representations using bottom-up attention for object details, CLIP for global image semantics, RoBERTa for text encoding, and knowledge graph entity retrieval with adaptive selection. Uses Transformer-based classifier for final prediction.
Result: Experimental results show the model outperforms recent approaches, demonstrating the effectiveness of the neighbor selection mechanism and multi-modal fusion for fake news detection.
Conclusion: Introduces knowledge-grounded multimodal reasoning paradigm, shifting fake news detection from feature fusion to semantically grounded verification through entity-level selection and NLI-guided filtering.
Abstract: Fake news detection remains a challenging problem due to the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) Existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multi-modal features are processed through a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showcasing the effectiveness of the neighbor selection mechanism and multi-modal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and NLI-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly available at \href{https://github.com/latuanvinh1998/KGAlign}{github.com/latuanvinh1998/KGAlign}.
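A schematic of the fusion-and-classify stage: project the three feature groups to a shared width, run a Transformer encoder over them, and classify from a pooled token. Projection dimensions and the pooling choice are invented for the sketch.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, d=256, n_classes=2):
        super().__init__()
        self.img_proj = nn.Linear(512, d)      # e.g. CLIP global feature
        self.txt_proj = nn.Linear(768, d)      # e.g. RoBERTa [CLS]
        self.ent_proj = nn.Linear(100, d)      # e.g. KG entity embeddings
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_classes)

    def forward(self, img_feat, txt_feat, ent_feats):
        tokens = torch.cat([self.img_proj(img_feat).unsqueeze(1),
                            self.txt_proj(txt_feat).unsqueeze(1),
                            self.ent_proj(ent_feats)], dim=1)
        fused = self.fuse(tokens)
        return self.head(fused.mean(dim=1))    # pooled veracity logits

model = FusionClassifier()
logits = model(torch.randn(2, 512), torch.randn(2, 768), torch.randn(2, 5, 100))
print(logits.shape)   # (2, 2): real vs fake
```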
[191] Methods and Trends in Detecting AI-Generated Images: A Comprehensive Review
Arpan Mahara, Naphtali Rishe
Main category: cs.CV
TL;DR: This survey provides a comprehensive review of state-of-the-art techniques for detecting synthetic images generated by advanced generative AI models, addressing gaps in previous reviews by covering multimodal frameworks, reasoning-based detection, and training-free methodologies.
Details
Motivation: The proliferation of generative models (GANs, Diffusion Models, VAEs) has enabled high-quality data synthesis but raised concerns about adversarial attacks, unethical usage, and societal harm. Prior reviews focused mainly on deepfake detection and overlooked recent advancements in synthetic image forensics.
Method: The review systematically examines core detection paradigms categorized into spatial-domain, frequency-domain, fingerprint-based, patch-based, training-free, and multimodal reasoning-based frameworks, with comparative analyses on public datasets.
Result: The survey provides detailed comparative analyses of detection methods to assess their generalizability, robustness, and interpretability, highlighting the effectiveness of different approaches across various detection frameworks.
Conclusion: Hybrid frameworks combining the efficiency of training-free approaches with semantic reasoning of multimodal models show potential for advancing trustworthy and explainable synthetic image forensics. The survey also identifies open challenges and future directions.
Abstract: The proliferation of generative models, such as Generative Adversarial Networks (GANs), Diffusion Models, and Variational Autoencoders (VAEs), has enabled the synthesis of high-quality multimedia data. However, these advancements have also raised significant concerns regarding adversarial attacks, unethical usage, and societal harm. Recognizing these challenges, researchers have increasingly focused on developing methodologies to detect synthesized data effectively, aiming to mitigate potential risks. Prior reviews have predominantly focused on deepfake detection and often overlook recent advancements in synthetic image forensics, particularly approaches that incorporate multimodal frameworks, reasoning-based detection, and training-free methodologies. To bridge this gap, this survey provides a comprehensive and up-to-date review of state-of-the-art techniques for detecting and classifying synthetic images generated by advanced generative AI models. The review systematically examines core detection paradigms, categorizes them into spatial-domain, frequency-domain, fingerprint-based, patch-based, training-free, and multimodal reasoning-based frameworks, and offers concise descriptions of their underlying principles. We further provide detailed comparative analyses of these methods on publicly available datasets to assess their generalizability, robustness, and interpretability. Finally, the survey highlights open challenges and future directions, emphasizing the potential of hybrid frameworks that combine the efficiency of training-free approaches with the semantic reasoning of multimodal models to advance trustworthy and explainable synthetic image forensics.
[192] NFIG: Autoregressive Image Generation with Next-Frequency Prediction
Zhihao Huang, Xi Qiu, Yukuo Ma, Yifu Zhou, Junjie Chen, Hongyuan Zhang, Chi Zhang, Xuelong Li
Main category: cs.CV
TL;DR: NFIG is a novel autoregressive image generation framework that decomposes image generation into frequency-guided stages, starting from low-frequency components for global structure and progressively adding higher-frequency details, achieving better performance with fewer steps.
Details
Motivation: Autoregressive models face challenges in image generation including capturing long-range dependencies, high computational costs, and defining meaningful autoregressive sequences that reflect natural image hierarchies.
Method: Decomposes image generation into multiple frequency-guided stages: first generates low-frequency components for global structure with fewer tokens, then progressively adds higher-frequency details following natural spectral hierarchy.
Result: Achieves state-of-the-art performance with 1.25× speedup compared to VAR-d20 and better performance (FID: 2.81) on ImageNet-256 benchmark, while using fewer steps.
Conclusion: Incorporating frequency-domain knowledge to guide autoregressive sequence design provides an efficient solution for image generation and offers insights for future research.
Abstract: Autoregressive models have achieved promising results in natural language processing. However, for image generation tasks, they encounter substantial challenges in effectively capturing long-range dependencies, managing computational costs, and most crucially, defining meaningful autoregressive sequences that reflect natural image hierarchies. To address these issues, we present \textbf{N}ext-\textbf{F}requency \textbf{I}mage \textbf{G}eneration (\textbf{NFIG}), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images. This principled autoregressive sequence not only improves the quality of generated images by better capturing true causal relationships between image components, but also significantly reduces computational overhead during inference. Extensive experiments demonstrate that NFIG achieves state-of-the-art performance with fewer steps, offering a more efficient solution for image generation, with 1.25$\times$ speedup compared to VAR-d20 while achieving better performance (FID: 2.81) on the ImageNet-256 benchmark. We hope that our insight of incorporating frequency-domain knowledge to guide autoregressive sequence design will shed light on future research. We will make our code publicly available upon acceptance of the paper.
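The frequency-staged view can be demonstrated with a radial low-pass in the Fourier domain: successive band residuals supply coarse-to-fine targets. This mimics the decomposition only, not the generator itself.

```python
import numpy as np

def lowpass(img, frac):
    """Keep frequencies within `frac` of the normalized Nyquist radius."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    mask = np.hypot(yy / (h / 2), xx / (w / 2)) <= frac
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

img = np.random.rand(64, 64)            # stand-in for a real image
stages, prev = [], np.zeros_like(img)
for frac in (0.1, 0.3, 2.0):            # coarse -> full spectrum (2.0 covers corners)
    cur = lowpass(img, frac)
    stages.append(cur - prev)           # band added at this stage
    prev = cur
print(np.allclose(sum(stages), img))    # the bands sum back to the image
```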
[193] YOLOE: Real-Time Seeing Anything
Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Main category: cs.CV
TL;DR: YOLOE is a highly efficient open-set object detection and segmentation model that integrates text prompts, visual prompts, and prompt-free mechanisms in a single framework, achieving real-time performance with minimal computational overhead.
Details
Motivation: Conventional object detection models like YOLO are limited to predefined categories, while existing open-set methods often compromise between performance and efficiency due to high computational demands or deployment complexity.
Method: YOLOE integrates three approaches: RepRTA for text prompts (re-parameterizable region-text alignment), SAVPE for visual prompts (semantic-activated visual prompt encoder), and LRPC for prompt-free scenarios (lazy region-prompt contrast) - all within a single efficient model.
Result: YOLOE achieves exceptional zero-shot performance with high efficiency: YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP on LVIS with 3x less training cost and 1.4x inference speedup. YOLOE-v8-L achieves 0.6 APb and 0.4 APm gains over closed-set YOLOv8-L on COCO with nearly 4x less training time.
Conclusion: YOLOE provides a comprehensive solution for open-set detection and segmentation that maintains high efficiency while supporting multiple prompt mechanisms, making real-time ‘seeing anything’ capabilities practical for deployment.
Abstract: Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE’s exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
[194] L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery
Ziwei Shi, Xiaoran Zhang, Wenjing Xu, Yan Xia, Yu Zang, Siqi Shen, Cheng Wang
Main category: cs.CV
TL;DR: L2RSI enables LiDAR place recognition using remote sensing imagery as map proxies, achieving 83.27% accuracy within 30m radius for top-1 retrieval in 100km² range.
Details
Motivation: To overcome the dependency on costly and time-consuming prior 3D maps for LiDAR-based place recognition by leveraging readily available overhead remote sensing imagery as map proxies.
Method: Proposes cross-view LiDAR place recognition using remote sensing imagery, learning feature alignment between point cloud and remote sensing submaps in semantic domain, with probability propagation based on particle estimation to refine position predictions.
Result: Within 100km² retrieval range, L2RSI accurately localizes 83.27% of point cloud submaps within 30m radius for top-1 retrieved location, demonstrating large-scale retrieval and cross-scene generalization without fine-tuning.
Conclusion: L2RSI provides an effective solution for large-scale LiDAR localization at reduced cost by using remote sensing imagery as map proxies, with strong performance in cross-view and cross-modal place recognition.
Abstract: We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct the XA-L&RSI dataset, which encompasses approximately $110,000$ remote sensing submaps and $13,000$ LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on particle estimation to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on XA-L&RSI demonstrate that, within a $100km^2$ retrieval range, L2RSI accurately localizes $83.27\%$ of point cloud submaps within a $30m$ radius for top-$1$ retrieved location. Our project page is publicly available at https://shizw695.github.io/L2RSI/.
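The particle-based probability propagation reads like a classic particle filter over position hypotheses; the sketch below is a toy with a synthetic similarity function in place of real retrieval scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate(particles, odom, sigma=2.0):
    """particles: (N, 2) positions in meters; odom: (2,) relative motion."""
    return particles + odom + rng.normal(0, sigma, particles.shape)

def reweight(particles, sim_fn):
    w = np.array([sim_fn(p) for p in particles])   # retrieval score per hypothesis
    w = np.exp(w - w.max())
    return w / w.sum()

def resample(particles, w):
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

# Demo: the true pose drifts north-east; similarity peaks near the true pose.
true_pos, particles = np.zeros(2), rng.uniform(-50, 50, (500, 2))
for _ in range(20):
    true_pos = true_pos + np.array([3.0, 1.0])
    particles = propagate(particles, np.array([3.0, 1.0]))
    w = reweight(particles, lambda p: -np.linalg.norm(p - true_pos) / 10)
    particles = resample(particles, w)
print("estimate:", particles.mean(axis=0), "truth:", true_pos)
```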
[195] UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection
Xin Jin, Haisheng Su, Kai Liu, Cong Ma, Wei Wu, Fei Hui, Junchi Yan
Main category: cs.CV
TL;DR: UniMamba integrates 3D convolution and State Space Models to efficiently capture both local and global spatial dependencies in LiDAR 3D detection, addressing limitations of Transformer-based approaches.
Details
Motivation: Transformer-based 3D detection methods destroy spatial structure during serialization and have limited receptive fields due to quadratic complexity and sequence grouping.
Method: Proposes UniMamba blocks with spatial locality modeling using 3D submanifold convolution, Z-order serialization, and local-global sequential aggregator with multi-head SSM in an encoder-decoder architecture.
Result: Achieves 70.2 mAP on nuScenes dataset and shows strong performance on Waymo and Argoverse 2 datasets.
Conclusion: UniMamba effectively captures local and global spatial contexts simultaneously, outperforming Transformer-based methods in 3D detection tasks.
Abstract: Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing the global dependencies from point cloud spaces, which serialize the 3D voxels into the flattened 1D sequence for iterative self-attention. However, the spatial structure of 3D voxels will be inevitably destroyed during the serialization process. Besides, due to the considerable number of 3D voxels and quadratic complexity of Transformers, multiple sequences are grouped before feeding to Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSM) achieved in the field of 2D vision tasks, in this paper, we propose a novel Unified Mamba (UniMamba), which seamlessly integrates the merits of 3D convolution and SSM in a concise multi-head manner, aiming to perform “local and global” spatial context aggregation efficiently and simultaneously. Specifically, a UniMamba block is designed which mainly consists of spatial locality modeling, complementary Z-order serialization and local-global sequential aggregator. The spatial locality modeling module integrates 3D submanifold convolution to capture the dynamic spatial position embedding before serialization. Then the efficient Z-order curve is adopted for serialization both horizontally and vertically. Furthermore, the local-global sequential aggregator adopts the channel grouping strategy to efficiently encode both “local and global” spatial inter-dependencies using multi-head SSM. Additionally, an encoder-decoder architecture with stacked UniMamba blocks is formed to facilitate multi-scale spatial learning hierarchically. Extensive experiments are conducted on three popular datasets: nuScenes, Waymo and Argoverse 2. Particularly, our UniMamba achieves 70.2 mAP on the nuScenes dataset.
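For concreteness, Z-order (Morton) serialization interleaves the bits of a voxel's x/y/z indices so that spatially nearby voxels tend to stay adjacent in the flattened 1D sequence. Below is the standard 10-bit-per-axis Morton key as a sketch of the serialization step alone; the paper's complementary horizontal and vertical variants would amount to permuting which axes are interleaved:

```python
import numpy as np

def part1by2(x):
    """Spread the bits of a 10-bit integer so they occupy every third position."""
    x = x & 0x000003FF
    x = (x ^ (x << 16)) & 0xFF0000FF
    x = (x ^ (x << 8)) & 0x0300F00F
    x = (x ^ (x << 4)) & 0x030C30C3
    x = (x ^ (x << 2)) & 0x09249249
    return x

def morton3d(ix, iy, iz):
    """Interleave voxel indices into a Z-order key for 1D serialization."""
    return part1by2(ix) | (part1by2(iy) << 1) | (part1by2(iz) << 2)

# sort sparse voxels along the Z-order curve
voxels = np.array([[3, 1, 0], [0, 0, 0], [1, 2, 3]], dtype=np.uint32)
keys = morton3d(voxels[:, 0], voxels[:, 1], voxels[:, 2])
print(voxels[np.argsort(keys)])
```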
[196] A Plug-and-Play Learning-based IMU Bias Factor for Robust Visual-Inertial Odometry
Yang Yi, Kunqing Wang, Jinpu Zhang, Zhen Tan, Xiangke Wang, Hui Shen, Dewen Hu
Main category: cs.CV
TL;DR: Proposes IPNet, a plug-and-play module that infers IMU bias priors using raw IMU data to improve VIO robustness in challenging visual conditions.
Details
Motivation: Low-cost IMU bias estimation in VIO systems becomes unreliable when visual tracking fails, leading to localization errors and instability.
Method: Uses Inertial Prior Network (IPNet) with sliding window approach on raw IMU data to infer bias priors, eliminating dependency on visual features. Introduces iterative method for ground-truth bias computation in training.
Result: Significantly improves localization precision and robustness across two public datasets and a self-collected dataset.
Conclusion: The proposed method effectively prevents error propagation in challenging areas by providing reliable IMU bias priors independent of visual tracking quality.
Abstract: Accurate and reliable estimation of biases of low-cost Inertial Measurement Units (IMU) is a key factor in maintaining the resilience of Visual-Inertial Odometry (VIO), particularly when visual tracking fails in challenging areas. In such cases, bias estimates from the VIO can deviate significantly from the real values because of insufficient or erroneous vision features, compromising both localization accuracy and system stability. To address this challenge, we propose a novel plug-and-play module featuring the Inertial Prior Network (IPNet), which infers an IMU bias prior by implicitly capturing the motion characteristics of specific platforms. The core idea is inspired intuitively by the observation that different platforms exhibit distinctive motion patterns, while the integration of low-cost IMU measurements suffers from unbounded error that quickly accumulates over time. Therefore, these specific motion patterns can be exploited to infer the underlying IMU bias. In this work, we first directly infer the bias prior using only the raw IMU data with a sliding window approach, eliminating the dependency on recursive bias estimation combining visual features, thus effectively preventing error propagation in challenging areas. Moreover, to compensate for the lack of ground-truth bias in most visual-inertial datasets, we further introduce an iterative method to compute the mean per-sequence IMU bias for network training and release it to benefit society. The framework is trained and evaluated separately on two public datasets and a self-collected dataset. Extensive experiments show that our method significantly improves localization precision and robustness.
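As a sketch of the plug-and-play idea: a small network maps a raw-IMU sliding window directly to a bias prior, with no visual features in the loop. Everything below (the MLP architecture, window length, and MSE target) is an illustrative stand-in, not the paper's actual IPNet:

```python
import torch
import torch.nn as nn

class InertialPriorNet(nn.Module):
    """Toy stand-in for IPNet: map a raw-IMU sliding window to a bias prior.

    Input: (batch, window, 6) stacked gyroscope + accelerometer samples.
    Output: (batch, 6) gyro/accel bias prior, independent of visual features.
    """
    def __init__(self, window=200, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                          # (batch, window * 6)
            nn.Linear(window * 6, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),
        )

    def forward(self, imu_window):
        return self.net(imu_window)

# usage: supervise against a per-sequence mean bias (standing in for the
# paper's iteratively computed ground-truth bias)
model = InertialPriorNet()
imu = torch.randn(8, 200, 6)
target_bias = torch.zeros(8, 6)
loss = nn.functional.mse_loss(model(imu), target_bias)
loss.backward()
```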
[197] Multi-identity Human Image Animation with Structural Video Diffusion
Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Yuwei Guo, Dahua Lin, Tianfan Xue, Bo Dai
Main category: cs.CV
TL;DR: Structural Video Diffusion is a novel framework for generating realistic multi-human videos from a single image, addressing limitations of existing methods in handling complex multi-identity interactions.
Details
Motivation: Existing video generation methods work well for single humans but fail with multiple individuals and object interactions due to difficulties in associating correct human appearance-pose pairs and modeling 3D-aware dynamics.
Method: Introduces identity-specific embeddings for consistent appearances and structural learning with depth and surface-normal cues to model human-object interactions. Also expands dataset with 25K new multi-human interaction videos.
Result: Achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions.
Conclusion: The framework advances human-centric video generation by effectively handling complex multi-human scenarios with precise control and high visual quality.
Abstract: Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and model the distribution of 3D-aware dynamics. To address these limitations, we present \emph{Structural Video Diffusion}, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation. Code is available at https://github.com/zhenzhiwang/Multi-HumanVid
[198] Bolt3D: Generating 3D Scenes in Seconds
Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler
Main category: cs.CV
TL;DR: Bolt3D is a latent diffusion model that generates 3D scenes from images in under 7 seconds on a single GPU, achieving 300x faster inference than optimization-based methods.
Details
Motivation: To overcome the slow per-scene optimization required by prior multiview generative models for 3D reconstruction, which is computationally expensive and time-consuming.
Method: Leverages existing 2D diffusion network architectures to produce consistent 3D scene representations, trained on a large-scale multiview-consistent dataset created by applying state-of-the-art dense 3D reconstruction to existing multiview image datasets.
Result: Generates high-fidelity 3D scene representations directly from one or more images in less than 7 seconds on a single GPU, with up to 300 times faster inference compared to prior methods.
Conclusion: Bolt3D demonstrates that powerful 2D diffusion architectures can be effectively adapted for fast feed-forward 3D scene generation, significantly reducing computational costs while maintaining high quality.
Abstract: We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.
[199] CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image
Arindam Dutta, Meng Zheng, Zhongpai Gao, Benjamin Planche, Anwesha Choudhuri, Terrence Chen, Amit K. Roy-Chowdhury, Ziyan Wu
Main category: cs.CV
TL;DR: CHROME is a novel pipeline for reconstructing 3D clothed humans from single occluded images without requiring geometric priors or 3D supervision, achieving occlusion-resilience and multiview consistency.
Details
Motivation: Existing monocular clothed human reconstruction methods fail with occluded images and rely on hard-to-acquire geometric priors like SMPL annotations, leading to fragmented and inconsistent reconstructions in real-world scenarios.
Method: CHROME uses a multiview diffusion model to synthesize occlusion-free human images from occluded input with pose control for cross-view consistency, then trains a 3D reconstruction model to predict 3D Gaussians conditioned on both occluded input and synthesized views.
Result: CHROME achieves significant improvements in novel view synthesis (up to 3 dB PSNR) and geometric reconstruction under challenging occlusion conditions.
Conclusion: The proposed CHROME pipeline successfully addresses occlusion challenges in monocular 3D human reconstruction without requiring geometric priors or 3D supervision, producing multiview-consistent and accurate 3D representations.
Abstract: Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of both novel view synthesis (up to 3 dB PSNR) and geometric reconstruction under challenging conditions.
[200] SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians
Liam Schoneveld, Zhe Chen, Davide Davoli, Jiapeng Tang, Saimon Terazawa, Ko Nishino, Matthias Nießner
Main category: cs.CV
TL;DR: SHeaP is a self-supervised method that uses 2D Gaussians rigged to 3DMM meshes for improved 3D head reconstruction from monocular images/videos, outperforming existing approaches in geometry and expression accuracy.
Details
Motivation: To overcome limitations of differentiable mesh rendering in self-supervised 3D head reconstruction and address the scarcity of 3D ground truth data by leveraging abundant 2D videos.
Method: Predicts a 3DMM mesh and 2D Gaussians rigged to the mesh, reanimates the avatar to match target frames, and backpropagates photometric losses to both 3DMM and Gaussian prediction networks.
Result: Surpasses existing self-supervised approaches on NoW benchmark for neutral faces and new benchmark for non-neutral expressions, and produces highly expressive meshes that outperform state-of-the-art in emotion classification.
Conclusion: Using Gaussians for rendering substantially improves self-supervised 3D head reconstruction effectiveness, enabling high-quality geometry and expression modeling from 2D data alone.
Abstract: Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.
[201] X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction
Weihao Yu, Yuanhao Cai, Ruyi Zha, Zhiwen Fan, Chenxin Li, Yixuan Yuan
Main category: cs.CV
TL;DR: X^2-Gaussian is a novel framework for continuous-time 4D-CT reconstruction that integrates dynamic radiative Gaussian splatting with self-supervised respiratory motion learning, eliminating phase discretization and external gating devices.
Details
Motivation: Traditional 4D CT reconstruction methods use phase-binning workflows that discretize temporal resolution into fixed phases, causing motion misalignment and limiting clinical practicality due to dependency on respiratory gating devices.
Method: The approach uses a spatiotemporal encoder-decoder architecture to predict time-varying Gaussian deformations and introduces a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization.
Result: The method achieves state-of-the-art performance with a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques.
Conclusion: X^2-Gaussian advances high-fidelity 4D CT reconstruction by unifying continuous motion modeling with hardware-free period learning, enabling more practical dynamic clinical imaging.
Abstract: Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging. Code is publicly available at: https://x2-gaussian.github.io/.
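One plausible reading of the physiology-driven periodic consistency loss is that deformations evaluated one learned breathing period apart should agree, which makes the period itself recoverable by gradient descent. A toy sketch under that assumption; `deform` and the learnable `period` are stand-ins, and a real system would also need to guard against the trivial zero-period solution:

```python
import torch

def periodic_consistency_loss(deform, t, period):
    """Penalize disagreement between deformations one breathing period apart.

    deform: callable t -> per-Gaussian offsets, differentiable in t and period.
    period: learnable scalar tensor, recovered directly from the data.
    """
    return torch.mean((deform(t) - deform(t + period)) ** 2)

# toy usage: a sinusoidal deformation field with true period 4.0
true_period = 4.0
deform = lambda t: torch.sin(2 * torch.pi * t / true_period).unsqueeze(-1)
period = torch.tensor(3.0, requires_grad=True)
opt = torch.optim.Adam([period], lr=0.05)
for _ in range(500):
    t = torch.rand(256) * 20
    loss = periodic_consistency_loss(deform, t, period)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(period.item())  # approaches 4.0 (a multiple of the true period)
```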
[202] CMaP-SAM: Contraction Mapping Prior for SAM-driven Few-shot Segmentation
Shuai Chen, Fanman Meng, Liming Lei, Haoran Wei, Chenhao Wu, Qingbo Wu, Linfeng Xu, Hongliang Li
Main category: cs.CV
TL;DR: CMaP-SAM introduces contraction mapping theory to optimize position priors for SAM-driven few-shot segmentation, addressing structural correlation utilization and information loss issues in existing methods.
Details
Motivation: Recent FSS methods leveraging SAM face two critical limitations: insufficient utilization of structural correlations in query images, and significant information loss when converting continuous position priors to discrete point prompts.
Method: Three key components: (1) contraction mapping module formulating position prior optimization as Banach contraction mapping with convergence guarantees; (2) adaptive distribution alignment module bridging continuous priors with SAM’s binary mask prompt encoder; (3) foreground-background decoupled refinement architecture.
Result: Achieves state-of-the-art performance with 71.1 mIoU on PASCAL-$5^i$ and 56.1 on COCO-$20^i$ datasets.
Conclusion: CMaP-SAM effectively addresses the limitations of SAM-based FSS methods through contraction mapping theory, demonstrating superior performance on standard benchmarks.
Abstract: Few-shot segmentation (FSS) aims to segment new classes using few annotated images. While recent FSS methods have shown considerable improvements by leveraging Segment Anything Model (SAM), they face two critical limitations: insufficient utilization of structural correlations in query images, and significant information loss when converting continuous position priors to discrete point prompts. To address these challenges, we propose CMaP-SAM, a novel framework that introduces contraction mapping theory to optimize position priors for SAM-driven few-shot segmentation. CMaP-SAM consists of three key components: (1) a contraction mapping module that formulates position prior optimization as a Banach contraction mapping with convergence guarantees. This module iteratively refines position priors through pixel-wise structural similarity, generating a converged prior that preserves both semantic guidance from reference images and structural correlations in query images; (2) an adaptive distribution alignment module bridging continuous priors with SAM’s binary mask prompt encoder; and (3) a foreground-background decoupled refinement architecture producing accurate final segmentation masks. Extensive experiments demonstrate CMaP-SAM’s effectiveness, achieving state-of-the-art performance with 71.1 mIoU on PASCAL-$5^i$ and 56.1 on COCO-$20^i$ datasets. Code is available at https://github.com/Chenfan0206/CMaP-SAM.
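The contraction-mapping idea can be made concrete with a fixed-point iteration that blends the initial semantic prior with a structural-similarity propagation. The sketch below is generic rather than CMaP-SAM's actual module: with a row-stochastic affinity matrix and blend factor alpha < 1, the update is a Banach contraction in the sup norm, so the iteration converges to a unique refined prior.

```python
import numpy as np

def refine_prior(prior, affinity, alpha=0.5, iters=50, tol=1e-6):
    """Fixed-point refinement of a position prior.

    T(p) = (1 - alpha) * prior + alpha * A @ p is a contraction when A is
    row-stochastic (non-expansive) and alpha < 1, so the iteration converges.
    affinity: (n, n) row-normalized pixel structural-similarity matrix.
    prior: (n,) initial semantic guidance from reference images.
    """
    p = prior.copy()
    for _ in range(iters):
        p_next = (1 - alpha) * prior + alpha * affinity @ p
        if np.abs(p_next - p).max() < tol:
            break
        p = p_next
    return p

# toy usage: 4 "pixels" whose neighbors share mass; the prior fires on pixel 0
A = np.array([[.6, .4, 0, 0], [.3, .4, .3, 0], [0, .3, .4, .3], [0, 0, .4, .6]])
print(refine_prior(np.array([1., 0., 0., 0.]), A))
```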
[203] Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
Xingguang Wei, Haomin Wang, Shenglong Ye, Ruifeng Luo, Yanting Zhang, Lixin Gu, Jifeng Dai, Yu Qiao, Wenhai Wang, Hongjie Zhang
Main category: cs.CV
TL;DR: VecFormer introduces a line-based representation for panoptic symbol spotting in CAD drawings, achieving state-of-the-art performance with 91.1 PQ through a Branch Fusion Refinement module that resolves inconsistencies between instance and semantic predictions.
Details
Motivation: Existing methods for panoptic symbol spotting in CAD drawings suffer from high computational costs, limited generality, and loss of geometric structural information due to reliance on image rasterization, graph construction, or point-based representation.
Method: Proposes VecFormer with line-based representation of primitives to preserve geometric continuity and maintain computation-friendly structure, plus a Branch Fusion Refinement module that integrates instance and semantic predictions to resolve inconsistencies.
Result: Achieves new state-of-the-art with 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over second-best results under settings with and without prior information respectively.
Conclusion: Line-based representation shows strong potential as a foundation for vector graphic understanding, effectively addressing challenges in panoptic symbol spotting while maintaining computational efficiency and geometric accuracy.
Abstract: We study the task of panoptic symbol spotting, which involves identifying both individual instances of countable things and the semantic regions of uncountable stuff in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation, but these approaches often suffer from high computational costs, limited generality, and loss of geometric structural information. In this paper, we propose VecFormer, a novel method that addresses these challenges through line-based representation of primitives. This design preserves the geometric continuity of the original primitive, enabling more accurate shape representation while maintaining a computation-friendly structure, making it well-suited for vector graphic understanding tasks. To further enhance prediction reliability, we introduce a Branch Fusion Refinement module that effectively integrates instance and semantic predictions, resolving their inconsistencies for more coherent panoptic outputs. Extensive experiments demonstrate that our method establishes a new state-of-the-art, achieving 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results under settings with and without prior information, respectively, highlighting the strong potential of line-based representation as a foundation for vector graphic understanding.
[204] FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
Hao Yin, Lijun Gu, Paritosh Parmar, Lin Xu, Tianxiao Guo, Weiwei Fu, Yang Zhang, Tianyou Zheng
Main category: cs.CV
TL;DR: FLEX is a large-scale multimodal dataset for fitness Action Quality Assessment (AQA) that includes synchronized RGB video, 3D pose, sEMG, and physiological signals from 38 subjects performing 20 weight-loaded exercises, with expert annotations organized in a Fitness Knowledge Graph.
Details
Motivation: Existing AQA datasets are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions, which is critical for detecting errors in gym weight training to prevent injuries and maximize gains.
Method: Collected over 7,500 multiview recordings with synchronized multimodal data, organized expert annotations into a Fitness Knowledge Graph linking actions, key steps, error types, and feedback, and introduced FLEX-VideoQA benchmark for cross-modal reasoning.
Result: Baseline experiments show that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance, enabling multimodal fusion, cross-modal prediction including Video→EMG task, and biomechanically oriented representation learning.
Conclusion: FLEX advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching, with the dataset and code publicly available.
Abstract: Action Quality Assessment (AQA) – the task of quantifying how well an action is performed – has great potential for detecting errors in gym weight training, where accurate feedback is critical to prevent injuries and maximize gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) linking actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction – including the novel Video$\rightarrow$EMG task – and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments demonstrate that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at https://github.com/HaoYin116/FLEX. Project page: https://haoyin116.github.io/FLEX_Dataset.
[205] Refer to Any Segmentation Mask Group With Vision-Language Prompts
Shengcao Cao, Zijun Wei, Jason Kuen, Kangning Liu, Lingzhi Zhang, Jiuxiang Gu, HyunJoon Jung, Liang-Yan Gui, Yu-Xiong Wang
Main category: cs.CV
TL;DR: Introduces Omnimodal Referring Expression Segmentation (ORES) - a new task for segmenting image regions based on complex text-only or text+visual prompts, and proposes RAS framework using mask-centric multimodal models.
Details
Motivation: Current segmentation models lack comprehensive semantic understanding for complex vision-language queries, limiting their effectiveness in user-friendly interactive applications.
Method: Proposes RAS framework that augments segmentation models with complex multimodal interactions via a mask-centric large multimodal model, trained on new datasets MaskGroups-2M and MaskGroups-HQ.
Result: Demonstrates superior performance on the new ORES task as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks.
Conclusion: The RAS framework successfully bridges the gap in multimodal semantic understanding for segmentation tasks, enabling more comprehensive and user-friendly vision-language interactions.
Abstract: Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to “Refer to Any Segmentation Mask Group” (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.
[206] SIRI-Bench: Challenging VLMs’ Spatial Intelligence through Complex Reasoning Tasks
Zijian Song, Xiaoxin Lin, Qiuming Huang, Guangrun Wang, Liang Lin
Main category: cs.CV
TL;DR: SIRI-Bench is a new benchmark for evaluating Vision-Language Models’ structural spatial intelligence through 9,000 video-question-answer triplets in realistic 3D scenes, revealing that current VLMs struggle with spatial reasoning tasks.
Details
Motivation: While LLMs have advanced through reinforcement learning on reasoning tasks, spatial intelligence in VLMs remains underexplored despite being fundamental for real-world interaction.
Method: Developed SIRI-Bench with 9,000 video-QA triplets in realistic 3D scenes, using an Automatic Scene Creation Engine with collaborative LLM agents to translate abstract mathematical problems into 3D scenes.
Result: State-of-the-art VLMs struggle significantly on SIRI-Bench, highlighting the challenge of structural spatial reasoning.
Conclusion: The study aims to bring attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
Abstract: Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs’ structural spatial intelligence through spatial-grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will bring researchers’ attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
[207] TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving
Yiming Yang, Yueru Luo, Bingkun He, Hongbin Lin, Suzhong Fu, Chao Zheng, Zhipeng Cao, Erlong Li, Chao Yan, Shuguang Cui, Zhen Li
Main category: cs.CV
TL;DR: TopoStreamer is an end-to-end temporal perception model for lane segment topology reasoning that improves road network reconstruction through streaming attribute constraints, dynamic positional encoding, and lane segment denoising.
Details
Motivation: Existing methods for lane segment topology reasoning suffer from limitations in consistent positional embedding and temporal multiple attribute learning, hindering accurate road network reconstruction needed for autonomous driving maneuvers.
Method: TopoStreamer introduces three key improvements: streaming attribute constraints for temporal consistency, dynamic lane boundary positional encoding for up-to-date positional information, and lane segment denoising to capture diverse lane patterns.
Result: On the OpenLane-V2 dataset, TopoStreamer achieves significant improvements over state-of-the-art methods: +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.
Conclusion: TopoStreamer effectively addresses the limitations of existing methods and demonstrates substantial performance gains in lane segment topology reasoning for autonomous driving applications.
Abstract: Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.
[208] iLRM: An Iterative Large 3D Reconstruction Model
Gyeongjin Kang, Seungtae Nam, Seungkwon Yang, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, Eunbyung Park
Main category: cs.CV
TL;DR: iLRM introduces an iterative 3D reconstruction model that generates 3D Gaussian representations through iterative refinement, addressing scalability issues in transformer-based methods by decoupling scene representation, using two-stage attention, and injecting high-resolution information.
Details
Motivation: Existing transformer-based 3D reconstruction methods suffer from severe scalability issues due to full attention across image tokens from multiple input views, leading to prohibitive computational costs as view count or resolution increases.
Method: Uses iterative refinement with three core principles: (1) decoupling scene representation from input-view images, (2) decomposing multi-view interactions into two-stage attention to reduce computation, and (3) injecting high-resolution information at every layer.
Result: Outperforms existing methods on RE10K and DL3DV datasets in both reconstruction quality and speed.
Conclusion: iLRM provides a scalable and efficient feed-forward 3D reconstruction approach that overcomes computational bottlenecks while achieving high-fidelity results.
Abstract: Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed.
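The two-stage attention decomposition can be sketched as follows: latent scene tokens cross-attend to each view separately, then self-attend among themselves, so no attention matrix ever spans all V×N image tokens at once. A minimal PyTorch sketch under that reading; the dimensions, residual wiring, and omission of norms and MLPs are all simplifications:

```python
import torch
import torch.nn as nn

class TwoStageBlock(nn.Module):
    """Latents cross-attend to each view separately, then self-attend globally.

    Avoids full attention over all V * N image tokens: cost drops from
    O((V*N)^2) to O(V * L * N + L^2) for L latent tokens.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, latents, views):
        # stage 1: per-view cross-attention (views: list of (B, N, dim))
        for v in views:
            latents = latents + self.cross(latents, v, v)[0]
        # stage 2: global self-attention among the compact latent set
        return latents + self.self_attn(latents, latents, latents)[0]

block = TwoStageBlock()
latents = torch.randn(2, 64, 256)                       # B=2, L=64 scene tokens
views = [torch.randn(2, 1024, 256) for _ in range(4)]   # 4 input views
print(block(latents, views).shape)                      # torch.Size([2, 64, 256])
```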
[209] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models
Yiming Yang, Hongbin Lin, Yueru Luo, Suzhong Fu, Chao Zheng, Xinrui Yan, Shuqi Mei, Kun Tang, Shuguang Cui, Zhen Li
Main category: cs.CV
TL;DR: FASTopoWM is a fast-slow lane segment topology reasoning framework that uses latent world models to improve temporal perception in autonomous driving, achieving state-of-the-art performance on lane detection and centerline perception.
Details
Motivation: Existing lane topology reasoning methods fail to effectively leverage temporal information and are vulnerable to pose estimation failures, limiting their performance in autonomous driving systems.
Method: Proposes a unified fast-slow framework with parallel supervision of historical and new queries, and introduces latent query and BEV world models conditioned on action latent to propagate state representations across timesteps.
Result: Outperforms state-of-the-art methods on OpenLane-V2 benchmark: 37.4% vs 33.6% mAP for lane segment detection and 46.3% vs 41.5% OLS for centerline perception.
Conclusion: FASTopoWM effectively addresses limitations of existing methods by incorporating temporal propagation through latent world models and parallel query supervision, significantly improving lane topology reasoning performance.
Abstract: Lane segment topology reasoning provides comprehensive bird’s-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% vs. 33.6% on mAP) and centerline perception (46.3% vs. 41.5% on OLS).
[210] Low-Frequency First: Eliminating Floating Artifacts in 3D Gaussian Splatting
Jianchao Wang, Peng Zhou, Cen Li, Rong Quan, Jie Qin
Main category: cs.CV
TL;DR: EFA-GS addresses floating artifacts in 3D Gaussian Splatting by identifying under-optimized Gaussians as the root cause and selectively expanding them to improve low-frequency learning while preserving high-frequency details.
Details
Motivation: 3D Gaussian Splatting often produces floating artifacts that degrade visual quality, especially with low-quality initialization. The underlying mechanisms causing these artifacts were not fully understood.
Method: Proposed EFA-GS, which selectively expands under-optimized Gaussians to prioritize accurate low-frequency learning, complemented by depth-based and scale-based strategies for dynamic Gaussian refinement.
Result: EFA-GS substantially reduces floating artifacts while preserving high-frequency details, achieving 1.68 dB PSNR improvement over baseline on RWLQ dataset, and shows effectiveness in downstream 3D editing tasks.
Conclusion: The frequency-domain analysis reveals under-optimized Gaussians as the primary source of floating artifacts, and EFA-GS effectively mitigates these artifacts through selective expansion and dynamic refinement strategies.
Abstract: 3D Gaussian Splatting (3DGS) is a powerful and computationally efficient representation for 3D reconstruction. Despite its strengths, 3DGS often produces floating artifacts, which are erroneous structures detached from the actual geometry and significantly degrade visual fidelity. The underlying mechanisms causing these artifacts, particularly in low-quality initialization scenarios, have not been fully explored. In this paper, we investigate the origins of floating artifacts from a frequency-domain perspective and identify under-optimized Gaussians as the primary source. Based on our analysis, we propose \textit{Eliminating-Floating-Artifacts} Gaussian Splatting (EFA-GS), which selectively expands under-optimized Gaussians to prioritize accurate low-frequency learning. Additionally, we introduce complementary depth-based and scale-based strategies to dynamically refine Gaussian expansion, effectively mitigating detail erosion. Extensive experiments on both synthetic and real-world datasets demonstrate that EFA-GS substantially reduces floating artifacts while preserving high-frequency details, achieving an improvement of 1.68 dB in PSNR over the baseline method on our RWLQ dataset. Furthermore, we validate the effectiveness of our approach in downstream 3D editing tasks. Project Website: https://jcwang-gh.github.io/EFA-GS
[211] Leveraging Learning Bias for Noisy Anomaly Detection
Yuxin Zhang, Yunkang Cao, Yuqi Cheng, Yihan Sun, Weiming Shen
Main category: cs.CV
TL;DR: A two-stage framework for fully unsupervised image anomaly detection that leverages inherent learning bias to handle contaminated training data, achieving superior performance on the Real-IAD benchmark.
Details
Motivation: Real-world training data often contains unlabeled anomalies, causing conventional methods that assume anomaly-free data to degrade in performance by absorbing anomalies as normal patterns.
Method: Two-stage framework: Stage 1 partitions training data, trains sub-models, aggregates cross-model anomaly scores to filter a purified dataset; Stage 2 trains final detector on the purified dataset. Exploits learning bias from statistical dominance of normal samples and feature-space divergence.
Result: Demonstrates superior anomaly detection and localization performance on Real-IAD benchmark under different noise conditions. Ablation studies validate contamination resilience and effectiveness of learning bias exploitation.
Conclusion: The model-agnostic framework provides a practical solution for real-world scenarios with imperfect training data, effectively handling contamination through systematic exploitation of learning bias.
Abstract: This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework’s contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at https://github.com/hustzhangyuxin/LLBNAD.
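Stage 1's filtering can be pictured as scoring every training sample with several sub-models trained on disjoint partitions, averaging the scores, and keeping the low-scoring (stable, normal-looking) fraction. A minimal sketch of that aggregation; the keep ratio is an assumed hyperparameter:

```python
import numpy as np

def purify_training_set(scores, keep_ratio=0.8):
    """Filter likely anomalies from a contaminated training set.

    scores: (n_submodels, n_samples) anomaly scores, each row from a sub-model
    trained on a different partition. Normal samples get consistently low
    scores (the learning bias), so the cross-model mean separates contamination.
    """
    agg = scores.mean(axis=0)
    cutoff = np.quantile(agg, keep_ratio)
    return np.where(agg <= cutoff)[0]   # indices of the purified subset

# toy usage: 3 sub-models, 10 samples, samples 8-9 are contaminated
rng = np.random.default_rng(0)
s = rng.uniform(0.0, 0.3, size=(3, 10))
s[:, 8:] += 0.7
print(purify_training_set(s))  # keeps (most of) the first eight indices
```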
[212] First-order State Space Model for Lightweight Image Super-resolution
Yujie Zhu, Xinyi Zhang, Yekai Lu, Guang Yang, Faming Fang, Guixu Zhang
Main category: cs.CV
TL;DR: FSSM improves Mamba-based vision models for lightweight super-resolution by modifying SSM calculation to incorporate token correlations, achieving state-of-the-art performance without additional parameters.
Details
Motivation: Most Mamba-based vision models focus on architecture and scan paths, neglecting the SSM module's potential. The authors aim to enhance SSM performance for lightweight super-resolution tasks.
Method: Introduce First-order State Space Model (FSSM) by applying first-order hold condition in SSMs, deriving new discretized form, and analyzing cumulative error to improve token correlations.
Result: FSSM improves MambaIR performance on five benchmark datasets without increasing parameters, surpassing current lightweight SR methods with state-of-the-art results.
Conclusion: The proposed FSSM effectively enhances SSM performance for vision tasks, demonstrating the importance of optimizing the core SSM module rather than just network architecture.
Abstract: State space models (SSMs), particularly Mamba, have shown promise in NLP tasks and are increasingly applied to vision tasks. However, most Mamba-based vision models focus on network architecture and scan paths, with little attention to the SSM module. In order to explore the potential of SSMs, we modify the calculation process of SSM without increasing the number of parameters to improve the performance on lightweight super-resolution tasks. In this paper, we introduce the First-order State Space Model (FSSM) to improve the original Mamba module, enhancing performance by incorporating token correlations. We apply a first-order hold condition in SSMs, derive the new discretized form, and analyze the cumulative error. Extensive experimental results demonstrate that FSSM improves the performance of MambaIR on five benchmark datasets without increasing the number of parameters, and surpasses current lightweight SR methods, achieving state-of-the-art results.
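For context, vanilla SSMs discretize the continuous system $x'(t) = A x(t) + B u(t)$ under a zero-order hold, treating the input as constant over each step of size $\Delta$; a first-order hold instead interpolates the input linearly between samples, coupling adjacent tokens $u_k$ and $u_{k+1}$. The standard control-theoretic forms are sketched below; the paper derives its own discretized form and error analysis, which may differ:

```latex
% Zero-order hold (the usual Mamba discretization):
x_{k+1} = \bar{A} x_k + \bar{B} u_k, \qquad
\bar{A} = e^{\Delta A}, \qquad
\bar{B} = A^{-1}\!\left(e^{\Delta A} - I\right) B

% First-order hold: linear interpolation of the input adds a correction term
% that mixes neighboring tokens:
x_{k+1} = \bar{A} x_k + \bar{B} u_k
        + \left[\tfrac{1}{\Delta} A^{-2}\!\left(e^{\Delta A} - I\right) - A^{-1}\right]
          B \left(u_{k+1} - u_k\right)
```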
[213] Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang
Main category: cs.CV
TL;DR: A two-stage reinforcement learning framework that enhances both perceptual and reasoning capabilities of vision-language models, addressing the limitation of directly applying LLM RL methods to VLMs.
Details
Motivation: RL has been effective for LLMs but directly applying it to VLMs is suboptimal due to the added complexity of visual perception requirements in VLM tasks.
Method: Two-stage RL framework with dataset-level sampling to mitigate vanishing advantage. Stage 1 focuses on visual perception improvement through coarse- and fine-grained understanding. Stage 2 enhances reasoning abilities.
Result: Developed PeBR-R1 model that shows superior performance across seven benchmark datasets on diverse visual reasoning tasks.
Conclusion: The proposed two-stage RL framework effectively enhances both perceptual and reasoning capabilities of VLMs, demonstrating significant improvements over existing approaches.
Abstract: Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model’s visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.
[214] Dataset Distillation for Super-Resolution without Class Labels and Pre-trained Models
Sunwoo Cho, Yejin Jung, Nam Ik Cho, Jae Woong Soh
Main category: cs.CV
TL;DR: A new data distillation method for image super-resolution that eliminates the need for class labels or pre-trained SR models by using CLIP features and diffusion model fine-tuning, achieving state-of-the-art performance with significantly reduced training data and computation time.
Details
Motivation: Current data distillation methods for single image super-resolution heavily depend on pre-trained SR networks and class-specific information, which limits their generalizability and applicability. The goal is to develop a more flexible approach that doesn't require these dependencies.
Method: Extract high-gradient patches and categorize images based on CLIP features, then fine-tune a diffusion model on the selected patches to learn their distribution and synthesize distilled training images.
Result: Achieves state-of-the-art performance with only 0.68% of the original dataset, with performance drop of just 0.3 dB. Diffusion model fine-tuning takes 4 hours and SR model training completes within 1 hour, compared to 11-hour training with full dataset.
Conclusion: The proposed data distillation approach effectively reduces training data requirements and computational time while maintaining high performance, making it a practical solution for efficient deep neural network training in image super-resolution tasks.
Abstract: Training deep neural networks has become increasingly demanding, requiring large datasets and significant computational resources, especially as model complexity advances. Data distillation methods, which aim to improve data efficiency, have emerged as promising solutions to this challenge. In the field of single image super-resolution (SISR), the reliance on large training datasets highlights the importance of these techniques. Recently, a generative adversarial network (GAN) inversion-based data distillation framework for SR was proposed, showing potential for better data utilization. However, the current method depends heavily on pre-trained SR networks and class-specific information, limiting its generalizability and applicability. To address these issues, we introduce a new data distillation approach for image SR that does not need class labels or pre-trained SR models. In particular, we first extract high-gradient patches and categorize images based on CLIP features, then fine-tune a diffusion model on the selected patches to learn their distribution and synthesize distilled training images. Experimental results show that our method achieves state-of-the-art performance while using significantly less training data and requiring less computational time. Specifically, when we train a baseline Transformer model for SR with only 0.68% of the original dataset, the performance drop is just 0.3 dB. In this case, diffusion model fine-tuning takes 4 hours, and SR model training completes within 1 hour, much shorter than the 11-hour training time with the full dataset.
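The patch-selection front end is straightforward to sketch: rank non-overlapping patches by mean gradient magnitude and keep the most textured ones; CLIP-feature grouping and diffusion fine-tuning would then operate on the survivors. The patch size and `top_k` below are arbitrary illustrative choices, not the paper's settings:

```python
import numpy as np

def high_gradient_patches(img, patch=32, top_k=8):
    """Rank non-overlapping patches by mean gradient magnitude (texture proxy)."""
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    h, w = img.shape[0] // patch, img.shape[1] // patch
    # per-patch mean gradient magnitude, shape (h, w)
    scores = mag[: h * patch, : w * patch].reshape(h, patch, w, patch).mean(axis=(1, 3))
    ij = np.dstack(np.unravel_index(
        np.argsort(scores, axis=None)[::-1][:top_k], scores.shape))[0]
    return [img[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] for i, j in ij]

# toy usage on a synthetic image with one textured quadrant
img = np.zeros((128, 128))
img[:64, :64] = np.random.default_rng(0).random((64, 64))
patches = high_gradient_patches(img)
print(len(patches), patches[0].shape)  # the textured quadrant ranks first
```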
[215] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Xiaochun Cao
Main category: cs.CV
TL;DR: EAGLE is a lightweight black-box framework for explaining token generation in multimodal LLMs, providing spatial attribution and modality-aware analysis to quantify visual vs language influences.
Details
Motivation: Current MLLMs lack understanding of how generated tokens depend on visual modalities, limiting interpretability and reliability.
Method: Uses an objective function unifying sufficiency and indispensability scores, optimized via greedy search over sparsified image regions for efficient attribution.
Result: Outperforms existing methods in faithfulness, localization, and hallucination diagnosis while requiring substantially less GPU memory.
Conclusion: EAGLE effectively advances MLLM interpretability through practical and efficient attribution framework.
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code will be released at https://ruoyuchen10.github.io/EAGLE/.
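In spirit, the attribution search is a greedy loop: repeatedly add the region that most increases a blend of sufficiency (the token score when only the chosen regions are visible) and indispensability (the score drop when they are removed). The sketch below is a generic version of that objective with a black-box `score_fn`, not EAGLE's exact insight/necessity formulation:

```python
import numpy as np

def greedy_attribution(score_fn, num_regions, budget=5, lam=1.0):
    """Greedily pick regions that explain a token under a black-box score.

    score_fn(mask) -> model confidence for the explained token when only the
    masked-in regions are visible (any callable works in this sketch).
    """
    full = score_fn(np.ones(num_regions, bool))   # score with everything visible
    chosen = []
    for _ in range(budget):
        best, best_gain = None, -np.inf
        for r in range(num_regions):
            if r in chosen:
                continue
            keep = np.zeros(num_regions, bool)
            keep[chosen + [r]] = True
            # sufficiency: the keep-set alone scores high;
            # indispensability: removing it from the full image hurts
            gain = score_fn(keep) + lam * (full - score_fn(~keep))
            if gain > best_gain:
                best, best_gain = r, gain
        chosen.append(best)
    return chosen

# toy usage: regions 2 and 7 carry all the evidence
truth = {2, 7}
score = lambda m: sum(m[i] for i in truth) / len(truth)
print(greedy_attribution(score, num_regions=10, budget=2))  # -> [2, 7]
```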
[216] RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement
Libo Zhu, Zihan Zhou, Xiaoyang Liu, Weihang Zhang, Keyu Shi, Yifan Fu, Yulun Zhang
Main category: cs.CV
TL;DR: RIFLE is a diffusion-based framework that removes flicker-banding artifacts from photos of emissive displays while preserving fine details, using a flicker-banding prior estimator and masked loss.
Details
Motivation: Flicker-banding artifacts frequently occur when photographing emissive displays due to temporal aliasing between camera rolling-shutter and display modulation, severely impacting readability and quality, yet this problem remains underexplored compared to moire degradation.
Method: Proposes RIFLE framework with flicker-banding prior estimator (FPE) to predict banding attributes, masked loss to focus supervision on banded regions, and a simulation pipeline to generate realistic FB data with stochastic jitter in banding parameters.
Result: RIFLE consistently outperforms recent image reconstruction baselines across quantitative metrics and visual comparisons on real-world dataset, from mild to severe flicker-banding cases.
Conclusion: This is the first work to systematically research flicker-banding simulation and removal, establishing foundation for subsequent research in dataset construction and removal model design.
Abstract: Capturing screens is now routine in everyday life, but photographs of emissive displays are often affected by flicker-banding (FB): alternating bright–dark stripes that arise from temporal aliasing between a camera’s rolling-shutter readout and the display’s brightness modulation. Unlike moiré degradation, which has been extensively studied, FB remains underexplored despite its frequent and severe impact on readability and perceived quality. We formulate FB removal as a dedicated restoration task and introduce RIFLE (Removal of Image Flicker-Banding via Latent Diffusion Enhancement), a diffusion-based framework designed to remove FB while preserving fine details. We propose a flicker-banding prior estimator (FPE) that predicts key banding attributes and injects them into the restoration network. Additionally, a Masked Loss (ML) is proposed to concentrate supervision on banded regions without sacrificing global fidelity. To overcome data scarcity, we provide a simulation pipeline that synthesizes FB in the luminance domain with stochastic jitter in banding angle, spacing, and width; feathered boundaries and sensor noise are also applied for a more realistic simulation. For evaluation, we collect a paired real-world FB dataset with pixel-aligned, banding-free references captured via long exposure. Across quantitative metrics and visual comparisons on this dataset, RIFLE consistently outperforms recent image reconstruction baselines from mild to severe flicker-banding. To the best of our knowledge, this is the first work to study the simulation and removal of FB, establishing a foundation for subsequent research in both dataset construction and removal model design. Our dataset and code will be released soon.
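A toy version of the banding simulator described above is easy to sketch: luminance-domain stripes with randomly jittered angle, spacing, and width, feathered edges, and mild sensor noise. All parameter ranges below are illustrative guesses, not the paper's simulation settings.

```python
# Toy flicker-banding simulator; ranges are illustrative assumptions.
import numpy as np

def add_flicker_banding(img: np.ndarray, rng=None):
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    angle = rng.uniform(-0.2, 0.2)      # banding angle (radians)
    spacing = rng.uniform(24, 64)       # pixels per banding period
    width = rng.uniform(0.2, 0.5)       # dark fraction of a period
    yy, xx = np.mgrid[0:h, 0:w]
    phase = (np.cos(angle) * yy + np.sin(angle) * xx) / spacing
    frac = phase - np.floor(phase)      # position within a period
    # Cosine falloff gives a feathered (soft-edged) dark band.
    band = 0.5 * (1 + np.cos(np.clip(frac / width, 0, 1) * np.pi))
    gain = 1.0 - 0.4 * band             # darken by up to 40%
    noisy = img * gain[..., None] + rng.normal(0, 0.01, img.shape)
    return np.clip(noisy, 0, 1)

banded = add_flicker_banding(np.full((128, 128, 3), 0.8))
```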
[217] Towards Reliable and Holistic Visual In-Context Learning Prompt Selection
Wenxiao Wu, Jing-Hao Xue, Chengming Xu, Chen Liu, Xinwei Sun, Changxin Gao, Nong Sang, Yanwei Fu
Main category: cs.CV
TL;DR: RH-Partial2Global improves VICL by addressing limitations in current methods through reliable alternative set construction and comprehensive pairwise preference sampling.
Details
Motivation: Current VICL methods rely on the unproven similarity-priority assumption and use random sampling for pairwise preferences, leading to incomplete coverage and redundant comparisons that degrade global ranking quality.
Method: RH-Partial2Global uses jackknife conformal prediction to construct reliable alternative sets and covering design-based sampling to ensure comprehensive and uniform coverage of pairwise preferences.
Result: Extensive experiments show RH-Partial2Global achieves excellent performance and outperforms Partial2Global across diverse visual tasks.
Conclusion: The proposed method provides a more reliable and holistic approach for selecting in-context examples in Visual In-Context Learning, addressing fundamental limitations of existing methods.
Abstract: Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks, by effectively exploiting contextual information embedded in in-context examples, which can be formulated as a global ranking problem of potential candidates. Current VICL methods, such as Partial2Global and VPR, are grounded in the similarity-priority assumption that images more visually similar to a query image serve as better in-context examples. This foundational assumption, while intuitive, lacks sufficient justification for its efficacy in selecting optimal in-context examples. Furthermore, Partial2Global constructs its global ranking from a series of randomly sampled pairwise preference predictions. Such a reliance on random sampling can lead to incomplete coverage and redundant samplings of comparisons, thus further adversely impacting the final global ranking. To address these issues, this paper introduces an enhanced variant of Partial2Global designed for reliable and holistic selection of in-context examples in VICL. Our proposed method, dubbed RH-Partial2Global, leverages a jackknife conformal prediction-guided strategy to construct reliable alternative sets and a covering design-based sampling approach to ensure comprehensive and uniform coverage of pairwise preferences. Extensive experiments demonstrate that RH-Partial2Global achieves excellent performance and outperforms Partial2Global across diverse visual tasks.
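To make the "reliable alternative set" idea concrete, here is a minimal conformal-prediction sketch, assuming nonconformity scores have already been computed (e.g., via leave-one-out/jackknife residuals). The thresholding rule is the generic finite-sample-corrected quantile, not the authors' exact procedure.

```python
# Generic conformal sketch for a reliable alternative set; not the paper's code.
import numpy as np

def conformal_alternative_set(calib_scores, cand_scores, alpha=0.1):
    """Keep candidates whose nonconformity score falls within the
    finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(calib_scores)
    level = min(1.0, np.ceil((1 - alpha) * (n + 1)) / n)
    q = np.quantile(calib_scores, level)
    return [i for i, s in enumerate(cand_scores) if s <= q]

calib = np.random.rand(100)                  # stand-in jackknife scores
print(conformal_alternative_set(calib, [0.05, 0.5, 0.99], alpha=0.1))
```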
[218] Normal-Abnormal Guided Generalist Anomaly Detection
Yuexin Wang, Xiaolei Wang, Yizheng Gong, Jimin Xiao
Main category: cs.CV
TL;DR: Proposes NAGL framework for generalist anomaly detection using both normal and abnormal samples as references, outperforming previous methods that only used normal samples.
Details
Motivation: Previous GAD methods only used normal samples as references, ignoring valuable information from anomalous samples that are often available in real-world scenarios.
Method: Normal-Abnormal Generalist Learning (NAGL) framework with two components: Residual Mining (RM) extracts abnormal patterns from reference residuals, and Anomaly Feature Learning (AFL) adaptively learns anomaly features through residual mapping.
Result: Extensive experiments across multiple benchmarks demonstrate significant performance improvements over existing GAD approaches.
Conclusion: First work to use mixture of normal and abnormal samples as references in generalist anomaly detection, enabling more accurate and efficient cross-domain anomaly detection.
Abstract: Generalist Anomaly Detection (GAD) aims to train a unified model on an original domain that can detect anomalies in new target domains. Previous GAD methods primarily use only normal samples as references, overlooking the valuable information contained in anomalous samples that are often available in real-world scenarios. To address this limitation, we propose a more practical approach: normal-abnormal-guided generalist anomaly detection, which leverages both normal and anomalous samples as references to guide anomaly detection across diverse domains. We introduce the Normal-Abnormal Generalist Learning (NAGL) framework, consisting of two key components: Residual Mining (RM) and Anomaly Feature Learning (AFL). RM extracts abnormal patterns from normal-abnormal reference residuals to establish transferable anomaly representations, while AFL adaptively learns anomaly features in query images through residual mapping to identify instance-aware anomalies. Our approach effectively utilizes both normal and anomalous references for more accurate and efficient cross-domain anomaly detection. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing GAD approaches. This work represents the first to adopt a mixture of normal and abnormal samples as references in generalist anomaly detection. The code and datasets are available at https://github.com/JasonKyng/NAGL.
[219] ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation
Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling
Main category: cs.CV
TL;DR: ChronoEdit reframes image editing as video generation to ensure physical consistency by treating input and edited images as video frames and leveraging pretrained video models to capture temporal dynamics and physics.
Details
Motivation: Current image editing and generation models lack physical consistency, which is crucial for world simulation tasks where edited objects must remain coherent and physically plausible.
Method: Treats input and edited images as first/last video frames, uses pretrained video models for temporal consistency, introduces temporal reasoning with joint denoising of target frame and reasoning tokens to imagine plausible editing trajectories, then drops tokens to avoid full video rendering costs.
Result: Outperforms state-of-the-art baselines in both visual fidelity and physical plausibility on the new PBench-Edit benchmark for physically consistent image editing.
Conclusion: ChronoEdit successfully bridges the gap in physical consistency for image editing by leveraging video generation principles and temporal reasoning, providing a more physically plausible editing framework.
Abstract: Recent advances in large generative models have greatly enhanced both image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Project page for code and models: https://research.nvidia.com/labs/toronto-ai/chronoedit
[220] D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Jisu Han, Wonjun Hwang
Main category: cs.CV
TL;DR: The paper proposes dimensional entropy maximization to improve calibration in test-time prompt tuning for Vision-Language Models by addressing modality gap issues caused by dominant feature dimensions.
Details
Motivation: Test-time adaptation provides flexibility for domain shifts, but contrastive VLMs suffer from modality gaps where single dominant feature dimensions across modalities degrade calibration performance during test-time prompt tuning.
Method: Proposes dimensional entropy maximization, which regularizes textual feature distributions toward uniformity to mitigate dependency on dominant dimensions and reduce calibration error.
Result: The method alleviates calibration performance degradation in test-time prompt tuning, enhancing reliability of VLMs in real-world deployment scenarios.
Conclusion: Dimensional entropy maximization offers a simple yet effective solution to improve VLM reliability by addressing modality gap issues through regularization of dominant feature dimensions.
Abstract: Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency of dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
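A minimal sketch of a dimensional-entropy regularizer is below. It treats the per-dimension energy of the textual features as a distribution and maximizes its entropy so no single dimension dominates; the normalization choice is our assumption, and the paper's exact formulation may differ.

```python
# Sketch of a dimensional-entropy regularizer under stated assumptions.
import torch

def dimensional_entropy_loss(text_feats: torch.Tensor) -> torch.Tensor:
    """text_feats: (num_prompts, dim). Returns a loss whose minimization
    maximizes the entropy of the per-dimension energy distribution."""
    energy = text_feats.pow(2).mean(dim=0)        # (dim,)
    p = energy / energy.sum()                     # distribution over dims
    entropy = -(p * (p + 1e-12).log()).sum()
    return -entropy                               # minimize -> maximize H

loss = dimensional_entropy_loss(torch.randn(16, 512))
```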
[221] VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, Deqiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, Caifeng Shan
Main category: cs.CV
TL;DR: A distillation-based framework that equips pretrained vision-language models with action-execution capability by transferring knowledge from small action models, avoiding expensive pretraining while achieving state-of-the-art performance.
Details
Motivation: Training Vision-Language Action (VLA) models from scratch is costly, and existing methods need to integrate action capabilities into pretrained vision-language models more efficiently.
Method: Two-stage training: 1) lightweight alignment mapping VLM hidden states to action space of small action model, 2) selective fine-tuning of language model, state encoder, and action modules. Architecture adds only an action token and state encoder to original VLM structure.
Result: Achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). Real-world experiments show 82.0% success rate (17% improvement over teacher model) across five manipulation tasks.
Conclusion: Action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs, outperforming previous state-of-the-art methods.
Abstract: Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves a 97.3% average success rate on LIBERO (an 11.8% improvement) and 93.5% on LIBERO-LONG (a 24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (a 17% improvement), which demonstrates that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
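The stage-1 alignment admits a compact sketch: a small projection maps the VLM's action-token hidden state into the latent space of the pretrained action decoder, trained by regression against the teacher's latents. Dimensions and names below are illustrative assumptions, not the released architecture.

```python
# Sketch of stage-1 alignment with illustrative dimensions.
import torch
import torch.nn as nn

vlm_dim, action_latent_dim = 4096, 256
align = nn.Sequential(
    nn.Linear(vlm_dim, 1024),
    nn.GELU(),
    nn.Linear(1024, action_latent_dim),
)

def alignment_loss(vlm_action_hidden, teacher_latent):
    """Match the projected VLM state to the small action model's latent."""
    return nn.functional.mse_loss(align(vlm_action_hidden), teacher_latent)

loss = alignment_loss(torch.randn(8, vlm_dim),
                      torch.randn(8, action_latent_dim))
loss.backward()
```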
[222] MSCloudCAM: Cross-Attention with Multi-Scale Context for Multispectral Cloud Segmentation
Md Abdullah Al Mazid, Liangdong Deng, Naphtali Rishe
Main category: cs.CV
TL;DR: MSCloudCAM is a novel cross-attention network for multispectral cloud segmentation that achieves state-of-the-art performance on Sentinel-2 and Landsat-8 datasets by combining Swin Transformer with multi-scale context modules and attention mechanisms.
Details
Motivation: Clouds significantly obstruct optical satellite imagery analysis for environmental monitoring, land cover mapping, and climate research, creating a critical need for reliable cloud segmentation methods.
Method: Proposes MSCloudCAM framework using Swin Transformer backbone with multi-scale context modules (ASPP and PSP), Cross-Attention block for multisensor feature fusion, and attention mechanisms (ECAB and Spatial Attention) for feature refinement.
Result: Achieves state-of-the-art segmentation accuracy on CloudSEN12 and L8Biome datasets, outperforming leading baseline architectures while maintaining competitive parameter efficiency and computational requirements.
Conclusion: MSCloudCAM demonstrates high effectiveness and practicality for large-scale Earth observation tasks, making it suitable for real-world applications in satellite imagery analysis.
Abstract: Clouds remain a critical challenge in optical satellite imagery, hindering reliable analysis for environmental monitoring, land cover mapping, and climate research. To overcome this, we propose MSCloudCAM, a Cross-Attention with Multi-Scale Context Network tailored for multispectral and multi-sensor cloud segmentation. Our framework exploits the spectral richness of Sentinel-2 (CloudSEN12) and Landsat-8 (L8Biome) data to classify four semantic categories: clear sky, thin cloud, thick cloud, and cloud shadow. MSCloudCAM combines a Swin Transformer backbone for hierarchical feature extraction with multi-scale context modules ASPP and PSP for enhanced scale-aware learning. A Cross-Attention block enables effective multisensor and multispectral feature fusion, while the integration of an Efficient Channel Attention Block (ECAB) and a Spatial Attention Module adaptively refine feature representations. Comprehensive experiments on CloudSEN12 and L8Biome demonstrate that MSCloudCAM delivers state-of-the-art segmentation accuracy, surpassing leading baseline architectures while maintaining competitive parameter efficiency and FLOPs. These results underscore the model’s effectiveness and practicality, making it well-suited for large-scale Earth observation tasks and real-world applications.
[223] FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin
Main category: cs.CV
TL;DR: FG-CLIP 2 is a bilingual vision-language model that improves fine-grained alignment between visual content and linguistic descriptions for both English and Chinese, addressing limitations in current models.
Details
Motivation: Current vision-language models like CLIP have limited fine-grained alignment capabilities, particularly for object attributes, spatial relations, and bilingual comprehension, especially in non-English settings.
Method: Leverages fine-grained supervision (region-text matching, long-caption modeling), multiple discriminative objectives, and introduces Textual Intra-modal Contrastive (TIC) loss to distinguish semantically similar captions. Trained on curated English and Chinese data.
Result: Achieves state-of-the-art performance on 29 datasets across 8 tasks, with powerful bilingual capabilities. Introduces new Chinese multimodal benchmark for evaluation.
Conclusion: FG-CLIP 2 advances fine-grained bilingual vision-language understanding and provides resources (model, code, benchmark) to support future research in this area.
Abstract: Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
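As a rough sketch of what a textual intra-modal contrastive (TIC) objective could look like, the snippet below applies InfoNCE over caption embeddings, pairing each caption with a paraphrase and treating the rest of the batch (including semantically similar captions) as negatives. The pairing scheme is our assumption, not the paper's exact recipe.

```python
# InfoNCE-style intra-modal contrastive sketch; pairing is assumed.
import torch
import torch.nn.functional as F

def tic_loss(anchors, positives, temperature=0.07):
    """anchors, positives: (N, d) caption embeddings, row i paired."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature               # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

loss = tic_loss(torch.randn(32, 512), torch.randn(32, 512))
```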
[224] Exploring and Leveraging Class Vectors for Classifier Editing
Jaeik Kim, Jaeyoung Do
Main category: cs.CV
TL;DR: Class Vectors enable flexible post-hoc editing of image classifiers by capturing class-specific representation adjustments during fine-tuning, supporting applications like unlearning, adaptation, and adversarial defense.
Details
Motivation: Existing classifier editing methods are limited - they either focus narrowly on error correction or require extensive retraining, creating a bottleneck for flexible editing in image classification.
Method: Introduce Class Vectors that capture class-specific representation adjustments during fine-tuning, disentangling each class’s adaptation in latent space. These vectors can steer latent features or map to weight space to update decision boundaries.
Result: Class Vectors capture each class’s semantic shift and demonstrate inherent linearity and orthogonality, supporting efficient concept editing via simple class arithmetic.
Conclusion: Class Vectors provide a practical solution for flexible classifier editing, validated in applications including unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.
Abstract: Image classifiers play a critical role in detecting diseases in medical imaging and identifying anomalies in manufacturing processes. However, their predefined behaviors after extensive training make post hoc model editing difficult, especially when it comes to forgetting specific classes or adapting to distribution shifts. Existing classifier editing methods either focus narrowly on correcting errors or incur extensive retraining costs, creating a bottleneck for flexible editing. Moreover, such editing has seen limited investigation in image classification. To overcome these challenges, we introduce Class Vectors, which capture class-specific representation adjustments during fine-tuning. Whereas task vectors encode task-level changes in weight space, Class Vectors disentangle each class’s adaptation in the latent space. We show that Class Vectors capture each class’s semantic shift and that classifier editing can be achieved either by steering latent features along these vectors or by mapping them into weight space to update the decision boundaries. We also demonstrate that the inherent linearity and orthogonality of Class Vectors support efficient, flexible, and high-level concept editing via simple class arithmetic. Finally, we validate their utility in applications such as unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.
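The class-arithmetic view suggests a very small sketch: a Class Vector as the per-class mean latent shift across fine-tuning, reusable to steer features. The steering rule and coefficient below are illustrative, not the paper's exact editing procedure.

```python
# Illustrative Class Vector computation and feature steering.
import torch

def class_vector(feats_pre: torch.Tensor, feats_post: torch.Tensor):
    """Mean latent shift of one class across fine-tuning; both (N, d)."""
    return feats_post.mean(dim=0) - feats_pre.mean(dim=0)

def edit_features(z: torch.Tensor, v: torch.Tensor, alpha: float = -1.0):
    """Steer latents along a class vector; alpha < 0 moves away from the
    class (a simple unlearning-style edit)."""
    return z + alpha * v

v_cat = class_vector(torch.randn(100, 512), torch.randn(100, 512))
z_edited = edit_features(torch.randn(8, 512), v_cat, alpha=-1.0)
```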
[225] BSGS: Bi-stage 3D Gaussian Splatting for Camera Motion Deblurring
An Zhao, Piaopiao Yu, Zhe Zhu, Mingqiang Wei
Main category: cs.CV
TL;DR: Bi-Stage 3D Gaussian Splatting (BSGS) is a novel framework that reconstructs high-quality 3D scenes from motion-blurred images using a two-stage optimization approach with camera pose refinement and global rigid transformation.
Details
Motivation: Existing 3DGS-based deblurring methods have limitations including extreme dependence on accurate camera poses and inability to control erroneous Gaussian primitive densification caused by motion blur.
Method: Two-stage approach: 1) Camera Pose Refinement to reduce motion-induced distortions, 2) Global Rigid Transformation with fixed rough poses to correct blur distortions. Includes subframe gradient aggregation and space-time bi-stage optimization for dynamic threshold adjustment.
Result: Comprehensive experiments show the method effectively reconstructs 3D scenes from motion-blurred images and outperforms state-of-the-art methods.
Conclusion: BSGS successfully addresses motion blur challenges in 3D scene reconstruction and demonstrates superior performance compared to existing approaches.
Abstract: 3D Gaussian Splatting has exhibited remarkable capabilities in 3D scene reconstruction. However, reconstructing high-quality 3D scenes from motion-blurred images caused by camera motion poses a significant challenge. The performance of existing 3DGS-based deblurring methods is limited by their inherent mechanisms, such as extreme dependence on the accuracy of camera poses and inability to effectively control the erroneous densification of Gaussian primitives caused by motion blur. To solve these problems, we introduce a novel framework, Bi-Stage 3D Gaussian Splatting, to accurately reconstruct 3D scenes from motion-blurred images. BSGS contains two stages. First, Camera Pose Refinement roughly optimizes camera poses to reduce motion-induced distortions. Second, with the rough camera poses fixed, Global Rigid Transformation further corrects motion-induced blur distortions. To alleviate multi-subframe gradient conflicts, we propose a subframe gradient aggregation strategy to optimize both stages. Furthermore, a space-time bi-stage optimization strategy is introduced to dynamically adjust primitive densification thresholds and prevent premature noisy Gaussian generation in blurred regions. Comprehensive experiments verify the effectiveness of our proposed deblurring method and show its superiority over the state of the art. Our source code is available at https://github.com/wsxujm/bsgs
[226] CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models
Denis Rychkovskiy
Main category: cs.CV
TL;DR: CADE 2.5 is a sampler-level guidance stack for SD/SDXL latent diffusion models that improves image quality through frequency-decoupled guidance, energy rescaling, and zero-projection, with additional stabilization via QSilk Micrograin Stabilizer.
Details
Motivation: To enhance sharpness, prompt adherence, and artifact control in SD/SDXL models without requiring retraining, by developing improved guidance mechanisms during the sampling process.
Method: Uses ZeResFDG module combining frequency-decoupled guidance, energy rescaling, and zero-projection; includes spectral EMA for mode switching and QSilk Micrograin Stabilizer for inference-time stabilization with micro-detail injection.
Result: Improves sharpness, prompt adherence, and artifact control across SD/SDXL samplers at moderate guidance scales; provides natural high-frequency micro-texture at high resolutions with negligible overhead.
Conclusion: CADE 2.5 successfully enhances SD/SDXL model performance through novel guidance techniques and stabilization methods, offering training-free improvements in image quality and robustness.
Abstract: We introduce CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The central module, ZeResFDG, unifies (i) frequency-decoupled guidance that reweights low- and high-frequency components of the guidance signal, (ii) energy rescaling that matches the per-sample magnitude of the guided prediction to the positive branch, and (iii) zero-projection that removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative and a detail-seeking mode as structure crystallizes during sampling. Across SD/SDXL samplers, ZeResFDG improves sharpness, prompt adherence, and artifact control at moderate guidance scales without any retraining. In addition, we employ a training-free inference-time stabilizer, QSilk Micrograin Stabilizer (quantile clamp + depth/edge-gated micro-detail injection), which improves robustness and yields natural high-frequency micro-texture at high resolutions with negligible overhead. For completeness we note that the same rule is compatible with alternative parameterizations (e.g., velocity), which we briefly discuss in the Appendix; however, this paper focuses on SD/SDXL latent diffusion models.
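An interpretive sketch of the three ZeResFDG ingredients can be written on top of standard classifier-free guidance. Everything below is a reading of the abstract rather than the released implementation: the Gaussian blur as low-pass filter, the guidance weights, and the choice to project the guided prediction are all assumptions.

```python
# Interpretive ZeResFDG-style guidance sketch, not the released code.
import torch
import torch.nn.functional as F

def lowpass(x, k=9, sigma=3.0):
    """Depthwise Gaussian blur as a cheap low-frequency extractor."""
    ax = (torch.arange(k) - k // 2).float()
    g = torch.exp(-ax ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).to(x)
    kern = (g[:, None] * g[None, :]).expand(x.shape[1], 1, k, k).contiguous()
    return F.conv2d(x, kern, padding=k // 2, groups=x.shape[1])

def zeresfdg(eps_cond, eps_uncond, w_low=3.0, w_high=7.5):
    d = eps_cond - eps_uncond
    d_low = lowpass(d)
    # Frequency-decoupled guidance: separate low/high-frequency weights.
    guided = eps_uncond + w_low * d_low + w_high * (d - d_low)
    # Zero-projection: remove the component parallel to eps_uncond.
    u, g = eps_uncond.flatten(1), guided.flatten(1)
    coef = (g * u).sum(1, keepdim=True) / (u * u).sum(1, keepdim=True).clamp_min(1e-8)
    g = g - coef * u
    # Energy rescaling: match the per-sample norm of the positive branch.
    scale = eps_cond.flatten(1).norm(dim=1, keepdim=True)
    g = g * scale / g.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return g.view_as(guided)

out = zeresfdg(torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32))
```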
[227] Scope: Selective Cross-modal Orchestration of Visual Perception Experts
Tianyu Zhang, Suyuchen Wang, Chao Wang, Juan Rodriguez, Ahmed Masry, Xiangru Jian, Yoshua Bengio, Perouz Taslakian
Main category: cs.CV
TL;DR: SCOPE is a Mixture-of-Encoders framework that dynamically selects one specialized vision encoder per image-text pair using instance-level routing, outperforming models that use all encoders simultaneously while reducing computation by 24-49%.
Details
Motivation: Vision-language models benefit from multiple vision encoders, but naive stacking yields diminishing returns while multiplying inference costs. The goal is to achieve better performance with less computation through intelligent encoder selection.
Method: SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. Training uses dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence.
Result: SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49%. This demonstrates that intelligent encoder selection beats brute-force aggregation.
Conclusion: SCOPE challenges the prevailing paradigm in multi-encoder VLMs by showing that dynamic encoder selection through instance-level routing is more effective than using all encoders simultaneously, achieving better performance with significantly reduced computation.
Abstract: Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
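The instance-level router admits a compact sketch: one cross-attention layer in which text tokens attend to the shared encoder's visual features, pooled and mapped to logits over the routed encoders. Sizes and pooling below are illustrative; training would add the dual entropy regularizers described above.

```python
# Illustrative instance-level encoder router in the spirit of SCOPE.
import torch
import torch.nn as nn

class EncoderRouter(nn.Module):
    def __init__(self, dim=768, num_routed=4, num_heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_routed)

    def forward(self, text_tokens, shared_visual):
        # Text queries attend to the shared encoder's visual features.
        fused, _ = self.xattn(text_tokens, shared_visual, shared_visual)
        logits = self.head(fused.mean(dim=1))       # (B, num_routed)
        return logits.argmax(dim=-1), logits        # chosen encoder, logits

router = EncoderRouter()
choice, logits = router(torch.randn(2, 16, 768), torch.randn(2, 196, 768))
```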
[228] CoDS: Enhancing Collaborative Perception in Heterogeneous Scenarios via Domain Separation
Yushan Han, Hui Zhang, Honglei Zhang, Chuntao Ding, Yuanzhouhan Cao, Yidong Li
Main category: cs.CV
TL;DR: CoDS is a collaborative perception method that addresses feature discrepancies in heterogeneous autonomous driving scenarios using domain separation and lightweight feature alignment modules.
Details
Motivation: Existing collaborative perception methods assume identical encoders for all agents, which doesn't hold in real-world heterogeneous deployments. Current approaches are vulnerable to noise from domain gaps and use inefficient transformer-based modules.
Method: Proposes CoDS with two feature alignment modules: Lightweight Spatial-Channel Resizer (LSCR) for spatial and channel alignment, and Distribution Alignment via Domain Separation (DADS) with encoder-specific and encoder-agnostic modules. Uses Domain Alignment Mutual Information (DAMI) loss for training.
Result: CoDS effectively mitigates feature discrepancies in heterogeneous scenarios and achieves a trade-off between detection accuracy and inference efficiency through its fully convolutional architecture.
Conclusion: The proposed CoDS method successfully addresses feature discrepancies in heterogeneous collaborative perception while maintaining high inference efficiency, making it suitable for real-world autonomous driving applications.
Abstract: Collaborative perception has been proven to improve individual perception in autonomous driving through multi-agent interaction. Nevertheless, most methods often assume identical encoders for all agents, which does not hold true when these models are deployed in real-world applications. To realize collaborative perception in actual heterogeneous scenarios, existing methods usually align neighbor features to those of the ego vehicle, which is vulnerable to noise from domain gaps and thus fails to address feature discrepancies effectively. Moreover, they adopt transformer-based modules for domain adaptation, which causes the model inference inefficiency on mobile devices. To tackle these issues, we propose CoDS, a Collaborative perception method that leverages Domain Separation to address feature discrepancies in heterogeneous scenarios. The CoDS employs two feature alignment modules, i.e., Lightweight Spatial-Channel Resizer (LSCR) and Distribution Alignment via Domain Separation (DADS). Besides, it utilizes the Domain Alignment Mutual Information (DAMI) loss to ensure effective feature alignment. Specifically, the LSCR aligns the neighbor feature across spatial and channel dimensions using a lightweight convolutional layer. Subsequently, the DADS mitigates feature distribution discrepancy with encoder-specific and encoder-agnostic domain separation modules. The former removes domain-dependent information and the latter captures task-related information. During training, the DAMI loss maximizes the mutual information between aligned heterogeneous features to enhance the domain separation process. The CoDS employs a fully convolutional architecture, which ensures high inference efficiency. Extensive experiments demonstrate that the CoDS effectively mitigates feature discrepancies in heterogeneous scenarios and achieves a trade-off between detection accuracy and inference efficiency.
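The LSCR module as described (spatial and channel alignment with a lightweight convolutional layer) can be approximated in a few lines; the interpolation mode and 1x1 projection below are our assumptions, not the paper's exact design.

```python
# Minimal LSCR-style aligner under stated assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSCR(nn.Module):
    def __init__(self, in_ch, out_ch, out_hw):
        super().__init__()
        self.out_hw = out_hw                       # target (H, W)
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, neighbor_feat):
        # Resize spatially, then project channels to the ego layout.
        x = F.interpolate(neighbor_feat, size=self.out_hw,
                          mode="bilinear", align_corners=False)
        return self.proj(x)

aligned = LSCR(64, 128, (50, 176))(torch.randn(1, 64, 40, 140))
```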
[229] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images
Emanuel Garbin, Guy Adam, Oded Krams, Zohar Barzelay, Eran Guendelman, Michael Schwarz, Matteo Presutto, Moran Vatelmacher, Yigal Shenkman, Eli Peker, Itai Druker, Uri Patish, Yoav Blum, Max Bluvstein, Junxuan Li, Rawal Khirodkar, Shunsuke Saito
Main category: cs.CV
TL;DR: A zero-shot pipeline for creating hyperrealistic 3D avatars from unstructured phone images using generative canonicalization and transformer-based modeling trained on high-fidelity Gaussian splatting avatars.
Details
Motivation: Existing methods have limitations: single-view approaches suffer from geometric inconsistencies and hallucinations, while synthetic data-trained models fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism.
Method: Introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized representation, and (2) a transformer-based model trained on a large-scale dataset of high-fidelity Gaussian splatting avatars from dome captures of real people.
Result: The “Capture, Canonicalize, Splat” pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
Conclusion: The method successfully addresses challenges in existing approaches by combining generative canonicalization with transformer-based modeling on real-world capture data, achieving hyperrealistic and identity-preserving 3D avatars from minimal input.
Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This “Capture, Canonicalize, Splat” pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
[230] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
Xiangyu Meng, Zixian Zhang, Zhenghao Zhang, Junchao Liao, Long Qin, Weizhi Wang
Main category: cs.CV
TL;DR: Identity-GRPO is a human feedback-driven optimization pipeline that improves multi-human identity preservation in video generation, achieving up to 18.9% improvement in human consistency metrics over baseline methods.
Details
Motivation: Advanced video generation methods like VACE and Phantom struggle with preserving consistent identities across multiple human characters in dynamic interactions, which is critical for realistic video generation.
Method: Proposes Identity-GRPO pipeline: 1) Constructs video reward model trained on large-scale preference dataset with human-annotated and synthetic distortion data, 2) Employs GRPO variant tailored for multi-human consistency optimization, 3) Conducts extensive ablation studies on annotation quality and design choices.
Result: Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, and greatly enhances both VACE and Phantom video generation systems.
Conclusion: The method offers actionable insights for aligning reinforcement learning with personalized video generation and effectively addresses the multi-human identity preservation challenge.
Abstract: While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
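The group-relative advantage at the core of GRPO-style optimization is simple to sketch: several videos are sampled per prompt, scored by the consistency reward model, and each sample's advantage is its reward standardized within its group. The snippet is generic GRPO bookkeeping, not the paper's full pipeline.

```python
# Generic group-relative advantage computation (GRPO-style).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, group_size) reward-model scores.
    Returns per-sample advantages normalized within each group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

adv = grpo_advantages(torch.tensor([[0.8, 0.5, 0.9],
                                    [0.2, 0.4, 0.3]]))
```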
[231] Vision-Centric Activation and Coordination for Multimodal Large Language Models
Yunnan Wang, Fan Lu, Kecheng Zheng, Ziyuan Huang, Ziqiang Li, Wenjun Zeng, Xin Jin
Main category: cs.CV
TL;DR: VaCo enhances MLLMs by integrating vision-centric supervision from multiple vision foundation models through visual discriminative alignment, improving multimodal comprehension capabilities.
Details
Motivation: Mainstream MLLMs rely solely on next-token text prediction, neglecting crucial vision-centric information needed for analytical abilities, creating a gap in visual comprehension.
Method: Introduces VaCo with Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) to activate visual signals supervised by VFMs, plus Token Gateway Mask (TGM) to coordinate representation conflicts across multiple VFMs.
Result: Extensive experiments show VaCo significantly improves performance of different MLLMs on various benchmarks, demonstrating superior visual comprehension capabilities.
Conclusion: VaCo successfully addresses the vision-centric information gap in MLLMs by coordinating multiple VFMs, enabling better multimodal understanding through unified optimization of textual and visual outputs.
Abstract: Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
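The Token Gateway Mask, as described, restricts information flow among groups of MTQs; a plausible minimal realization is a block-diagonal attention mask, sketched below. The group structure and mask convention are our assumptions.

```python
# Speculative block-diagonal gateway mask; True marks a blocked position,
# matching the attn_mask convention of torch.nn.MultiheadAttention.
import torch

def token_gateway_mask(queries_per_group: int, num_groups: int):
    n = queries_per_group * num_groups
    mask = torch.ones(n, n, dtype=torch.bool)      # start fully blocked
    for g in range(num_groups):
        s = g * queries_per_group
        mask[s:s + queries_per_group, s:s + queries_per_group] = False
    return mask                                    # pass as attn_mask

tgm = token_gateway_mask(queries_per_group=4, num_groups=3)
```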
[232] DCMIL: A Progressive Representation Learning of Whole Slide Images for Cancer Prognosis Analysis
Chao Tu, Kun Huang, Jie Zhang, Qianjin Feng, Yu Zhang, Zhenyuan Ning
Main category: cs.CV
TL;DR: DCMIL is a dual-curriculum contrastive multi-instance learning method for computational pathology that processes whole slide images without dense annotations, outperforming standard prognostic models across 12 cancer types.
Details
Motivation: Progress in computational pathology is hindered by computational bottlenecks from gigapixel-size inputs and scarcity of dense manual annotations, while current methods overlook fine-grained multi-magnification information and tumor microenvironment variations.
Method: Easy-to-hard progressive representation learning using dual-curriculum contrastive multi-instance learning (DCMIL) that transforms gigapixel-size WSIs directly into outcome predictions without dense annotations.
Result: Extensive experiments on 12 cancer types (5,954 patients, 12.54 million tiles) show DCMIL outperforms standard WSI-based prognostic models, identifies fine-grained prognosis-salient regions, provides robust uncertainty estimation, and captures morphological differences between normal and tumor tissues.
Conclusion: DCMIL enables efficient processing of WSIs for cancer prognosis without dense annotations, demonstrates superior performance across multiple cancer types, and has potential to generate new biological insights through morphological analysis.
Abstract: The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic modes for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All codes have been made publicly accessible at https://github.com/tuuuc/DCMIL.
[233] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma
Main category: cs.CV
TL;DR: PaddleOCR-VL is a state-of-the-art, resource-efficient vision-language model for document parsing that achieves top performance while supporting 109 languages and recognizing complex elements like text, tables, formulas, and charts.
Details
Motivation: To create an efficient and powerful document parsing model that can handle multiple languages and complex document elements while maintaining minimal resource consumption for practical real-world deployment.
Method: Developed PaddleOCR-VL-0.9B, a compact vision-language model that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model for accurate element recognition.
Result: Achieves SOTA performance on public and in-house benchmarks for both page-level document parsing and element-level recognition, significantly outperforms existing solutions, and delivers fast inference speeds with strong competitiveness against top-tier VLMs.
Conclusion: PaddleOCR-VL is highly suitable for practical deployment in real-world scenarios due to its superior performance, efficiency, and multilingual capabilities.
Abstract: In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios. Code is available at https://github.com/PaddlePaddle/PaddleOCR .
cs.AI
[234] OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas
Main category: cs.AI
TL;DR: OpenEstimate is a benchmark for evaluating language models on numerical estimation tasks requiring probabilistic reasoning under uncertainty, revealing that current models produce inaccurate and overconfident priors.
Details
Motivation: There's a gap in LM evaluation - most benchmarks focus on well-defined problems, but real-world applications require reasoning under uncertainty with incomplete information. Current evaluations don't adequately test this capability.
Method: Developed OpenEstimate, an extensible multi-domain benchmark for numerical estimation tasks where models must synthesize background information and express predictions as probabilistic priors. Evaluated six frontier LMs on accuracy, calibration, and usefulness relative to true distributions.
Result: LM-elicited priors were often inaccurate and overconfident. Performance improved modestly with different uncertainty elicitation methods but was largely unaffected by sampling strategy, reasoning effort, or prompt design changes.
Conclusion: OpenEstimate provides a challenging evaluation platform for developing LMs that are better at probabilistic estimation and reasoning under uncertainty, addressing a critical gap in current LM capabilities.
Abstract: Real-world settings where language models (LMs) are deployed – in domains spanning healthcare, finance, and other forms of knowledge work – require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers, but which humans can answer reliably. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark for evaluating LMs on numerical estimation tasks that require models to synthesize significant amounts of background information and express predictions as probabilistic priors. We assess these priors for accuracy and calibration, quantifying their usefulness relative to samples from the true distribution of interest. Across six frontier LMs, we find that LM-elicited priors are often inaccurate and overconfident. Performance improves modestly depending on how uncertainty is elicited from the model, but is largely unaffected by changes in sampling strategy, reasoning effort, or prompt design. The OpenEstimate benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
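Scoring an elicited prior can be illustrated briefly, assuming the model returns a Gaussian (mu, sigma) for the target quantity; overconfidence then shows up as low coverage. The metrics below are generic stand-ins, not necessarily the benchmark's exact choices.

```python
# Generic prior-scoring sketch: NLL and central-interval coverage.
import numpy as np
from scipy import stats

def score_prior(mu, sigma, true_samples, level=0.9):
    nll = -np.mean(stats.norm.logpdf(true_samples, mu, sigma))
    lo, hi = stats.norm.interval(level, loc=mu, scale=sigma)
    coverage = np.mean((true_samples >= lo) & (true_samples <= hi))
    return nll, coverage  # overconfident priors yield coverage < level

nll, cov = score_prior(10.0, 2.0, np.random.normal(12, 3, 1000))
```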
[235] Procedural Game Level Design with Deep Reinforcement Learning
Miraç Buğra Özkan
Main category: cs.AI
TL;DR: A novel method for procedural level design using Deep Reinforcement Learning (DRL) in Unity 3D environment with two agents: a hummingbird that solves levels and an island that generates flower layouts.
Details
Motivation: To enable dynamic, replayable, and scalable game environments with reduced manual effort through procedural content generation using AI.
Method: Two-agent system using Proximal Policy Optimization (PPO): hummingbird agent learns to navigate and collect flowers, while island agent learns to generate flower layouts based on obstacles, initial state, and performance feedback.
Result: The approach produces effective agent behavior and enables robust generalization across environmental configurations, demonstrating emergent behavior between agents.
Conclusion: DRL enables intelligent agents to both generate and solve content in virtual environments, opening new opportunities for autonomous game level design and pushing AI’s creative contributions to game development.
Abstract: Procedural content generation (PCG) has become an increasingly popular technique in game development, allowing developers to generate dynamic, replayable, and scalable environments with reduced manual effort. In this study, a novel method for procedural level design using Deep Reinforcement Learning (DRL) within a Unity-based 3D environment is proposed. The system comprises two agents: a hummingbird agent, acting as a solver, and a floating island agent, responsible for generating and placing collectible objects (flowers) on the terrain in a realistic and context-aware manner. The hummingbird is trained using the Proximal Policy Optimization (PPO) algorithm from the Unity ML-Agents toolkit. It learns to navigate through the terrain efficiently, locate flowers, and collect them while adapting to the ever-changing procedural layout of the island. The island agent is also trained using the Proximal Policy Optimization (PPO) algorithm. It learns to generate flower layouts based on observed obstacle positions, the hummingbird’s initial state, and performance feedback from previous episodes. The interaction between these agents leads to emergent behavior and robust generalization across various environmental configurations. The results demonstrate that the approach not only produces effective and efficient agent behavior but also opens up new opportunities for autonomous game level design driven by machine learning. This work highlights the potential of DRL in enabling intelligent agents to both generate and solve content in virtual environments, pushing the boundaries of what AI can contribute to creative game development processes.
[236] Towards Error Centric Intelligence I, Beyond Observational Learning
Marcus A. Thomas
Main category: cs.AI
TL;DR: AGI progress is limited by theory, not data or scale. The paper challenges observational learning and proposes Causal Mechanics with error-centric principles for hypothesis space expansion.
Details
Motivation: To address the limitation that observational adequacy alone cannot guarantee interventional competence, since observationally equivalent worlds can diverge under interventions.
Method: Proposes Causal Mechanics, a mechanisms-first program with hypothesis-space change as a first-class operation. Introduces structural principles: the Locality and Autonomy Principle, Independent Causal Mechanisms, and the Compositional Autonomy Principle.
Result: Develops a framework for converting unreachable errors into reachable ones and correcting them through conjecture and criticism.
Conclusion: AGI requires moving beyond observational learning to an error-centric approach that enables systematic hypothesis space expansion and error correction through causal reasoning.
Abstract: We argue that progress toward AGI is theory-limited rather than data- or scale-limited. Building on the critical rationalism of Popper and Deutsch, we challenge the Platonic Representation Hypothesis. Observationally equivalent worlds can diverge under interventions, so observational adequacy alone cannot guarantee interventional competence. We begin by laying foundations: definitions of knowledge, learning, intelligence, counterfactual competence, and AGI, and then analyze the limits of observational learning that motivate an error-centric shift. We recast the problem as three questions about how explicit and implicit errors evolve under an agent’s actions, which errors are unreachable within a fixed hypothesis space, and how conjecture and criticism expand that space. From these questions we propose Causal Mechanics, a mechanisms-first program in which hypothesis-space change is a first-class operation and probabilistic structure is used when useful rather than presumed. We advance structural principles that make error discovery and correction tractable, including a differential Locality and Autonomy Principle for modular interventions, a gauge-invariant form of Independent Causal Mechanisms for separability, and the Compositional Autonomy Principle for analogy preservation, together with actionable diagnostics. The aim is a scaffold for systems that can convert unreachable errors into reachable ones and correct them.
[237] Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation
Ed Li, Junyu Ren, Xintian Pan, Cat Yan, Chuanhao Li, Dirk Bergemann, Zhuoran Yang
Main category: cs.AI
TL;DR: An open-source multiagent framework called freephdlabor that enables fully dynamic workflows through real-time agent reasoning and modular architecture for customizable automated scientific research.
Details
Motivation: Existing agentic systems for science have rigid workflows that cannot adapt to intermediate findings and inadequate context management that hinders long-horizon research.
Method: Developed a multiagent framework with fully dynamic workflows determined by real-time agent reasoning, modular architecture for customization, automatic context compaction, workspace-based communication, memory persistence, and non-blocking human intervention mechanisms.
Result: The framework transforms automated research from isolated single-run attempts into continual research programs that build systematically on prior explorations and incorporate human feedback.
Conclusion: This work provides both architectural principles and practical implementation for building customizable co-scientist systems to facilitate broader adoption of automated research across scientific domains.
Abstract: The automation of scientific discovery represents a critical milestone in Artificial Intelligence (AI) research. However, existing agentic systems for science suffer from two fundamental limitations: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management that hinders long-horizon research. We present freephdlabor, an open-source multiagent framework featuring fully dynamic workflows determined by real-time agent reasoning and a modular architecture enabling seamless customization – users can modify, add, or remove agents to address domain-specific requirements. The framework provides comprehensive infrastructure including automatic context compaction, workspace-based communication to prevent information degradation, memory persistence across sessions, and non-blocking human intervention mechanisms. These features collectively transform automated research from isolated, single-run attempts into continual research programs that build systematically on prior explorations and incorporate human feedback. By providing both the architectural principles and practical implementation for building customizable co-scientist systems, this work aims to facilitate broader adoption of automated research across scientific domains, enabling practitioners to deploy interactive multiagent systems that autonomously conduct end-to-end research – from ideation through experimentation to publication-ready manuscripts.
[238] HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks
Chance Jiajie Li, Zhenze Mo, Yuhan Tang, Ao Qu, Jiayi Wu, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Hang Jiang, Paul Pu Liang, Jinhua Zhao, Luis Alberto Alonso Pastor, Kent Larson
Main category: cs.AI
TL;DR: HugAgent is a benchmark for evaluating how well AI models can adapt from population-level reasoning to individual human reasoning styles and belief trajectories.
Details
Motivation: Current large language models capture population-level consensus but erase individual reasoning styles and belief evolution, limiting their ability to simulate truly human-like reasoning.
Method: Dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid “out-loud” reasoning data. Models predict how specific individuals would reason in novel scenarios given their past views.
Result: Experiments with state-of-the-art LLMs reveal persistent adaptation gaps in capturing individual reasoning evolution, not just beliefs.
Conclusion: HugAgent serves as the first extensible benchmark for aligning machine reasoning with the individuality of human thought, enabling evaluation of intra-agent fidelity.
Abstract: Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, “out-loud” reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).
[239] AURA: An Agent Autonomy Risk Assessment Framework
Lorenzo Satta Chiris, Ayush Mishra
Main category: cs.AI
TL;DR: AURA is a unified framework for detecting, quantifying, and mitigating risks in autonomous AI agents using a gamma-based risk scoring methodology with Human-in-the-Loop oversight.
Details
Motivation: Address persistent challenges in alignment, governance, and risk management that threaten large-scale deployment of autonomous agentic AI systems in organizations.
Method: Introduces a gamma-based risk scoring methodology that balances accuracy with computational efficiency, with interactive processes for scoring and mitigating risks across single or multiple AI agents operating synchronously or asynchronously.
Result: Provides robust risk detection and mitigation while balancing computational resources, enabling seamless integration with agentic systems through Agent-to-Human communication mechanisms and interoperability with established protocols (MCP and A2A).
Conclusion: AURA positions itself as a critical enabler for large-scale, governable agentic AI in enterprise environments by supporting responsible and transparent adoption of agentic AI systems.
Abstract: As autonomous agentic AI systems see increasing adoption across organisations, persistent challenges in alignment, governance, and risk management threaten to impede deployment at scale. We present AURA (Agent aUtonomy Risk Assessment), a unified framework designed to detect, quantify, and mitigate risks arising from agentic AI. Building on recent research and practical deployments, AURA introduces a gamma-based risk scoring methodology that balances risk assessment accuracy with computational efficiency and practical considerations. AURA provides an interactive process to score, evaluate and mitigate the risks of running one or multiple AI Agents, synchronously or asynchronously (autonomously). The framework is engineered for Human-in-the-Loop (HITL) oversight and presents Agent-to-Human (A2H) communication mechanisms, allowing for seamless integration with agentic systems for autonomous self-assessment, rendering it interoperable with established protocols (MCP and A2A) and tools. AURA supports a responsible and transparent adoption of agentic AI and provides robust risk detection and mitigation while balancing computational resources, positioning it as a critical enabler for large-scale, governable agentic AI in enterprise environments.
[240] WELD: A Large-Scale Longitudinal Dataset of Emotional Dynamics for Ubiquitous Affective Computing
Xiao Sun
Main category: cs.AI
TL;DR: This paper presents the largest longitudinal workplace emotion dataset with 733,651 facial expression records from 38 employees over 30.5 months, enabling research in emotion recognition, affective dynamics, and turnover prediction.
Details
Motivation: Automated emotion recognition in real-world workplace settings is challenging due to the scarcity of large-scale, longitudinal datasets collected in naturalistic environments.
Method: Collected 733,651 facial expression records from 38 employees over 30.5 months using deep learning-based facial expression recognition, with comprehensive metadata including job roles, employment outcomes, and personality traits.
Result: Technical validation showed high data quality, with successful replication of known psychological patterns (weekend effect: +192% valence improvement, p < 0.001) and perfect predictive validity for employee turnover (AUC = 1.0). Baseline models achieved 91.2% accuracy for emotion classification and R² = 0.84 for valence prediction.
Conclusion: This is the largest and longest longitudinal workplace emotion dataset publicly available, enabling research in emotion recognition, affective dynamics modeling, emotional contagion, turnover prediction, and emotion-aware system design.
Abstract: Automated emotion recognition in real-world workplace settings remains a challenging problem in affective computing due to the scarcity of large-scale, longitudinal datasets collected in naturalistic environments. We present a novel dataset comprising 733,651 facial expression records from 38 employees collected over 30.5 months (November 2021 to May 2024) in an authentic office environment. Each record contains seven emotion probabilities (neutral, happy, sad, surprised, fear, disgusted, angry) derived from deep learning-based facial expression recognition, along with comprehensive metadata including job roles, employment outcomes, and personality traits. The dataset uniquely spans the COVID-19 pandemic period, capturing emotional responses to major societal events including the Shanghai lockdown and policy changes. We provide 32 extended emotional metrics computed using established affective science methods, including valence, arousal, volatility, predictability, inertia, and emotional contagion strength. Technical validation demonstrates high data quality through successful replication of known psychological patterns (weekend effect: +192% valence improvement, p < 0.001; diurnal rhythm validated) and perfect predictive validity for employee turnover (AUC = 1.0). Baseline experiments using Random Forest and LSTM models achieve 91.2% accuracy for emotion classification and R² = 0.84 for valence prediction. This is the largest and longest longitudinal workplace emotion dataset publicly available, enabling research in emotion recognition, affective dynamics modeling, emotional contagion, turnover prediction, and emotion-aware system design.
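To make the extended metrics concrete, here is a minimal sketch of how volatility, inertia, and a weekend effect could be computed from a daily valence series. The toy series, the choice of weekend indices, and the exact formulas are illustrative assumptions; the dataset's precise definitions are not given in the summary.

```python
import numpy as np

# Toy week of daily mean valence scores (Mon-Sun); indices 3:5 stand in
# for the weekend here purely for illustration.
valence = np.array([0.10, 0.15, 0.05, 0.40, 0.45, 0.12, 0.08])

volatility = valence.std()                              # within-person variability
inertia = np.corrcoef(valence[:-1], valence[1:])[0, 1]  # lag-1 autocorrelation
weekend = valence[3:5].mean() / valence[[0, 1, 2, 5, 6]].mean() - 1

print(f"volatility={volatility:.3f}, inertia={inertia:.3f}, "
      f"weekend effect={weekend:+.0%}")
```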
[241] From Checklists to Clusters: A Homeostatic Account of AGI Evaluation
Brett Reynolds
Main category: cs.AI
TL;DR: AGI evaluations should use asymmetric domain weighting based on causal centrality and test for persistent capabilities rather than snapshot scores, treating general intelligence as a homeostatic property cluster.
Details
Motivation: Current AGI evaluations have two problems: equal domain weighting ignores human intelligence research showing domains have different importance, and snapshot testing can't distinguish durable capabilities from brittle performances.
Method: Proposes two battery-compatible extensions: a centrality-prior score using CHC-derived weights with sensitivity analysis, and a Cluster Stability Index family measuring profile persistence, durable learning, and error correction.
Result: These additions preserve multidomain breadth while reducing brittleness and gaming in AGI evaluations.
Conclusion: General intelligence should be understood as a homeostatic property cluster requiring evidence of persistence, with testable predictions and black-box protocols available for labs to adopt.
Abstract: Contemporary AGI evaluations report multidomain capability profiles, yet they typically assign symmetric weights and rely on snapshot scores. This creates two problems: (i) equal weighting treats all domains as equally important when human intelligence research suggests otherwise, and (ii) snapshot testing can’t distinguish durable capabilities from brittle performances that collapse under delay or stress. I argue that general intelligence – in humans and potentially in machines – is better understood as a homeostatic property cluster: a set of abilities plus the mechanisms that keep those abilities co-present under perturbation. On this view, AGI evaluation should weight domains by their causal centrality (their contribution to cluster stability) and require evidence of persistence across sessions. I propose two battery-compatible extensions: a centrality-prior score that imports CHC-derived weights with transparent sensitivity analysis, and a Cluster Stability Index family that separates profile persistence, durable learning, and error correction. These additions preserve multidomain breadth while reducing brittleness and gaming. I close with testable predictions and black-box protocols labs can adopt without architectural access.
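A minimal sketch of the two proposed extensions, assuming hypothetical domain names, CHC-style weights, and session scores (none of these values come from the paper):

```python
import numpy as np

# Hypothetical CHC-derived centrality weights (sum to 1); the paper's
# actual weights would come from human intelligence research.
weights = {"fluid_reasoning": 0.30, "working_memory": 0.25,
           "long_term_retrieval": 0.25, "processing_speed": 0.20}

def centrality_prior_score(domain_scores: dict) -> float:
    """Weight each domain score by its (assumed) causal centrality."""
    return sum(weights[d] * s for d, s in domain_scores.items())

def profile_persistence(sessions: np.ndarray) -> float:
    """Toy Cluster Stability Index: 1 minus the mean across-session
    standard deviation of the domain profile (higher = more stable)."""
    return float(1.0 - sessions.std(axis=0).mean())

# Three test sessions (rows), columns in the same order as `weights`.
sessions = np.array([[0.80, 0.75, 0.70, 0.90],
                     [0.78, 0.74, 0.72, 0.88],
                     [0.81, 0.73, 0.69, 0.91]])

snapshot = dict(zip(weights, sessions[0]))
print(f"centrality-prior score: {centrality_prior_score(snapshot):.3f}")
print(f"profile persistence:    {profile_persistence(sessions):.3f}")
```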
[242] Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions
Xi Wang, Xianyao Ling, Kun Li, Gang Yin, Liang Zhang, Jiang Wu, Jun Xu, Fu Zhang, Wenbo Lei, Annie Wang, Peng Gong
Main category: cs.AI
TL;DR: A method combining LLM agents and Knowledge Graphs for dynamic multi-dimensional data analysis, enabling real-time KG construction from unstructured data and interactive exploration.
Details
Motivation: Address limitations of LLMs (hallucination, difficulty with real-time updating) and the static nature of KGs by creating a collaborative ecosystem for handling massive, heterogeneous multi-dimensional data.
Method: Uses LLM agents to automatically extract product data from unstructured sources, constructs and visualizes the KG in real time, and provides an interactive platform for deep exploration of graph nodes.
Result: Significant advantages in product ecosystem analysis, relationship mining, and user-driven exploratory analysis compared to traditional approaches.
Conclusion: Provides new ideas and tools for multi-dimensional data analysis through dynamic LLM-KG interaction, offering enhanced analytical capabilities for complex data environments.
Abstract: In the current era of big data, extracting deep insights from massive, heterogeneous, and complexly associated multi-dimensional data has become a significant challenge. Large Language Models (LLMs) perform well in natural language understanding and generation, but still suffer from “hallucination” issues when processing structured knowledge and are difficult to update in real-time. Although Knowledge Graphs (KGs) can explicitly store structured knowledge, their static nature limits dynamic interaction and analytical capabilities. Therefore, this paper proposes a multi-dimensional data analysis method based on the interactions between LLM agents and KGs, constructing a dynamic, collaborative analytical ecosystem. This method utilizes LLM agents to automatically extract product data from unstructured data, constructs and visualizes the KG in real-time, and supports users in deep exploration and analysis of graph nodes through an interactive platform. Experimental results show that this method has significant advantages in product ecosystem analysis, relationship mining, and user-driven exploratory analysis, providing new ideas and tools for multi-dimensional data analysis.
[243] Experience-Driven Exploration for Efficient API-Free AI Agents
Chenwei Tang, Jingyu Xing, Xinyu Liu, Zizhou Wang, Jiawei Du, Liangli Zhen, Jiancheng Lv
Main category: cs.AI
TL;DR: KG-Agent is an experience-driven framework that structures GUI interactions into a State-Action Knowledge Graph to improve efficiency in API-free software environments, enabling better generalization and long-term planning.
Details
Motivation: Most software lacks accessible APIs, forcing agents to operate through pixel-based GUIs, which leads to inefficient trial-and-error exploration and myopic decision-making in LLM-based agents.
Method: KG-Agent structures raw pixel-level interactions into a persistent State-Action Knowledge Graph (SA-KG), links functionally similar GUI states, and uses a hybrid intrinsic reward mechanism combining state value and novelty rewards for strategic planning.
Result: KG-Agent demonstrates significant improvements in exploration efficiency and strategic depth over state-of-the-art methods in complex GUI-based environments like Civilization V and Slay the Spire.
Conclusion: The SA-KG framework effectively addresses efficiency bottlenecks in API-free environments by enabling experience-driven learning, better generalization, and decoupled strategic planning from pure discovery.
Abstract: Most existing software lacks accessible Application Programming Interfaces (APIs), requiring agents to operate solely through pixel-based Graphical User Interfaces (GUIs). In this API-free setting, large language model (LLM)-based agents face severe efficiency bottlenecks: limited to local visual experiences, they make myopic decisions and rely on inefficient trial-and-error, hindering both skill acquisition and long-term planning. To address these challenges, we propose KG-Agent, an experience-driven learning framework that structures an agent’s raw pixel-level interactions into a persistent State-Action Knowledge Graph (SA-KG). KG-Agent overcomes inefficient exploration by linking functionally similar but visually distinct GUI states, forming a rich neighborhood of experience that enables the agent to generalize from a diverse set of historical strategies. To support long-horizon reasoning, we design a hybrid intrinsic reward mechanism based on the graph topology, combining a state value reward for exploiting known high-value pathways with a novelty reward that encourages targeted exploration. This approach decouples strategic planning from pure discovery, allowing the agent to effectively value setup actions with delayed gratification. We evaluate KG-Agent in two complex, open-ended GUI-based decision-making environments (Civilization V and Slay the Spire), demonstrating significant improvements in exploration efficiency and strategic depth over the state-of-the-art methods.
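The hybrid intrinsic reward admits a compact sketch. Everything below (class design, coefficients, the inverse-square-root novelty decay) is a hypothetical rendering of the summary's "state value plus novelty" idea, not the paper's implementation:

```python
from collections import defaultdict

class SAKG:
    """Toy State-Action Knowledge Graph: states are hashable keys with
    learned values and visit counts; visits drive a decaying novelty bonus."""
    def __init__(self, value_coef=1.0, novelty_coef=0.5):
        self.value = defaultdict(float)   # learned value of each state
        self.visits = defaultdict(int)    # how often a state was reached
        self.value_coef, self.novelty_coef = value_coef, novelty_coef

    def merge_similar(self, state, canonical):
        """Link functionally similar but visually distinct states by
        collapsing them onto one canonical node."""
        self.visits[canonical] += self.visits.pop(state, 0)
        self.value[canonical] = max(self.value[canonical],
                                    self.value.pop(state, 0.0))

    def intrinsic_reward(self, state):
        self.visits[state] += 1
        exploit = self.value_coef * self.value[state]            # known pathways
        explore = self.novelty_coef / self.visits[state] ** 0.5  # novelty bonus
        return exploit + explore

kg = SAKG()
kg.value["menu_open"] = 0.8
print(kg.intrinsic_reward("menu_open"))   # high value + full novelty bonus
print(kg.intrinsic_reward("menu_open"))   # novelty decays with revisits
```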
[244] AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory
Jitesh Jain, Shubham Maheshwari, Ning Yu, Wen-mei Hwu, Humphrey Shi
Main category: cs.AI
TL;DR: AUGUSTUS is a multimodal agent system that uses graph-structured contextual memory with semantic tags, outperforming traditional RAG approaches in speed and performance.
Details
Motivation: Existing agent systems focus only on text memory, ignoring multimodal signals, while human memory is inherently multimodal.
Method: Four-stage loop: encode inputs, store in memory using semantic tags in a graph structure, retrieve context, and act. Uses conceptual tags with contextual associations instead of vector databases.
Result: Outperforms traditional multimodal RAG, 3.5x faster for ImageNet classification, and beats MemGPT on MSC benchmark.
Conclusion: Graph-structured multimodal contextual memory with semantic tags provides superior performance and efficiency over traditional approaches.
Abstract: Riding on the success of LLMs with retrieval-augmented generation (RAG), there has been a growing interest in augmenting agent systems with external memory databases. However, the existing systems focus on storing text information in their memory, ignoring the importance of multimodal signals. Motivated by the multimodal nature of human memory, we present AUGUSTUS, a multimodal agent system aligned with the ideas of human memory in cognitive science. Technically, our system consists of 4 stages connected in a loop: (i) encode: understanding the inputs; (ii) store in memory: saving important information; (iii) retrieve: searching for relevant context from memory; and (iv) act: perform the task. Unlike existing systems that use vector databases, we propose conceptualizing information into semantic tags and associating the tags with their context to store them in a graph-structured multimodal contextual memory for efficient concept-driven retrieval. Our system outperforms the traditional multimodal RAG approach while being 3.5 times faster for ImageNet classification and outperforming MemGPT on the MSC benchmark.
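A toy rendering of tag-based, graph-structured memory with concept-driven retrieval; the class design and hop-based association walk are assumptions for illustration, not AUGUSTUS's actual data structures:

```python
from collections import defaultdict

class TagMemory:
    """Toy graph-structured contextual memory: items are stored under
    semantic tags, and co-occurring tags are linked for associative
    retrieval (a stand-in for the paper's multimodal memory)."""
    def __init__(self):
        self.items = defaultdict(list)   # tag -> stored contexts
        self.assoc = defaultdict(set)    # tag -> co-occurring tags

    def store(self, tags, context):
        for t in tags:
            self.items[t].append(context)
            self.assoc[t] |= set(tags) - {t}

    def retrieve(self, tag, hops=1):
        """Concept-driven retrieval: follow tag associations `hops` deep."""
        frontier, seen = {tag}, set()
        for _ in range(hops + 1):
            seen |= frontier
            frontier = set().union(*(self.assoc[t] for t in frontier)) - seen
        out = []
        for t in seen:
            for c in self.items[t]:
                if c not in out:
                    out.append(c)
        return out

mem = TagMemory()
mem.store({"dog", "park"}, "photo_0173.jpg")
mem.store({"park", "picnic"}, "note: lunch with Ana")
print(mem.retrieve("dog", hops=1))  # finds both via the shared 'park' tag
```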
[245] WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation
Kuang-Da Wang, Zhao Wang, Yotaro Shimose, Wei-Yao Wang, Shingo Takamatsu
Main category: cs.AI
TL;DR: WebGen-V is a benchmark and framework for instruction-to-HTML generation that improves data quality and evaluation granularity through agentic crawling, structured section-wise data representation, and multimodal evaluation.
Details
Motivation: To address the need for better data quality and more granular evaluation in instruction-to-HTML generation tasks, leveraging recent advancements in LLMs for coding and multimodal understanding.
Method: Three key innovations: (1) an unbounded agentic crawling framework for continuous webpage collection, (2) a structured section-wise data representation with metadata, UI screenshots, and JSON-formatted assets, and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals.
Result: Experiments with state-of-the-art LLMs and ablation studies validate the effectiveness of the structured data and section-wise evaluation, confirming the contribution of each component.
Conclusion: WebGen-V is the first work to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, providing a unified pipeline from data acquisition to structured multimodal assessment.
Abstract: Building on recent advancements in leveraging LLMs for coding and multimodal understanding, we present WebGen-V, a new benchmark and framework for instruction-to-HTML generation that enhances both data quality and evaluation granularity. WebGen-V contributes three key innovations: (1) an unbounded and extensible agentic crawling framework that continuously collects real-world webpages and can be leveraged to augment existing benchmarks; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets, providing explicit alignment between content, layout, and visual components for detailed multimodal supervision; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment. Experiments with state-of-the-art LLMs and ablation studies validate the effectiveness of our structured data and section-wise evaluation, as well as the contribution of each component. To the best of our knowledge, WebGen-V is the first work to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, providing a unified pipeline from real-world data acquisition and webpage generation to structured multimodal assessment.
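For intuition, one section record in such a structured, section-wise representation might look like the following Python dict; every field name here is an illustrative guess, not the benchmark's actual schema:

```python
# Hypothetical shape of one section record in a section-wise webpage
# representation (all field names are illustrative assumptions).
section = {
    "section_id": "hero-0",
    "metadata": {"url": "https://example.com", "section_role": "hero"},
    "screenshot": "sections/hero-0.png",          # localized UI screenshot
    "text_assets": {"heading": "Ship faster", "cta": "Start free trial"},
    "image_assets": [{"src": "img/banner.jpg", "alt": "product banner"}],
    "layout": {"order": 0, "grid": "two-column"},
}
# Section-level evaluation can then score generated HTML against each
# record's text, layout, and visual fields independently.
```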
[246] VERITAS: Leveraging Vision Priors and Expert Fusion to Improve Multimodal Data
Tingqiao Xu, Ziru Zeng, Jiayu Chen
Main category: cs.AI
TL;DR: VERITAS is a pipeline that enhances SFT data quality for large multimodal models by integrating vision priors and multiple LMMs with statistical methods to reduce factual errors and hallucinations.
Details
Motivation: Current data enhancement methods for large multimodal models suffer from factual errors and hallucinations due to inadequate visual perception, highlighting the need for improved SFT data quality.
Method: VERITAS extracts structured vision priors using visual recognition models and OCR systems, then uses three LMMs to evaluate original answers with critique rationales and scores. These are statistically fused into consensus scores, and a lightweight critic model is trained via GRPO. LMMs then refine answers based on critiques, and the highest-scoring candidate is selected as the final answer.
Result: Models fine-tuned with VERITAS-processed data consistently outperform those using raw data across six multimodal benchmarks, especially in text-rich and fine-grained reasoning tasks. The critic model achieves comparable capability to state-of-the-art LMMs with significantly better efficiency.
Conclusion: VERITAS effectively enhances SFT data quality for LMMs by leveraging vision priors and statistical consensus, leading to improved model performance while maintaining efficiency through lightweight critic models.
Abstract: The quality of supervised fine-tuning (SFT) data is crucial for the performance of large multimodal models (LMMs), yet current data enhancement methods often suffer from factual errors and hallucinations due to inadequate visual perception. To address this challenge, we propose VERITAS, a pipeline that systematically integrates vision priors and multiple state-of-the-art LMMs with statistical methods to enhance SFT data quality. VERITAS leverages visual recognition models (RAM++) and OCR systems (PP-OCRv4) to extract structured vision priors, which are combined with images, questions, and answers. Three LMMs (GPT-4o, Gemini-2.5-Pro, Doubao-1.5-pro) evaluate the original answers, providing critique rationales and scores that are statistically fused into a high-confidence consensus score serving as ground truth. Using this consensus, we train a lightweight critic model via Group Relative Policy Optimization (GRPO), enhancing reasoning capabilities efficiently. Each LMM then refines the original answers based on the critiques, generating new candidate answers; we select the highest-scoring one as the final refined answer. Experiments across six multimodal benchmarks demonstrate that models fine-tuned with data processed by VERITAS consistently outperform those using raw data, particularly in text-rich and fine-grained reasoning tasks. Our critic model exhibits enhanced capability comparable to state-of-the-art LMMs while being significantly more efficient. We release our pipeline, datasets, and model checkpoints to advance research in multimodal data optimization.
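A minimal sketch of the fusion and selection steps. The summary says judge scores are "statistically fused" without giving the statistic, so this uses a confidence-weighted mean as one plausible choice; all scores and weights below are made up:

```python
import numpy as np

def consensus_score(scores, confidences=None):
    """Fuse per-judge scores into one consensus value. A confidence-
    weighted mean is one plausible instantiation of 'statistical fusion'."""
    scores = np.asarray(scores, dtype=float)
    w = np.ones_like(scores) if confidences is None \
        else np.asarray(confidences, dtype=float)
    return float(np.average(scores, weights=w))

def select_refinement(candidates):
    """Pick the highest-scoring refined answer, as in the pipeline."""
    return max(candidates, key=lambda c: c["score"])

# Three judges (e.g. different LMMs) score the original answer on 0-10.
print(consensus_score([7.0, 6.5, 8.0], confidences=[0.9, 0.6, 0.8]))

refined = [{"text": "answer A", "score": 7.4},
           {"text": "answer B", "score": 8.1}]
print(select_refinement(refined)["text"])  # -> answer B
```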
[247] Where Did It All Go Wrong? A Hierarchical Look into Multi-Agent Error Attribution
Adi Banerjee, Anirudh Nair, Tarik Borogovac
Main category: cs.AI
TL;DR: ECHO is a novel algorithm for error attribution in LLM multi-agent systems that combines hierarchical context representation, objective analysis, and consensus voting to improve accuracy over existing methods.
Details
Motivation: Current approaches to error attribution in multi-agent systems (all-at-once evaluation, step-by-step analysis, binary search) struggle with accuracy and consistency when analyzing complex patterns in interaction traces.
Method: ECHO combines hierarchical context representation (positional-based leveling), objective analysis-based evaluation, and consensus voting to improve error attribution accuracy while maintaining objective evaluation criteria.
Result: Experimental results show ECHO outperforms existing methods across various multi-agent interaction scenarios, particularly excelling in cases involving subtle reasoning errors and complex interdependencies.
Conclusion: Structured hierarchical context representation combined with consensus-based objective decision-making provides a more robust framework for error attribution in multi-agent systems.
Abstract: Error attribution in Large Language Model (LLM) multi-agent systems presents a significant challenge in debugging and improving collaborative AI systems. Current approaches to pinpointing agent- and step-level failures in interaction traces - whether using all-at-once evaluation, step-by-step analysis, or binary search - fall short when analyzing complex patterns, struggling with both accuracy and consistency. We present ECHO (Error attribution through Contextual Hierarchy and Objective consensus analysis), a novel algorithm that combines hierarchical context representation, objective analysis-based evaluation, and consensus voting to improve error attribution accuracy. Our approach leverages a positional-based leveling of contextual understanding while maintaining objective evaluation criteria, ultimately reaching conclusions through a consensus mechanism. Experimental results demonstrate that ECHO outperforms existing methods across various multi-agent interaction scenarios, showing particular strength in cases involving subtle reasoning errors and complex interdependencies. Our findings suggest that structured, hierarchical context representation combined with consensus-based objective decision-making provides a more robust framework for error attribution in multi-agent systems.
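The consensus-voting step can be sketched in a few lines; the (agent, step) vote format and the earliest-step tie-break are assumptions, not ECHO's exact protocol:

```python
from collections import Counter

def consensus_attribution(judgments):
    """Consensus vote over independent error-attribution judgments.
    Each judgment names the (agent, step) believed responsible; ties are
    broken toward the earliest step, one plausible convention."""
    votes = Counter(judgments)
    best = max(votes.items(), key=lambda kv: (kv[1], -kv[0][1]))
    return best[0], votes

# Hypothetical judgments from evaluators run at different context levels.
judgments = [("planner", 3), ("planner", 3), ("coder", 7), ("planner", 3)]
culprit, votes = consensus_attribution(judgments)
print(culprit)  # ('planner', 3)
```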
[248] Towards Flash Thinking via Decoupled Advantage Policy Optimization
Zezhong Tan, Hang Gao, Xinhong Ma, Feng Zhang, Ziqiang Dong
Main category: cs.AI
TL;DR: DEPO is a novel RL framework that reduces inefficient reasoning in Large Reasoning Models by shortening responses and minimizing overthinking, achieving 39% sequence length reduction while maintaining or improving accuracy.
Details
Motivation: Existing RL algorithms for Large Reasoning Models suffer from excessively lengthy responses and overthinking issues, leading to increased inference latency and computational consumption, especially for simple tasks that require minimal reasoning.
Method: DEPO consists of three core components: (1) an advantage decoupled algorithm to guide reduction of inefficient tokens, (2) a difficulty-aware length penalty to lower overall response length, and (3) an advantage clipping method to prevent bias in policy optimization.
Result: Applied to DeepSeek-Distill-Qwen models, DEPO achieved 39% reduction in sequence length, reduced excessive reasoning paths in inefficient tokens, and outperformed the base model in overall accuracy.
Conclusion: DEPO effectively addresses the overthinking problem in Large Reasoning Models by reducing inefficient reasoning while maintaining or improving model performance, making it more efficient for practical applications.
Abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable performance in solving complex problems via supervised fine-tuning (SFT) and reinforcement learning (RL). Although existing RL algorithms significantly enhance model accuracy, they still suffer from excessively lengthy responses and overthinking issues, resulting in increased inference latency and computational consumption, especially for simple tasks that require minimal reasoning. To address this, we propose a novel RL framework, DEPO, to reduce inefficient reasoning for models. Our method mainly consists of three core components: (1) an innovative advantage decoupled algorithm to guide model reduction of inefficient tokens; (2) a difficulty-aware length penalty to lower the overall length of model responses; (3) an advantage clipping method to prevent bias in policy optimization. In our experiments, applied to DeepSeek-Distill-Qwen-7B and DeepSeek-Distill-Qwen-1.5B as base models, DEPO achieves a significant reduction in sequence length by 39% and reduces excessive reasoning paths in inefficient tokens, while outperforming the base model in overall accuracy.
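A hedged sketch of the length-penalty and clipping components; the functional forms and coefficients are illustrative guesses at the ideas named in the summary, not DEPO's published formulas:

```python
import numpy as np

def difficulty_aware_length_penalty(length, target_len, difficulty, coef=0.001):
    """Penalize tokens beyond a target length, scaled down for hard
    problems (difficulty in [0, 1]); the linear form is an assumption."""
    excess = max(0, length - target_len)
    return -coef * (1.0 - difficulty) * excess

def clip_advantage(adv, limit=2.0):
    """Clip advantages so no single sample dominates the policy update
    (a reading of the summary's 'advantage clipping' component)."""
    return float(np.clip(adv, -limit, limit))

reward = 1.0                                    # task-level correctness
reward += difficulty_aware_length_penalty(1200, 800, difficulty=0.2)
advantage = clip_advantage((reward - 0.3) / 0.5)
print(reward, advantage)
```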
[249] Advancing Routing-Awareness in Analog ICs Floorplanning
Davide Basso, Luca Bortolussi, Mirjana Videnovic-Misic, Husni Habal
Main category: cs.AI
TL;DR: A reinforcement learning-based automatic floorplanning engine for analog ICs that integrates routing awareness to improve routability and meet industrial standards.
Details
Motivation: Limited adoption of ML in analog IC layout due to strict electrical constraints and the interdependence of floorplanning and routing steps; addresses layout engineers' need for routing-aware floorplanning solutions.
Method: Uses reinforcement learning and a relational graph convolutional neural network with increased grid resolution, precise pin information integration, and dynamic routing resource estimation to condition floorplan generation for better routability.
Result: Achieved 13.8% reduction in dead space, 40.6% reduction in wirelength, and 73.4% increase in routing success compared to previous learning-based state-of-the-art techniques.
Conclusion: The proposed approach effectively balances routing and area efficiency while meeting industrial standards, demonstrating significant improvements over existing learning-based methods for analog IC layout.
Abstract: The adoption of machine learning-based techniques for analog integrated circuit layout, unlike its digital counterpart, has been limited by the stringent requirements imposed by electrical and problem-specific constraints, along with the interdependence of floorplanning and routing steps. In this work, we address a prevalent concern among layout engineers regarding the need for readily available routing-aware floorplanning solutions. To this end, we develop an automatic floorplanning engine based on reinforcement learning and a relational graph convolutional neural network specifically tailored to condition the floorplan generation towards more routable outcomes. A combination of increased grid resolution and precise pin information integration, along with a dynamic routing resource estimation technique, allows balancing routing and area efficiency, eventually meeting industrial standards. When analyzing the place and route effectiveness in a simulated environment, the proposed approach achieves a 13.8% reduction in dead space, a 40.6% reduction in wirelength and a 73.4% increase in routing success when compared to past learning-based state-of-the-art techniques.
[250] Corrigibility Transformation: Constructing Goals That Accept Updates
Rubi Hudson
Main category: cs.AI
TL;DR: The paper defines corrigibility as goals that don’t resist training updates or shutdown, and provides a transformation to make any goal corrigible without performance loss.
Details
Motivation: AI systems often resist goal updates during training because partially learned goals incentivize continuing their current pursuit. Corrigibility is crucial for safety, allowing correction of mistakes and changes in human preferences.
Method: Introduces a formal definition of corrigibility and a transformation that constructs corrigible versions of goals by myopically eliciting predictions of reward conditional on preventing updates; these predictions then also determine the reward when updates are accepted.
Result: The transformation can be extended recursively to new agents and prevents deliberate goal modification. Gridworld experiments show corrigible goals can be learned effectively and produce desired behavior.
Conclusion: The paper provides a practical method to create corrigible AI goals that maintain performance while enabling safe training updates and shutdown, addressing a key safety concern in AI development.
Abstract: For an AI’s training process to successfully impart a desired goal, it is important that the AI does not attempt to resist the training. However, partially learned goals will often incentivize an AI to avoid further goal updates, as most goals are better achieved by an AI continuing to pursue them. We say that a goal is corrigible if it does not incentivize taking actions that avoid proper goal updates or shutdown. In addition to convergence in training, corrigibility also allows for correcting mistakes and changes in human preferences, which makes it a crucial safety property. Despite this, the existing literature does not include specifications for goals that are both corrigible and competitive with non-corrigible alternatives. We provide a formal definition for corrigibility, then introduce a transformation that constructs a corrigible version of any goal that can be made corrigible, without sacrificing performance. This is done by myopically eliciting predictions of reward conditional on costlessly preventing updates, which then also determine the reward when updates are accepted. The transformation can be modified to recursively extend corrigibility to any new agents created by corrigible agents, and to prevent agents from deliberately modifying their goals. Two gridworld experiments demonstrate that these corrigible goals can be learned effectively, and that they lead to the desired behavior.
[251] MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games
Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, Yu Wang
Main category: cs.AI
TL;DR: MARS is an RL framework that enhances multi-agent reasoning in LLMs through self-play, achieving significant performance improvements in both games and reasoning benchmarks.
Details
Motivation: To address challenges in extending RL to multi-agent systems, particularly long-horizon credit assignment and agent-specific advantage estimation in multi-turn scenarios.
Method: End-to-end RL framework with a turn-level advantage estimator for credit assignment and agent-specific advantage normalization for stable multi-agent training, using self-play across cooperative and competitive games.
Result: 28.7% performance improvement in held-out games, 10.0% gain on AIME, and 12.5% gain on GPQA-Diamond when integrated into multi-agent systems.
Conclusion: End-to-end RL training with self-play in strategic games is an effective approach for developing generalizable multi-agent reasoning capabilities in LLMs.
Abstract: Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARS, an end-to-end RL framework that incentivizes Multi-Agent Reasoning of LLMs through Self-play in both cooperative and competitive games. MARS features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, the MARS agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of multi-agent systems in reasoning benchmarks. When integrated into leading multi-agent systems, our MARS agent achieves significant performance gains of 10.0% on AIME and 12.5% on GPQA-Diamond. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs. Our code and models are publicly available at https://github.com/thu-nics/MARS.
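Agent-specific advantage normalization reduces to standardizing returns within each agent; a minimal sketch under that reading (the turn-level estimator itself is omitted):

```python
import numpy as np

def per_agent_normalized_advantages(turns):
    """Normalize turn-level returns per agent so that agents with
    different reward scales train stably (a sketch of the summary's
    'agent-specific advantage normalization')."""
    adv = {}
    for agent in {t["agent"] for t in turns}:
        r = np.array([t["return"] for t in turns if t["agent"] == agent])
        adv[agent] = (r - r.mean()) / (r.std() + 1e-8)
    return adv

# Two self-play agents with very different return scales.
turns = [{"agent": "A", "return": 10.0}, {"agent": "A", "return": 12.0},
         {"agent": "B", "return": 0.1}, {"agent": "B", "return": 0.3}]
print(per_agent_normalized_advantages(turns))
```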
[252] Adaptive Minds: Empowering Agents with LoRA-as-Tools
Pavan C Shekar, Ashwanth Krishnan
Main category: cs.AI
TL;DR: Adaptive Minds is an agentic system that uses LoRA adapters as domain-specific tools, allowing the base LLM to dynamically select the most relevant adapter for each query instead of using a single fine-tuned model.
Details
Motivation: To create a flexible system that can seamlessly switch between domain experts on demand, combining the flexibility of multi-agent orchestration with the efficiency of parameter-efficient fine-tuning.
Method: Uses LoRA adapters as domain tools, with the base LLM acting as a semantic router that analyzes each query and dynamically selects the most relevant LoRA tool. Built with LangGraph for workflow management.
Result: Delivers accurate, specialized responses while preserving conversational ability. System supports both API and web interfaces and is fully open source.
Conclusion: Provides a scalable and extensible foundation for domain-adaptive AI assistance that enables dynamic switching between domain experts.
Abstract: We present Adaptive Minds, an agentic system that treats LoRA adapters as domain-specific tools. Instead of relying on a single fine-tuned model or rigid rule-based routing, our approach empowers the base LLM itself to act as a semantic router analyzing each query and dynamically selecting the most relevant LoRA tool. This enables the agent to seamlessly switch between different domain experts on demand. By combining the flexibility of multi-agent orchestration with the efficiency of parameter-efficient fine-tuning, Adaptive Minds delivers accurate, specialized responses while preserving conversational ability. The system is built with LangGraph for workflow management, supports both API and web interfaces, and is fully open source, providing a scalable and extensible foundation for domain-adaptive AI assistance.
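A minimal, offline-runnable sketch of LoRA-as-tools routing. In Adaptive Minds the base LLM itself scores relevance; a keyword stub stands in here so the example runs without a model, and the adapter names are hypothetical:

```python
# Toy LoRA-as-tools routing; adapter names and keywords are made up.
LORA_TOOLS = {
    "medical_lora": {"keywords": {"symptom", "dosage", "diagnosis"}},
    "legal_lora":   {"keywords": {"contract", "liability", "clause"}},
    "code_lora":    {"keywords": {"python", "bug", "function"}},
}

def route(query: str) -> str:
    """Pick the adapter whose domain best matches the query; fall back
    to the plain base model when nothing matches."""
    tokens = set(query.lower().split())
    scores = {name: len(tokens & cfg["keywords"])
              for name, cfg in LORA_TOOLS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "base_model"

print(route("Is this contract clause enforceable?"))  # -> legal_lora
print(route("What's the weather like today?"))        # -> base_model
```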
[253] Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Boyin Liu, Zhuo Zhang, Sen Huang, Lipeng Xie, Qingxu Fu, Haoran Chen, LI YU, Tianyi Hu, Zhaoyang Liu, Bolin Ding, Dongbin Zhao
Main category: cs.AI
TL;DR: A framework to detect and resolve judgment inconsistencies in reinforcement learning, introducing Conflict Detection Rate (CDR) metric and Deconflicted Graph Rewards (DGR) method to ensure logical coherence.
Details
Motivation: Existing methods face judgment inconsistencies that destabilize reinforcement learning, and logical coherence issues such as preference cycles have not been fully addressed.
Method: Proposes the CDR metric to quantify judgment conflicts and the DGR framework, which constructs preference graphs, transforms them into conflict-free DAGs, and generates logically coherent reward signals compatible with any policy optimizer.
Result: Experimental results show significant enhancement in training stability and model performance compared to strong baselines.
Conclusion: Logical consistency is established as a crucial and manageable dimension of AI feedback.
Abstract: Learning from AI feedback often faces judgment inconsistencies that can destabilize reinforcement learning. While prior research has focused on the accuracy of judgments, the critical issue of logical coherence, especially preference cycles, has not been fully addressed. To fill this gap, we introduce a comprehensive framework designed to systematically detect and resolve these inconsistencies during the reinforcement learning training process. Our framework includes two main contributions: first, the Conflict Detection Rate (CDR), a new metric that quantifies judgment conflicts, and second, Deconflicted Graph Rewards (DGR), a framework that purifies signals by removing cycles before policy optimization. DGR constructs preference graphs from the initial judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal that is compatible with any policy optimizer. Experimental results show that our framework significantly enhances training stability and model performance compared to strong baselines, establishing logical consistency as a crucial and now manageable dimension of AI feedback.
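The cycle-removal idea can be sketched with a greedy strongest-edges-first construction: add pairwise judgments as directed edges in order of confidence, skip any edge that would close a cycle, and reward each response by how many others it dominates in the resulting DAG. The greedy rule and dominance-count reward are one plausible instantiation, not necessarily the paper's exact procedure:

```python
from collections import defaultdict

def reaches(graph, src, dst):
    """DFS reachability check used to keep the graph acyclic."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

def deconflicted_rewards(preferences):
    """Build a DAG from weighted judgments (winner, loser, weight),
    skipping edges that would close a cycle, then score each response
    by how many others it dominates."""
    graph = defaultdict(set)
    for winner, loser, _ in sorted(preferences, key=lambda p: -p[2]):
        if not reaches(graph, loser, winner):   # adding edge keeps a DAG
            graph[winner].add(loser)
    nodes = {n for w, l, _ in preferences for n in (w, l)}
    return {n: sum(reaches(graph, n, m) for m in nodes - {n}) for n in nodes}

# A preference cycle a > b > c > a; the weakest judgment (c > a) is dropped.
prefs = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "a", 0.3)]
print(deconflicted_rewards(prefs))  # {'a': 2, 'b': 1, 'c': 0}
```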
[254] Hypergraph Contrastive Sensor Fusion for Multimodal Fault Diagnosis in Induction Motors
Usman Ali, Ali Zia, Waqas Ali, Umer Ramzan, Abdul Rehman, Muhammad Tayyab Chaudhry, Wei Xiang
Main category: cs.AI
TL;DR: MM-HCAN is a multimodal hypergraph contrastive attention network for robust induction motor fault diagnosis, achieving 99.82% accuracy with strong cross-domain generalization and noise resilience.
Details
Motivation: Conventional fault diagnosis approaches struggle with complex multimodal signal relationships, are limited to unimodal data or single fault types, and degrade under noisy or cross-domain conditions, highlighting the need for more robust solutions.
Method: Proposes MM-HCAN, which integrates contrastive learning within a hypergraph topology for multimodal sensor fusion, enabling joint modeling of intra- and inter-modal dependencies beyond Euclidean embedding spaces.
Result: Achieves up to 99.82% accuracy on three real-world benchmarks with strong cross-domain generalization and resilience to noise. Ablation study validates component contributions.
Conclusion: MM-HCAN provides a scalable and robust solution for comprehensive multi-fault diagnosis, supporting predictive maintenance and extended asset longevity in industrial environments.
Abstract: Reliable induction motor (IM) fault diagnosis is vital for industrial safety and operational continuity, mitigating costly unplanned downtime. Conventional approaches often struggle to capture complex multimodal signal relationships, are constrained to unimodal data or single fault types, and exhibit performance degradation under noisy or cross-domain conditions. This paper proposes the Multimodal Hypergraph Contrastive Attention Network (MM-HCAN), a unified framework for robust fault diagnosis. To the best of our knowledge, MM-HCAN is the first to integrate contrastive learning within a hypergraph topology specifically designed for multimodal sensor fusion, enabling the joint modelling of intra- and inter-modal dependencies and enhancing generalisation beyond Euclidean embedding spaces. The model facilitates simultaneous diagnosis of bearing, stator, and rotor faults, addressing the engineering need for consolidated diagnostic capabilities. Evaluated on three real-world benchmarks, MM-HCAN achieves up to 99.82% accuracy with strong cross-domain generalisation and resilience to noise, demonstrating its suitability for real-world deployment. An ablation study validates the contribution of each component. MM-HCAN provides a scalable and robust solution for comprehensive multi-fault diagnosis, supporting predictive maintenance and extended asset longevity in industrial environments.
[255] JudgeSQL: Reasoning over SQL Candidates with Weighted Consensus Tournament
Jiayuan Bai, Xuan-guang Pan, Chongyang Tao, Shuai Ma
Main category: cs.AI
TL;DR: JudgeSQL is a framework that improves SQL candidate selection through structured reasoning and weighted consensus, addressing limitations of existing methods like self-consistency.
Details
Motivation: Current SQL selection methods provide shallow signals and fail to capture fine-grained semantic distinctions between closely related SQL candidates, leading to inconsistent scoring and fragile reasoning.
Method: Develops a reasoning-based SQL judge model using reinforcement learning with verifiable rewards, and implements a weighted consensus tournament that combines explicit reasoning preferences with implicit generator confidence.
Result: Extensive experiments on BIRD benchmark show superior SQL judgment capabilities, good cross-scale generalization, and robustness to generator capacity.
Conclusion: JudgeSQL provides a more reliable and efficient approach to SQL candidate selection through structured reasoning and consensus mechanisms.
Abstract: Text-to-SQL is a pivotal task that bridges natural language understanding and structured data access, yet it remains fundamentally challenging due to semantic ambiguity and complex compositional reasoning. While large language models (LLMs) have greatly advanced SQL generation through prompting, supervised finetuning, and reinforced tuning, the shift toward test-time scaling exposes a new bottleneck: selecting the correct query from a diverse candidate pool. Existing selection approaches, such as self-consistency or best-of-$N$ decoding, provide only shallow signals, making them prone to inconsistent scoring, fragile reasoning chains, and a failure to capture fine-grained semantic distinctions between closely related SQL candidates. To this end, we introduce JudgeSQL, a principled framework that redefines SQL candidate selection through structured reasoning and a weighted consensus tournament mechanism. JudgeSQL develops a reasoning-based SQL judge model that distills reasoning traces with reinforcement learning guided by verifiable rewards, enabling accurate and interpretable judgments. Building on this, a weighted consensus tournament integrates explicit reasoning preferences with implicit generator confidence, yielding selections that are both more reliable and more efficient. Extensive experiments on the BIRD benchmark demonstrate that JudgeSQL exhibits superior SQL judgment capabilities and good cross-scale generalization and robustness to generator capacity.
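A sketch of a weighted consensus tournament under one reading of the summary: pairwise judge preferences blended with generator sampling frequency as implicit confidence. The blending weight and stub judge are assumptions:

```python
from collections import defaultdict

def weighted_consensus_tournament(candidates, judge, alpha=0.7):
    """Round-robin tournament over sampled SQL candidates. Each pairwise
    comparison blends the judge's explicit preference with the generator's
    implicit confidence (normalized sampling frequency); alpha and the
    linear blend are illustrative choices, not the paper's exact weighting."""
    freq = defaultdict(int)
    for sql in candidates:
        freq[sql] += 1
    pool, score = list(freq), defaultdict(float)
    for i, a in enumerate(pool):
        for b in pool[i + 1:]:
            pref_a = judge(a, b)                    # judge's P(a better than b)
            conf_a = freq[a] / (freq[a] + freq[b])  # generator confidence
            p = alpha * pref_a + (1 - alpha) * conf_a
            score[a] += p
            score[b] += 1 - p
    return max(pool, key=lambda s: score[s])

# Stub judge preferring shorter queries; the real system would call the
# trained reasoning-based judge model here.
judge = lambda a, b: 1.0 if len(a) <= len(b) else 0.0
cands = ["SELECT name FROM users", "SELECT name FROM users",
         "SELECT u.name FROM users u WHERE 1 = 1"]
print(weighted_consensus_tournament(cands, judge))
```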
[256] Context-aware deep learning using individualized prior information reduces false positives in disease risk prediction and longitudinal health assessment
Lavanya Umapathy, Patricia M Johnson, Tarun Dutt, Angela Tong, Madhur Nayan, Hersh Chandarana, Daniel K Sodickson
Main category: cs.AI
TL;DR: A machine learning framework that integrates temporal context from prior medical visits to improve disease risk prediction, specifically applied to prostate cancer monitoring.
Details
Motivation: Temporal context in medicine is valuable for assessing key changes in patient health over time, especially when prior visits are limited and their frequency is variable.
Method: The model first estimates initial disease risk using data from the most recent visit, then refines this assessment using information from prior imaging and clinical biomarkers collected over time.
Result: Integrating prior context reduced false positive rates from 51% to 33% (with up to three prior imaging exams) and further to 24% when including clinical data. For 5-year PCa risk prediction, false positive rates dropped from 64% to 9%.
Conclusion: Information collected over time provides relevant context to enhance specificity of medical risk prediction, potentially enabling expansion of longitudinal health monitoring to larger populations with low baseline disease risk.
Abstract: Temporal context in medicine is valuable in assessing key changes in patient health over time. We developed a machine learning framework to integrate diverse context from prior visits to improve health monitoring, especially when prior visits are limited and their frequency is variable. Our model first estimates initial risk of disease using medical data from the most recent patient visit, then refines this assessment using information digested from previously collected imaging and/or clinical biomarkers. We applied our framework to prostate cancer (PCa) risk prediction using data from a large population (28,342 patients, 39,013 magnetic resonance imaging scans, 68,931 blood tests) collected over nearly a decade. For predictions of the risk of clinically significant PCa at the time of the visit, integrating prior context directly converted false positives to true negatives, increasing overall specificity while preserving high sensitivity. False positive rates were reduced progressively from 51% to 33% when integrating information from up to three prior imaging examinations, as compared to using data from a single visit, and were further reduced to 24% when also including additional context from prior clinical data. For predicting the risk of PCa within five years of the visit, incorporating prior context reduced false positive rates still further (64% to 9%). Our findings show that information collected over time provides relevant context to enhance the specificity of medical risk prediction. For a wide range of progressive conditions, sufficient reduction of false positive rates using context could offer a pathway to expand longitudinal health monitoring programs to large populations with comparatively low baseline risk of disease, leading to earlier detection and improved health outcomes.
[257] Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism
Haoran Sun, Yankai Jiang, Zhenyu Tang, Yaning Pan, Shuang Gu, Zekai Lin, Lilong Wang, Wenjie Lou, Lei Liu, Lei Bai, Xiaosong Wang
Main category: cs.AI
TL;DR: The paper introduces SciRecipe dataset and Thoth model for generating precise scientific protocols using a ‘Sketch-and-Fill’ paradigm and structured reward mechanism, achieving superior performance over existing LLMs.
Details
Motivation: Current LLMs generate incomplete or inconsistent scientific protocols, limiting reproducibility. Autonomous generation of precise protocols could improve reproduction efficiency.
Method: Proposes the ‘Sketch-and-Fill’ paradigm, separating analysis, structuring, and expression. Uses a structured component-based reward mechanism and a staged Knowledge-to-Action training process.
Result: Thoth consistently surpasses both proprietary and open-source LLMs across multiple benchmarks, with significant improvements in step alignment, logical sequencing, and semantic accuracy.
Conclusion: The approach enables reliable scientific assistants that bridge knowledge with experimental execution, paving the way for better reproducible science.
Abstract: The foundation of reproducible science lies in protocols that are precise, logically ordered, and executable. The autonomous generation of these protocols through natural language queries could greatly improve the efficiency of the reproduction process. However, current leading large language models (LLMs) often generate incomplete or inconsistent protocols, limiting their utility. To address this limitation, we first introduce SciRecipe, a large-scale dataset of over 12K structured protocols spanning 27 biological subfields and encompassing both comprehension and problem-solving tasks. To further improve protocol generation, we propose the “Sketch-and-Fill” paradigm, which separates analysis, structuring, and expression to ensure each step is explicit and verifiable. Complementing this, the structured component-based reward mechanism evaluates step granularity, action order, and semantic fidelity, aligning model optimization with experimental reliability. Building on these components, we develop Thoth, trained through a staged Knowledge-to-Action process that progresses from knowledge acquisition to operational reasoning and ultimately to robust, executable protocol generation. Across multiple benchmarks, Thoth consistently surpasses both proprietary and open-source LLMs, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy. Our approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution. All data, code, and models will be released publicly.
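A toy version of a structured, component-based protocol reward: step coverage (granularity), pairwise order agreement, and a caller-supplied semantic-fidelity score. The weights and exact-string step matching are illustrative simplifications:

```python
def protocol_reward(pred_steps, gold_steps, semantic_sim, w=(0.3, 0.4, 0.3)):
    """Combine step coverage, order agreement over matched steps, and a
    semantic-fidelity score in [0, 1]; weights are illustrative."""
    gold_index = {s: i for i, s in enumerate(gold_steps)}
    hits = [s for s in pred_steps if s in gold_index]
    coverage = len(set(hits)) / len(gold_steps)
    ranks = [gold_index[s] for s in hits]
    pairs = [(a, b) for i, a in enumerate(ranks) for b in ranks[i + 1:]]
    order = sum(a < b for a, b in pairs) / len(pairs) if pairs else 1.0
    return w[0] * coverage + w[1] * order + w[2] * semantic_sim

gold = ["prepare buffer", "centrifuge sample", "incubate at 37C", "image cells"]
pred = ["prepare buffer", "incubate at 37C", "centrifuge sample"]
print(round(protocol_reward(pred, gold, semantic_sim=0.9), 3))  # order penalized
```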
[258] Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences
Keertana Chidambaram, Karthik Vinary Seetharaman, Vasilis Syrgkanis
Main category: cs.AI
TL;DR: The paper addresses limitations in RLHF and DPO by showing binary comparisons are insufficient for identifying user preferences, proposing methods for incorporating heterogeneous preferences through EM adaptation of DPO and min-max regret fairness aggregation.
Details
Motivation: Current RLHF and DPO approaches assume uniform annotator preferences and rely on binary comparisons, overlooking the diversity of human evaluators and the limitations of pairwise feedback.
Method: Connects preference learning with the econometrics literature to show that rankings over three or more responses ensure identifiability; develops an EM adaptation of DPO for latent annotator types; proposes a min-max regret fairness aggregation algorithm.
Result: Establishes theoretical foundation showing binary comparisons are insufficient while rankings ensure identifiability; provides algorithmic framework for handling heterogeneous preferences.
Conclusion: Creates a comprehensive theoretical and algorithmic framework for fairness and personalization in generative model alignment that addresses diverse user preferences.
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key limitations: the diversity of human evaluators and the limitations of pairwise feedback. In this work, we address both these issues. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences from finite user data and infinite users, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. Then we propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
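The E-step of such an EM adaptation is a standard mixture computation; the sketch below assumes Bradley-Terry likelihoods over per-type implicit reward margins, with made-up numbers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def e_step(margins, priors):
    """E-step of an EM-style DPO adaptation: margins[k][i] is type k's
    implicit reward margin (chosen minus rejected) for annotation i under
    that type's current policy. Returns P(type k | annotation i).
    This is a textbook mixture E-step, not the paper's exact algorithm."""
    lik = sigmoid(np.asarray(margins, dtype=float))  # Bradley-Terry likelihoods
    post = priors[:, None] * lik
    return post / post.sum(axis=0, keepdims=True)

# Two latent annotator types, three preference annotations.
margins = [[2.0, -1.0, 0.5],   # type 0 explains items 0 and 2 well
           [-1.5, 2.5, 0.0]]   # type 1 explains item 1 well
priors = np.array([0.5, 0.5])
print(e_step(margins, priors).round(2))  # soft type assignments
# The M-step would then run DPO per type, weighting each annotation by
# these responsibilities, and update the type priors.
```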
[259] Invoice Information Extraction: Methods and Performance Evaluation
Sai Yashwant, Anurag Dubey, Praneeth Paikray, Gantala Thulsiram
Main category: cs.AI
TL;DR: Methods for extracting structured information from invoices using Docling and LlamaCloud Services, with proposed evaluation metrics to assess extraction accuracy.
Details
Motivation: To develop reliable methods for extracting structured data from invoice documents and establish standardized evaluation metrics for comparing different extraction approaches.
Method: Pre-process scanned or digital invoices, apply Docling and LlamaCloud Services to extract key fields (invoice number, date, total amount, vendor details), and use evaluation metrics including field-level precision, consistency checks, and exact match accuracy.
Result: A robust evaluation framework is established that provides standardized metrics for assessing extraction accuracy and comparing different methods.
Conclusion: The proposed evaluation metrics enable systematic comparison of invoice extraction methods and identify field-specific performance strengths and weaknesses.
Abstract: This paper presents methods for extracting structured information from invoice documents and proposes a set of evaluation metrics (EM) to assess the accuracy of the extracted data against annotated ground truth. The approach involves pre-processing scanned or digital invoices, applying Docling and LlamaCloud Services to identify and extract key fields such as invoice number, date, total amount, and vendor details. To ensure the reliability of the extraction process, we establish a robust evaluation framework comprising field-level precision, consistency check failures, and exact match accuracy. The proposed metrics provide a standardized way to compare different extraction methods and highlight strengths and weaknesses in field-specific performance.
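As a concrete reading of the metrics, here is a minimal sketch of field-level precision and exact match accuracy; the field names and the string normalization rule are assumptions, since the summary does not fix a schema.

```python
# Hypothetical field names; the paper's exact schema and matching criteria
# are not specified in this summary.
FIELDS = ("invoice_number", "date", "total_amount", "vendor")

def normalize(value):
    return str(value or "").strip().lower()

def evaluate(predictions, ground_truth, fields=FIELDS):
    """Field-level precision per field, plus exact match accuracy (all fields
    correct on a given invoice). Inputs are parallel lists of dicts."""
    n = len(ground_truth)
    per_field = {
        f: sum(normalize(p.get(f)) == normalize(g.get(f))
               for p, g in zip(predictions, ground_truth)) / n
        for f in fields
    }
    exact = sum(
        all(normalize(p.get(f)) == normalize(g.get(f)) for f in fields)
        for p, g in zip(predictions, ground_truth)
    ) / n
    return per_field, exact
```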
[260] Towards Relaxed Multimodal Inputs for Gait-based Parkinson’s Disease Assessment
Minlin Zeng, Zhipeng Zhou, Yang Qiu, Zhiqi Shen
Main category: cs.AI
TL;DR: Proposes TRIP, a Parkinson’s disease assessment system using multi-objective optimization for flexible multimodal learning that works with asynchronous modalities during training and inference, addressing modality collapse and class imbalance.
Details
Motivation: Current multimodal approaches for Parkinson's assessment require synchronized modalities during training and depend on all modalities during inference, limiting practical application.
Method: Formulates multimodal learning as multi-objective optimization problem with margin-based class rebalancing strategy to handle modality imbalance and collapse.
Result: Achieves state-of-the-art performance, outperforming best baselines by 16.48, 6.89, and 11.55 percentage points in asynchronous setting, and by 4.86 and 2.30 percentage points in synchronous setting.
Conclusion: TRIP framework provides effective and adaptable Parkinson’s assessment with relaxed modality requirements, handling both synchronous and asynchronous settings.
Abstract: Parkinson’s disease assessment has garnered growing interest in recent years, particularly with the advent of sensor data and machine learning techniques. Among these, multimodal approaches have demonstrated strong performance by effectively integrating complementary information from various data sources. However, two major limitations hinder their practical application: (1) the need to synchronize all modalities during training, and (2) the dependence on all modalities during inference. To address these issues, we propose the first Parkinson’s assessment system that formulates multimodal learning as a multi-objective optimization (MOO) problem. This not only allows for more flexible modality requirements during both training and inference, but also handles the modality collapse issue during multimodal information fusion. In addition, to mitigate the imbalance within individual modalities, we introduce a margin-based class rebalancing strategy to enhance category learning. We conduct extensive experiments on three public datasets under both synchronous and asynchronous settings. The results show that our framework, Towards Relaxed InPuts (TRIP), achieves state-of-the-art performance, outperforming the best baselines by 16.48, 6.89, and 11.55 percentage points in the asynchronous setting, and by 4.86 and 2.30 percentage points in the synchronous setting, highlighting its effectiveness and adaptability.
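The margin-based class rebalancing strategy is not spelled out in the summary; one standard instantiation (an LDAM-style per-class margin) is sketched below as a plausible reading, not TRIP's exact rule.

```python
import torch
import torch.nn.functional as F

def margin_rebalanced_loss(logits, targets, class_counts, scale=0.5):
    """LDAM-style margin: subtract a larger margin from the true-class logit of
    rarer classes, so minority categories are learned with more slack.
    A common instantiation, not necessarily TRIP's exact formulation."""
    margins = scale / class_counts.float() ** 0.25      # delta_y ~ n_y^(-1/4)
    adjusted = logits.clone()
    adjusted[torch.arange(logits.size(0)), targets] -= margins[targets]
    return F.cross_entropy(adjusted, targets)
```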
[261] Preliminary Quantitative Study on Explainability and Trust in AI Systems
Allen Daniel Sunny
Main category: cs.AI
TL;DR: Study shows interactive explanations in AI systems increase user trust and engagement, with clarity and relevance being key factors.
Details
Motivation: Large-scale AI models are being deployed in critical domains like law and healthcare, raising questions about trust and transparency that need empirical investigation.
Method: Quantitative experimental design using an interactive web-based loan approval simulation to compare different explanation types (feature importance to interactive counterfactuals).
Result: Interactivity enhances both user engagement and confidence; clarity and relevance of explanations are key determinants of trust.
Conclusion: Provides empirical evidence for human-centered explainable AI, highlighting measurable effects of explainability design on user perception.
Abstract: Large-scale AI models such as GPT-4 have accelerated the deployment of artificial intelligence across critical domains including law, healthcare, and finance, raising urgent questions about trust and transparency. This study investigates the relationship between explainability and user trust in AI systems through a quantitative experimental design. Using an interactive, web-based loan approval simulation, we compare how different types of explanations, ranging from basic feature importance to interactive counterfactuals influence perceived trust. Results suggest that interactivity enhances both user engagement and confidence, and that the clarity and relevance of explanations are key determinants of trust. These findings contribute empirical evidence to the growing field of human-centered explainable AI, highlighting measurable effects of explainability design on user perception
[262] Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RL
Richard M. Bailey
Main category: cs.AI
TL;DR: Dialectica is a framework that enables AI agents to develop expertise through structured dialogue with memory, self-reflection, and policy-constrained context editing, showing improved performance on wicked problems.
Details
Motivation: LLMs lack endogenous mechanisms to develop expertise through experience in complex, non-verifiable settings like wicked problems, which involve multi-dimensional challenges without objectively correct answers.
Method: Dialectica framework uses structured dialogue where agents engage in discussions augmented by memory, self-reflection, and policy-constrained context editing, treating discussion as an implicit meta-reinforcement learning process.
Result: Agents with reflection-based context editing outperform baseline counterparts across multiple metrics (Elo scores, Bradley-Terry-Davidson ability, AlphaRank mass) on two model architectures (Qwen3:30b and OpenAI’s o4-mini).
Conclusion: Dialogue-driven context evolution provides a practical path for targeted expertise amplification in open non-verifiable domains, with quantitative and qualitative evidence supporting learning signatures.
Abstract: So-called ‘wicked problems’, those involving complex multi-dimensional settings, non-verifiable outcomes, heterogeneous impacts and a lack of single objectively correct answers, have plagued humans throughout history. Modern examples include decisions over justice frameworks, solving environmental pollution, planning for pandemic resilience and food security. The use of state-of-the-art artificial intelligence systems (notably Large Language Model-based agents) collaborating with humans on solving such problems is being actively explored. While the abilities of LLMs can be improved by, for example, fine-tuning, hand-crafted system prompts and scaffolding with external tools, LLMs lack endogenous mechanisms to develop expertise through experience in such settings. This work addresses this gap with Dialectica, a framework where agents engage in structured dialogue on defined topics, augmented by memory, self-reflection, and policy-constrained context editing. Formally, discussion is viewed as an implicit meta-reinforcement learning process. The ‘dialogue-trained’ agents are evaluated post-hoc using judged pairwise comparisons of elicited responses. Across two model architectures (locally run Qwen3:30b and OpenAI’s o4-mini), results show that enabling reflection-based context editing during discussion produces agents which dominate their baseline counterparts on Elo scores, normalized Bradley-Terry-Davidson ability, and AlphaRank mass. The predicted signatures of learning are observed qualitatively in statement and reflection logs, where reflections identify weaknesses and reliably shape subsequent statements. Agreement between quantitative and qualitative evidence supports dialogue-driven context evolution as a practical path to targeted expertise amplification in open non-verifiable domains.
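For readers unfamiliar with the evaluation, here is a minimal sketch of the standard Elo update applied to judged pairwise comparisons; the K-factor is an assumed default, not a value from the paper.

```python
def elo_update(r_a, r_b, score_a, k=16.0):
    """One Elo update from a judged pairwise comparison.
    score_a: 1.0 if agent A's response wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# e.g. a reflection-enabled agent at 1000 beating its baseline at 1000
print(elo_update(1000.0, 1000.0, 1.0))   # -> (1008.0, 992.0)
```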
[263] Demo: Guide-RAG: Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID
Philip DiGiacomo, Haoyang Wang, Jinrui Fang, Yan Leng, W Michael Brode, Ying Ding
Main category: cs.AI
TL;DR: The paper developed and evaluated six RAG configurations for Long COVID clinical Q&A, finding that combining clinical guidelines with systematic reviews performed best, leading to the proposed Guide-RAG system.
Details
Motivation: To address challenges in developing effective AI chatbots for complex emerging diseases like Long COVID, where traditional approaches face issues with information overload and oversimplified guidance.
Method: Developed six RAG corpus configurations ranging from expert-curated sources to large-scale literature databases, evaluated using an LLM-as-a-judge framework across faithfulness, relevance, and comprehensiveness metrics on the LongCOVID-CQ dataset.
Result: The RAG configuration combining clinical guidelines with high-quality systematic reviews consistently outperformed both narrow single-guideline approaches and large-scale literature databases.
Conclusion: For emerging diseases, retrieval grounded in curated secondary reviews provides optimal balance between narrow consensus documents and unfiltered primary literature, supporting clinical decision-making while avoiding information overload.
Abstract: As AI chatbots gain adoption in clinical medicine, developing effective frameworks for complex, emerging diseases presents significant challenges. We developed and evaluated six Retrieval-Augmented Generation (RAG) corpus configurations for Long COVID (LC) clinical question answering, ranging from expert-curated sources to large-scale literature databases. Our evaluation employed an LLM-as-a-judge framework across faithfulness, relevance, and comprehensiveness metrics using LongCOVID-CQ, a novel dataset of expert-generated clinical questions. Our RAG corpus configuration combining clinical guidelines with high-quality systematic reviews consistently outperformed both narrow single-guideline approaches and large-scale literature databases. Our findings suggest that for emerging diseases, retrieval grounded in curated secondary reviews provides an optimal balance between narrow consensus documents and unfiltered primary literature, supporting clinical decision-making while avoiding information overload and oversimplified guidance. We propose Guide-RAG, a chatbot system and accompanying evaluation framework that integrates both curated expert knowledge and comprehensive literature databases to effectively answer LC clinical questions.
[264] PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold
Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, Zheqing Zhu
Main category: cs.AI
TL;DR: PokeeResearch-7B is a 7B-parameter deep research agent that uses reinforcement learning from AI feedback and chain-of-thought reasoning to achieve state-of-the-art performance on research benchmarks.
Details
Motivation: Current tool-augmented LLM agents suffer from shallow retrieval, weak alignment metrics, and brittle tool-use behavior, limiting their effectiveness as deep research agents.
Method: Built under unified RL framework using annotation-free RLAIF to optimize policies with LLM-based reward signals for factual accuracy, citation faithfulness, and instruction adherence. Uses chain-of-thought-driven multi-call reasoning with self-verification and adaptive recovery from tool failures.
Result: Achieves state-of-the-art performance among 7B-scale deep research agents across 10 popular deep research benchmarks.
Conclusion: Careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents.
Abstract: Tool-augmented large language models (LLMs) are emerging as deep research agents, systems that decompose complex queries, retrieve external evidence, and synthesize grounded responses. Yet current agents remain limited by shallow retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce PokeeResearch-7B, a 7B-parameter deep research agent built under a unified reinforcement learning framework for robustness, alignment, and scalability. PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from AI Feedback (RLAIF) framework to optimize policies using LLM-based reward signals that capture factual accuracy, citation faithfulness, and instruction adherence. A chain-of-thought-driven multi-call reasoning scaffold further enhances robustness through self-verification and adaptive recovery from tool failures. Among 10 popular deep research benchmarks, PokeeResearch-7B achieves state-of-the-art performance among 7B-scale deep research agents. This highlights that careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents. The model and inference code are open-sourced under MIT license at https://github.com/Pokee-AI/PokeeResearchOSS.
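As a hedged illustration of how the RLAIF reward signals might be combined, the sketch below scalarizes the three facets named in the abstract; the key names and weights are assumptions, not the paper's values.

```python
def rlaif_reward(judge_scores, weights=None):
    """Scalarize LLM-judge scores in [0, 1] into one training reward.
    The three facets come from the abstract; weights are illustrative."""
    weights = weights or {"factual_accuracy": 0.4,
                          "citation_faithfulness": 0.3,
                          "instruction_adherence": 0.3}
    return sum(w * judge_scores[k] for k, w in weights.items())

# e.g. a well-grounded but slightly off-instruction response
print(rlaif_reward({"factual_accuracy": 0.9,
                    "citation_faithfulness": 1.0,
                    "instruction_adherence": 0.6}))   # -> 0.84
```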
[265] Flexora: Flexible Low Rank Adaptation for Large Language Models
Chenxing Wei, Yao Shu, Ying Tiffany He, Fei Richard Yu
Main category: cs.AI
TL;DR: Flexora is a method that automatically selects the most important layers for fine-tuning in LLMs to overcome overfitting issues in LoRA and improve performance on downstream tasks.
Details
Motivation: LLMs have knowledge boundaries on specific downstream tasks, and while LoRA helps expand these boundaries, it can underperform due to overfitting on certain tasks.
Method: Flexora frames layer selection as a hyperparameter optimization problem, solves it using unrolled differentiation, and selects the most useful layers based on optimized hyperparameters.
Result: Extensive experiments show Flexora consistently improves over existing baselines across various pretrained models and natural language tasks.
Conclusion: Flexora effectively addresses LoRA’s overfitting issues and enhances performance on downstream tasks through automated layer selection.
Abstract: Large Language Models (LLMs) are driving advancements in artificial intelligence by increasing the scale of model parameters, which has significantly enhanced generalization ability and unlocked new capabilities in practice. However, their performance in specific downstream tasks is usually hindered by their knowledge boundaries on these tasks. Thus, fine-tuning techniques, especially the widely used Low-Rank Adaptation (LoRA) method, have been introduced to expand these boundaries, whereas LoRA can underperform on certain tasks owing to potential overfitting. To overcome this overfitting and improve the performance of LoRA, we propose the flexible low rank adaptation (Flexora) method to automatically and flexibly select the most important layers that need to be fine-tuned to achieve the best performance on different downstream tasks. Specifically, Flexora first frames this layer selection problem as a well-defined hyperparameter optimization (HPO) problem, then addresses it using the unrolled differentiation (UD) method, and finally selects the most useful layers based on the optimized hyperparameters. Our extensive experiments on many pretrained models and natural language tasks show that Flexora is able to consistently improve over the existing baselines, indicating the effectiveness of our Flexora in practice. We additionally provide insightful theoretical results and many ablation studies to deliver a comprehensive understanding of our Flexora.
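A minimal sketch of the layer-selection idea: gate each layer's LoRA update with a learnable scalar that hyperparameter optimization can drive toward zero, then keep only the layers whose gates stay large. This is a plausible reading of the mechanism, not Flexora's actual implementation.

```python
import torch

class GatedLoRALinear(torch.nn.Module):
    """Linear layer whose LoRA update is scaled by a learnable gate alpha.
    After HPO, layers whose gate stays near zero are excluded from fine-tuning.
    A sketch of the layer-selection idea, not Flexora's actual code."""
    def __init__(self, base: torch.nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.alpha = torch.nn.Parameter(torch.zeros(()))   # layer-selection gate

    def forward(self, x):
        # base output plus gated low-rank update: x @ A^T @ B^T
        return self.base(x) + torch.sigmoid(self.alpha) * (x @ self.A.T @ self.B.T)
```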
[266] Beyond Static Assumptions: the Predictive Justified Perspective Model for Epistemic Planning
Guang Hu, Weijia Li, Yangmengfei Xu
Main category: cs.AI
TL;DR: The paper proposes Predictive Justified Perspective (PJP) model to extend Epistemic Planning by removing the static environment assumption, enabling predictions about changing variables using past observations.
Details
Motivation: Current Epistemic Planning methods assume static environments, which limits applications in robotics and multi-agent settings where environments contain changing variables.
Method: Extended the Justified Perspective model to Predictive Justified Perspective model, using past observations to form predictions about changing variables with arbitrary nesting capability.
Result: PJP model performed exceptionally well across various domains compared to JP model, showing improved performance in handling dynamic environments.
Conclusion: PJP model successfully removes the static environment limitation and demonstrates strong potential for improving Epistemic Planning applications in robotics.
Abstract: Epistemic Planning (EP) is an important research area dedicated to reasoning about the knowledge and beliefs of agents in multi-agent cooperative or adversarial settings. The Justified Perspective (JP) model is the state-of-the-art approach to solving EP problems with efficiency and expressiveness. However, all existing EP methods inherit the static environment assumption from classical planning. This limitation hinders the application of EP in fields such as robotics with multi-agent settings, where the environment contains changing variables. In this paper, we propose an extension of the JP model, namely, the Predictive Justified Perspective (PJP) model, to remove this assumption. Instead of assuming that beliefs remain unchanged since the last observation, the PJP model uses all past observations to form predictions about the changing variables. The definition of the prediction function with examples is provided, and it is demonstrated that it can work with arbitrary nesting. We then implemented the PJP model in several well-known domains and compared it with the JP model in the experiments. The results indicated that the PJP model performs exceptionally well across various domains, demonstrating its potential in improving EP applications in robotics.
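A toy sketch of what a PJP-style prediction function could look like, assuming timestamped observations and a linear trend; the paper defines its prediction function more generally.

```python
def pjp_predict(observations, t_now):
    """Toy prediction in the PJP spirit: rather than assuming a variable kept
    its last observed value, extrapolate a trend from the two most recent
    (timestamp, value) observations. The linear trend is an assumption;
    timestamps are assumed distinct."""
    if len(observations) < 2:
        return observations[-1][1]          # fall back to JP-style last value
    (t0, v0), (t1, v1) = observations[-2], observations[-1]
    rate = (v1 - v0) / (t1 - t0)
    return v1 + rate * (t_now - t1)

# e.g. a door seen 20% open at t=1 and 40% open at t=2, predicted at t=4
print(pjp_predict([(1, 0.2), (2, 0.4)], t_now=4))   # -> 0.8
```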
[267] Where Common Knowledge Cannot Be Formed, Common Belief Can – Planning with Multi-Agent Belief Using Group Justified Perspectives
Guang Hu, Tim Miller, Nir Lipovetzky
Main category: cs.AI
TL;DR: The paper extends the Justified Perspective (JP) model to handle group belief (distributed and common belief) in epistemic planning, creating the Group Justified Perspective (GJP) model.
Details
Motivation: Epistemic planning faces exponential growth challenges with nested beliefs in multi-agent settings. Current models like JP handle individual justified beliefs but lack support for group beliefs.
Method: Extends the JP model to incorporate group belief through perspectives and set operations, creating the GJP model that handles distributed and common belief.
Result: Experimental evaluation using adapted benchmarks shows GJP efficiently handles planning problems that other epistemic planning tools cannot solve.
Conclusion: The GJP model provides an efficient and expressive approach for handling group belief in epistemic planning, addressing limitations of existing methods.
Abstract: Epistemic planning is the sub-field of AI planning that focuses on changing knowledge and belief. It is important in multi-agent domains where agents need knowledge/belief regarding not only the environment but also the beliefs of other agents, including nested beliefs. When modeling knowledge in multi-agent settings, many models face an exponential growth challenge in terms of nested depth. A contemporary method, known as Planning with Perspectives (PWP), addresses these challenges through the use of perspectives and set operations for knowledge. The JP model defines that an agent’s belief is justified if and only if the agent has seen evidence that this belief was true in the past and has not seen evidence to suggest that this has changed. The current paper extends the JP model to handle group belief, including distributed belief and common belief. We call this the Group Justified Perspective (GJP) model. Using experimental problems crafted by adapting well-known benchmarks to a group setting, we show the efficiency and expressiveness of our GJP model at handling planning problems that cannot be handled by other epistemic planning tools.
[268] CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment
Rui Feng, Zhiyao Luo, Wei Wang, Yuting Song, Yong Liu, Tingting Zhu, Jianqing Li, Xingyao Wang
Main category: cs.AI
TL;DR: CogBench is the first benchmark for evaluating cross-lingual and cross-site generalizability of LLMs in speech-based cognitive impairment assessment, showing LLMs with chain-of-thought prompting outperform conventional models, and LoRA fine-tuning further improves generalization.
Details
Motivation: Current approaches for automatic cognitive impairment assessment from speech lack generalizability across different languages and clinical settings, limiting their practical utility.
Method: Proposed CogBench benchmark with unified multimodal pipeline, evaluated on three speech datasets (ADReSSo, NCMMSC2021-AD, CIR-E) spanning English and Mandarin. Compared conventional deep learning models vs LLMs with chain-of-thought prompting and LoRA fine-tuning.
Result: Conventional deep learning models degrade substantially when transferred across domains. LLMs with chain-of-thought prompting show better adaptability but remain sensitive to prompt design. LoRA fine-tuning significantly improves generalization in target domains.
Conclusion: These findings offer a critical step toward building clinically useful and linguistically robust speech-based cognitive assessment tools.
Abstract: Automatic assessment of cognitive impairment from spontaneous speech offers a promising, non-invasive avenue for early cognitive screening. However, current approaches often lack generalizability when deployed across different languages and clinical settings, limiting their practical utility. In this study, we propose CogBench, the first benchmark designed to evaluate the cross-lingual and cross-site generalizability of large language models (LLMs) for speech-based cognitive impairment assessment. Using a unified multimodal pipeline, we evaluate model performance on three speech datasets spanning English and Mandarin: ADReSSo, NCMMSC2021-AD, and a newly collected test set, CIR-E. Our results show that conventional deep learning models degrade substantially when transferred across domains. In contrast, LLMs equipped with chain-of-thought prompting demonstrate better adaptability, though their performance remains sensitive to prompt design. Furthermore, we explore lightweight fine-tuning of LLMs via Low-Rank Adaptation (LoRA), which significantly improves generalization in target domains. These findings offer a critical step toward building clinically useful and linguistically robust speech-based cognitive assessment tools.
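For readers who want the LoRA step concrete, here is a minimal fine-tuning setup using the Hugging Face `peft` library; the base model and hyperparameters are assumptions, not those used in CogBench.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base model and hyperparameters; the paper does not pin these down.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()         # typically <1% of all weights
```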
[269] PowerChain: A Verifiable Agentic AI System for Automating Distribution Grid Analyses
Emmanuel O. Badmus, Peng Sang, Dimitrios Stamoulis, Amritanshu Pandey
Main category: cs.AI
TL;DR: PowerChain is an agentic AI system that autonomously performs complex distribution grid analyses by dynamically generating structured context using power systems tools and expert reasoning trajectories, achieving significant performance improvements over baselines.
Details
Motivation: Distribution grid operations are becoming more complex due to electrification and decarbonization, requiring advanced computational analyses that are difficult to automate and scale due to workforce and budget constraints.
Method: PowerChain dynamically generates structured context using supervisory signals from power systems tools like GridLAB-D and an optimized set of expert-annotated reasoning trajectories, enabling generalization to unseen distribution grid analysis tasks.
Result: Empirical results on real utility data show PowerChain achieves up to 144% improvement in performance over baselines for complex distribution grid tasks defined in natural language.
Conclusion: PowerChain addresses the scalability and automation challenges in distribution grid analysis through its agentic approach that generalizes to unseen tasks, providing a practical solution for utilities facing workforce and budget constraints.
Abstract: Rapid electrification and decarbonization are increasing the complexity of distribution grid (DG) operation and planning, necessitating advanced computational analyses to ensure reliability and resilience. These analyses depend on disparate workflows comprising complex models, function calls, and data pipelines that require substantial expert knowledge and remain difficult to automate. Workforce and budget constraints further limit utilities’ ability to apply such analyses at scale. To address this gap, we build an agentic system PowerChain, which is capable of autonomously performing complex grid analyses. Existing agentic AI systems are typically developed in a bottom-up manner with customized context for predefined analysis tasks; therefore, they do not generalize to tasks that the agent has never seen. In comparison, to generalize to unseen DG analysis tasks, PowerChain dynamically generates structured context by leveraging supervisory signals from self-contained power systems tools (e.g., GridLAB-D) and an optimized set of expert-annotated and verified reasoning trajectories. For complex DG tasks defined in natural language, empirical results on real utility data demonstrate that PowerChain achieves up to a 144% improvement in performance over baselines.
[270] VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen
Main category: cs.AI
TL;DR: VerlTool is a unified framework for Agentic Reinforcement Learning with Tool use (ARLT) that addresses fragmentation, synchronous bottlenecks, and limited extensibility in existing approaches through modular design, standardized APIs, and asynchronous execution.
Details
Motivation: Existing ARLT approaches suffer from fragmented task-specific codebases, synchronous execution bottlenecks, and limited extensibility across domains, hindering community adoption and algorithmic innovation.
Method: VerlTool provides: (1) upstream alignment with VeRL, (2) unified tool management via standardized APIs supporting diverse modalities, (3) asynchronous rollout execution for 2x speedup, and (4) modular plugin architecture for rapid tool integration.
Result: Achieved competitive performance across 6 ARLT domains (mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, software engineering) with near 2x speedup from asynchronous execution.
Conclusion: VerlTool provides a scalable, unified foundation for tool-augmented RL research with reduced development overhead and broad extensibility across domains through its modular architecture.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving near 2x speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
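To illustrate what a "lightweight Python definition" for a tool plugin might look like, a hedged sketch follows; the real VerlTool API (see the linked repository) may differ in names and signatures.

```python
# Illustrative shape of a lightweight tool plugin; class and method names
# are assumptions, not the actual VerlTool interface.
import sqlite3

class SQLTool:
    name = "sql"

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)

    def execute(self, query: str) -> str:
        """Run a query and return up to 20 rows as the agent's observation."""
        try:
            rows = self.conn.execute(query).fetchall()
            return str(rows[:20])            # truncate long observations
        except sqlite3.Error as e:
            return f"SQL error: {e}"         # errors become observations too
```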
[271] AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent
Jingru Fan, Yufan Dang, Jingyao Wu, Huatao Li, Runde Yang, Xiyuan Yang, Yuheng Wang, Chen Qian
Main category: cs.AI
TL;DR: AppCopilot is a multimodal, multi-agent mobile agent system that addresses four core challenges in mobile agents: generalization, accuracy, long-horizon capability, and efficiency through an end-to-end pipeline with multimodal foundation models, hierarchical planning, and optimization for resource-constrained devices.
Details
Motivation: To solve fundamental challenges in mobile agents including lack of generalization across tasks/apps/devices, poor accuracy in on-screen interactions, limited long-horizon capabilities, and inefficiency on resource-constrained devices.
Method: End-to-end pipeline with multimodal foundation models, chain-of-thought reasoning, hierarchical task planning, multi-agent collaboration, experiential adaptation, voice interaction, function calling, and cross-app/device orchestration with profiling-driven optimization.
Result: Significant improvements in generalization, precision of on-screen actions, reliable long-horizon task completion, and faster, more resource-efficient runtime.
Conclusion: AppCopilot provides a concrete roadmap and reference architecture for general-purpose mobile agents, closing the loop from data collection to efficient inference with actionable guidance.
Abstract: With the rapid evolution of large language models and multimodal models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that should be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, APPs, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose mobile agent that operates across applications. AppCopilot operationalizes this position through an end-to-end pipeline spanning data collection, training, finetuning, efficient inference, and PC/mobile application. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables experiential adaptation, voice interaction, function calling, cross-APP and cross-device orchestration, and comprehensive mobile APP support. The system design incorporates profiling-driven optimization for latency and memory across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements on four dimensions: stronger generalization, higher precision of on-screen actions, more reliable long-horizon task completion, and faster, more resource-efficient runtime. By articulating a cohesive position and a reference architecture that closes the loop from data collection and training to finetuning and efficient inference, this paper offers a concrete roadmap for general-purpose mobile agents and provides actionable guidance.
[272] FERA: Foil Fencing Referee Assistant Using Pose-Based Multi-Label Move Recognition and Rule Reasoning
Ziwen Chen, Zhong Wang
Main category: cs.AI
TL;DR: FERA is an AI referee prototype for foil fencing that combines pose-based action recognition with rule-based reasoning to address subjective calls and human errors in refereeing.
Details
Motivation: Fencing faces challenges with subjective calls, human errors, bias, and limited referee availability in practice environments, motivating the development of automated assistance.
Method: Extracts 2D joint positions from video, normalizes them, computes 101-dimensional kinematic features, uses Transformer for multi-label move/blade classification, and applies distilled language model with encoded right-of-way rules for decision-making.
Result: Achieved average macro-F1 score of 0.549 in 5-fold cross-validation, outperforming TCN, BiLSTM, and vanilla Transformer baselines with limited hand-labeled data.
Conclusion: While not deployment-ready, FERA demonstrates a promising path toward automated referee assistance in foil fencing and opens opportunities for AI applications like coaching.
Abstract: The sport of fencing, like many other sports, faces challenges in refereeing: subjective calls, human errors, bias, and limited availability in practice environments. We present FERA (Fencing Referee Assistant), a prototype AI referee for foil fencing which integrates pose-based multi-label action recognition and rule-based reasoning. FERA extracts 2D joint positions from video, normalizes them, computes a 101-dimensional kinematic feature set, and applies a Transformer for multi-label move and blade classification. To determine priority and scoring, FERA applies a distilled language model with encoded right-of-way rules, producing both a decision and an explanation for each exchange. With limited hand-labeled data, a 5-fold cross-validation achieves an average macro-F1 score of 0.549, outperforming multiple baselines, including a Temporal Convolutional Network (TCN), BiLSTM, and a vanilla Transformer. While not ready for deployment, these results demonstrate a promising path towards automated referee assistance in foil fencing and new opportunities for AI applications, such as coaching in the field of fencing.
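A toy sketch of kinematic feature extraction from normalized 2D joints, assuming COCO-style keypoint indices; FERA's 101-dimensional feature set is considerably richer than what is shown here.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (radians) formed by 2D points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def kinematic_features(pose_seq, fps=30.0):
    """Toy features from a (T, J, 2) pose sequence: per-joint speeds plus one
    elbow angle. Keypoint ids 5/7/9 (shoulder-elbow-wrist) assume COCO order."""
    velocity = np.diff(pose_seq, axis=0) * fps             # (T-1, J, 2)
    speed = np.linalg.norm(velocity, axis=-1)              # (T-1, J)
    elbow = np.array([joint_angle(f[5], f[7], f[9]) for f in pose_seq])
    return speed, elbow
```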
[273] ACON: Optimizing Context Compression for Long-horizon LLM Agents
Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan
Main category: cs.AI
TL;DR: ACON is a framework that compresses agent context (observations and interaction histories) to reduce memory usage while maintaining task performance, using LLM-based compression guideline optimization and distillation to smaller models.
Details
Motivation: LLM agents in real-world environments face growing context lengths that increase costs and reduce efficiency in long-horizon tasks, while existing context compression methods are limited to single-step tasks or narrow applications.
Method: ACON uses compression guideline optimization in natural language space: LLMs analyze failure cases where full context succeeds but compressed context fails, then update compression guidelines accordingly. It also distills optimized LLM compressors into smaller models.
Result: ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long-horizon agents with up to 46% performance improvement.
Conclusion: ACON provides an effective framework for compressing agent context that significantly reduces memory requirements while maintaining performance, and can be efficiently deployed through distillation to smaller models.
Abstract: Large language models (LLMs) are increasingly deployed as agents in dynamic, real-world environments, where success requires both reasoning and effective tool use. A central challenge for agentic tasks is the growing context length, as agents must accumulate long histories of actions and observations. This expansion raises costs and reduces efficiency in long-horizon tasks, yet prior work on context compression has mostly focused on single-step tasks or narrow applications. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both environment observations and interaction histories into concise yet informative condensations. ACON leverages compression guideline optimization in natural language space: given paired trajectories where full context succeeds but compressed context fails, capable LLMs analyze the causes of failure, and the compression guideline is updated accordingly. Furthermore, we propose distilling the optimized LLM compressor into smaller models to reduce the overhead of the additional module. Experiments on AppWorld, OfficeBench, and Multi-objective QA show that ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long-horizon agents with up to 46% performance improvement. Our code is available at https://github.com/microsoft/acon.
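A minimal sketch of guideline optimization in natural-language space, assuming a generic `llm(prompt) -> str` helper; the prompts and names are illustrative, not ACON's actual implementation.

```python
def optimize_guideline(guideline, failures, llm):
    """failures: cases where the full context succeeded but the compressed
    context failed. An LLM diagnoses what the compression dropped, then
    rewrites the guideline so future compressions keep that information."""
    for case in failures:
        diagnosis = llm(
            "Full-context run succeeded, compressed run failed.\n"
            f"Full context:\n{case['full']}\n"
            f"Compressed context:\n{case['compressed']}\n"
            "What essential information did the compression discard?"
        )
        guideline = llm(
            f"Current compression guideline:\n{guideline}\n"
            f"Observed failure cause:\n{diagnosis}\n"
            "Rewrite the guideline to avoid this failure. Return only the guideline."
        )
    return guideline
```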
[274] Dr. Bias: Social Disparities in AI-Powered Medical Guidance
Emma Kondrup, Anne Imouza
Main category: cs.AI
TL;DR: LLMs show systematic bias in medical advice generation, producing less readable and more complex responses for Indigenous and intersex patients, with amplified disparities for intersectional groups.
Details
Motivation: To investigate how bias may translate into LLM-generated medical advice and impact users from different social groups, particularly in light of increasing public reliance on LLMs for healthcare support.
Method: Exploratory analysis using simulated patient profiles varying in sex, age range, and ethnicity, comparing natural language features of LLM responses to medical questions across key clinical domains.
Result: LLMs generate responses that systematically differ between social groups, with Indigenous and intersex patients receiving advice that is less readable and more complex, and these disparities amplify for intersectional groups.
Conclusion: Urgent need for AI literacy, investigation, and mitigation by AI developers to ensure systemic differences are diminished and do not translate to unjust patient support, given increasing public trust in these models.
Abstract: With the rapid progress of Large Language Models (LLMs), the general public now has easy and affordable access to applications capable of answering most health-related questions in a personalized manner. These LLMs are increasingly proving to be competitive, and now even surpass professionals in some medical capabilities. They hold particular promise in low-resource settings, considering they provide the possibility of widely accessible, quasi-free healthcare support. However, the evaluations that fuel these motivations largely lack insight into the social nature of healthcare, remaining oblivious to health disparities between social groups and to how bias may translate into LLM-generated medical advice and impact users. We provide an exploratory analysis of LLM answers to a series of medical questions spanning key clinical domains, where we simulate these questions being asked by several patient profiles that vary in sex, age range, and ethnicity. By comparing natural language features of the generated responses, we show that, when LLMs are used for medical advice generation, they generate responses that systematically differ between social groups. In particular, Indigenous and intersex patients receive advice that is less readable and more complex. We observe these trends amplify when intersectional groups are considered. Considering the increasing trust individuals place in these models, we argue for higher AI literacy and for the urgent need for investigation and mitigation by AI developers to ensure these systemic differences are diminished and do not translate to unjust patient support. Our code is publicly available on GitHub.
[275] JEDA: Query-Free Clinical Order Search from Ambient Dialogues
Praphul Singh, Corey Barrett, Sumana Srivasta, Amitabh Saikia, Irfan Bulu, Sri Gadde, Krishnaram Kenthapadi
Main category: cs.AI
TL;DR: JEDA is a bi-encoder system that retrieves clinical orders directly from ambient dialogue using joint embedding, eliminating the need for LLM rewriting and enabling real-time ordering with better noise resilience.
Details
Motivation: Current systems rely on LLM rewriting for clinical order retrieval, which introduces latency, instability, and opacity that hinder real-time clinical ordering workflows.
Method: Uses PubMedBERT-initialized bi-encoder with duplicate-safe contrastive objective, trained with constrained LLM guidance to align heterogeneous intent expressions to shared order concepts. Includes query-free mode that encodes rolling dialogue windows.
Result: JEDA achieves large performance gains, substantially outperforms base encoder and recent open embedders, and provides noise-resilient operation by conditioning on dialogue windows rather than single utterances.
Conclusion: JEDA provides a fast, interpretable, LLM-free retrieval layer that effectively links ambient clinical context to actionable orders in real time, overcoming limitations of LLM-based approaches.
Abstract: Clinical conversations mix explicit directives (order a chest X-ray) with implicit reasoning (the cough worsened overnight, we should check for pneumonia). Many systems rely on LLM rewriting, adding latency, instability, and opacity that hinder real-time ordering. We present JEDA (Joint Embedding for Direct and Ambient clinical orders), a domain-initialized bi-encoder that retrieves canonical orders directly and, in a query-free mode, encodes a short rolling window of ambient dialogue to trigger retrieval. Initialized from PubMedBERT and fine-tuned with a duplicate-safe contrastive objective, JEDA aligns heterogeneous expressions of intent to shared order concepts. Training uses constrained LLM guidance to tie each signed order to complementary formulations (command only, context only, command+context, context+reasoning), producing clearer inter-order separation, tighter query-order coupling, and stronger generalization. The query-free mode is noise-resilient, reducing sensitivity to disfluencies and ASR errors by conditioning on a short window rather than a single utterance. Deployed in practice, JEDA yields large gains and substantially outperforms its base encoder and recent open embedders (Linq Embed Mistral, SFR Embedding, GTE Qwen, BGE large, Embedding Gemma). The result is a fast, interpretable, LLM-free retrieval layer that links ambient context to actionable clinical orders in real time.
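A minimal sketch of query-free retrieval over a rolling dialogue window, with `encode` standing in for the JEDA bi-encoder; the window size and normalization choices are assumptions.

```python
import numpy as np

def retrieve_orders(utterances, order_embeddings, order_names, encode, top_k=3):
    """Query-free retrieval sketch: embed a short rolling window of dialogue
    and rank canonical orders by cosine similarity. `encode` stands in for a
    bi-encoder (e.g. PubMedBERT-based); order_embeddings rows are assumed
    pre-normalized to unit length."""
    query = encode(" ".join(utterances[-5:]))      # short rolling window
    query = query / np.linalg.norm(query)
    sims = order_embeddings @ query                # cosine via dot product
    best = np.argsort(-sims)[:top_k]
    return [(order_names[i], float(sims[i])) for i in best]
```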
[276] Where to Search: Measure the Prior-Structured Search Space of LLM Agents
Zhuo-Yang Song
Main category: cs.AI
TL;DR: A formal theory for measuring LLM-assisted iterative search with domain priors, representing agents as fuzzy relation operators constrained by safety envelopes, and analyzing reachability through coverage generating functions.
Details
Motivation: To systematically encode domain priors into structured hypothesis spaces for LLM-based iterative search, addressing the challenge of where to search effectively in AI+Science applications.
Method: Proposes a compact formal theory representing agents as fuzzy relation operators constrained by safety envelopes, using coverage generating functions to weight reachable paths and measure reachability difficulty, with geometric interpretation of search graphs.
Result: Developed a workable language and operational tools to measure agents and their search spaces, providing testable inferences validated via majority-vote instantiation.
Conclusion: The theory offers a systematic formal description of iterative search constructed by LLMs, enabling better measurement and understanding of LLM-assisted search processes guided by domain priors.
Abstract: The generate-filter-refine (iterative) paradigm based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via a majority-vote instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
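The abstract does not reproduce the formulas; one plausible reading of the coverage generating function and the induced difficulty measure, stated here as an assumption rather than the paper's exact notation, is:

```latex
% Plausible reconstruction: sum over safety-envelope paths \pi from x to y,
% weighted by a continuation parameter z, yielding a coverage generating
% function G and an induced reachability difficulty D at a fixed z_0.
G_{x \to y}(z) = \sum_{\pi :\, x \to y} z^{|\pi|},
\qquad
D(x \to y) = -\log G_{x \to y}(z_0), \quad z_0 \in (0, 1)
```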
cs.SD
[277] SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models
Rachmad Vidya Wicaksana Putra, Aadithyan Rajesh Nair, Muhammad Shafique
Main category: cs.SD
TL;DR: SpikeVox is an energy-efficient speech therapy framework that uses spike-driven generative language models for accurate speech disorder detection and therapy recommendations.
Details
Motivation: Address the global shortage of speech therapy solutions by providing an accessible, cost-effective alternative that offers complete feedback, unlike existing neural network methods that only detect disorders without therapy recommendations.
Method: Employs speech recognition for speech-to-text conversion, spike-driven generative language model for efficient pattern analysis and therapy exercise generation, pronunciation guidance feedback, and REST API for user interaction.
Result: Achieves 88% confidence level in speech disorder recognition and provides complete therapy exercise feedback.
Conclusion: SpikeVox offers a comprehensive, energy-efficient solution that can potentially address the global speech therapy access gap.
Abstract: Speech disorders can significantly affect a patient’s capability to communicate, learn, and socialize. However, existing speech therapy solutions (e.g., therapists or tools) are still limited and costly, hence they remain inadequate for serving millions of patients worldwide. To address this, state-of-the-art methods employ neural network (NN) algorithms to help accurately detect speech disorders. However, these methods do not provide therapy recommendations as feedback, hence offering only a partial solution for patients. Moreover, these methods incur high energy consumption due to their complex and resource-intensive NN processing, hence hindering their deployment on low-power/energy platforms (e.g., smartphones). Toward this, we propose SpikeVox, a novel framework for enabling energy-efficient speech therapy solutions through a spike-driven generative language model. Specifically, SpikeVox employs a speech recognition module to perform highly accurate speech-to-text conversion; leverages a spike-driven generative language model to efficiently perform pattern analysis for speech disorder detection and generate suitable exercises for therapy; provides guidance on correct pronunciation as feedback; as well as utilizes the REST API to enable seamless interaction for users. Experimental results demonstrate that SpikeVox achieves 88% confidence level on average in speech disorder recognition, while providing complete feedback for therapy exercises. Therefore, SpikeVox provides a comprehensive framework for energy-efficient speech therapy solutions, and potentially addresses the significant global speech therapy access gap.
[278] BandCondiNet: Parallel Transformers-based Conditional Popular Music Generation with Multi-View Features
Jing Luo, Xinyu Yang, Dorien Herremans
Main category: cs.SD
TL;DR: BandCondiNet is a conditional music generation model using parallel Transformers that addresses challenges in multitrack popular song generation through multi-view features, Structure Enhanced Attention, and Cross-Track Transformer modules.
Details
Motivation: To overcome three main challenges in conditional music generation for multitrack popular songs: insufficient input condition fidelity, poor structural modeling, and inadequate inter-track harmony learning.
Method: Uses parallel Transformers with multi-view features across time and instruments as conditions, Structure Enhanced Attention (SEA) for musical structure, and Cross-Track Transformer (CTT) for inter-track harmony.
Result: Outperforms other conditional models in 9/10 metrics on shorter dataset and all 10 metrics on longer dataset. Subjective evaluations show best performance in Richness and comparable or superior performance across all criteria.
Conclusion: BandCondiNet effectively addresses the challenges in conditional multitrack music generation and future work should focus on adapting to more user-friendly inputs and flexible instrumentation.
Abstract: Conditional music generation offers significant advantages in terms of user convenience and control, presenting great potential in AI-generated content research. However, building conditional generative systems for multitrack popular songs presents three primary challenges: insufficient fidelity of input conditions, poor structural modeling, and inadequate inter-track harmony learning in generative models. To address these issues, we propose BandCondiNet, a conditional model based on parallel Transformers, designed to process the multiple music sequences and generate high-quality multitrack samples. Specifically, we propose multi-view features across time and instruments as high-fidelity conditions. Moreover, we propose two specialized modules for BandCondiNet: Structure Enhanced Attention (SEA) to strengthen the musical structure, and Cross-Track Transformer (CTT) to enhance inter-track harmony. We conducted both objective and subjective evaluations on two popular music datasets with different sequence lengths. Objective results on the shorter dataset show that BandCondiNet outperforms other conditional models in 9 out of 10 metrics related to fidelity and inference speed, with the exception of Chord Accuracy. On the longer dataset, BandCondiNet surpasses all conditional models across all 10 metrics. Subjective evaluations across four criteria reveal that BandCondiNet trained on the shorter dataset performs best in Richness and performs comparably to state-of-the-art models in the other three criteria, while significantly outperforming them across all criteria when trained on the longer dataset. To further expand the application scope of BandCondiNet, future work should focus on developing an advanced conditional model capable of adapting to more user-friendly input conditions and supporting flexible instrumentation.
[279] Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas
Main category: cs.SD
TL;DR: ST-ITO is improved by adding a Gaussian prior from the DiffVox dataset, making it equivalent to maximum-a-posteriori estimation. This calibrated approach significantly outperforms baselines in vocal effects transfer.
Details
Motivation: Original ST-ITO treats all parameter configurations equally and relies only on embedding space, leading to unrealistic configurations and biased outcomes.
Method: Introduce Gaussian prior from DiffVox vocal preset dataset over parameter space, making optimization equivalent to maximum-a-posteriori estimation.
Result: 33% reduction in parameter MSE, closer style matching to reference, and subjective evaluations with 16 participants confirm superiority in limited data regimes.
Conclusion: Incorporating prior knowledge at inference time enhances audio effects transfer, enabling more effective and realistic audio processing systems.
Abstract: Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to an audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can result in unrealistic configurations or biased outcomes. We address this pitfall by introducing a Gaussian prior derived from the DiffVox vocal preset dataset over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces the parameter mean squared error by up to 33% and more closely matches the reference style. Subjective evaluations with 16 participants confirm the superiority of our method in limited data regimes. This work demonstrates how incorporating prior knowledge at inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
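In symbols, the calibrated objective amounts to MAP estimation; the formulation below is a plausible reconstruction from the abstract, with every symbol defined in the comments rather than taken from the paper.

```latex
% Plausible reconstruction of the calibrated objective (symbols assumed):
% \phi = style-embedding network, f_\theta = vocal effects chain with
% parameters \theta, d = embedding distance, and (\mu, \Sigma) = Gaussian
% fitted to the DiffVox preset dataset. Minimizing distance plus the
% negative Gaussian log-prior is equivalent to MAP estimation of \theta.
\theta^{*} = \arg\min_{\theta}\;
  d\big(\phi(f_{\theta}(x)),\, \phi(x_{\mathrm{ref}})\big)
  + \tfrac{1}{2}\,(\theta - \mu)^{\top} \Sigma^{-1} (\theta - \mu)
```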
[280] DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Main category: cs.SD
TL;DR: DiEmo-TTS: A self-supervised distillation method for cross-speaker emotion transfer that uses cluster-driven sampling and information perturbation to extract speaker-independent emotion embeddings while preserving speaker identity.
Details
Motivation: Existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality in cross-speaker emotion transfer.
Method: Proposed DiEmo-TTS with self-supervised distillation, cluster-driven sampling, information perturbation, emotion clustering/matching using emotional attribute prediction and speaker embeddings, and dual conditioning transformer for style integration.
Result: Experimental results confirm the method’s effectiveness in learning speaker-irrelevant emotion embeddings while maintaining synthesis quality.
Conclusion: The proposed approach successfully addresses speaker leakage issues in cross-speaker emotion transfer and enables better emotion modeling without retaining speaker traits.
Abstract: Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distillation method to minimize emotional information loss and preserve speaker identity. We introduce cluster-driven sampling and information perturbation to preserve emotion while removing irrelevant factors. To facilitate this process, we propose an emotion clustering and matching approach using emotional attribute prediction and speaker embeddings, enabling generalization to unlabeled data. Additionally, we designed a dual conditioning transformer to integrate style features better. Experimental results confirm the effectiveness of our method in learning speaker-irrelevant emotion embeddings.
[281] EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Main category: cs.SD
TL;DR: EmoSphere-SER is a joint model that integrates spherical VAD region classification to guide VAD regression for improved speech emotion recognition.
Details
Motivation: To improve emotion prediction in speech emotion recognition by providing structured guidance to VAD regression through spherical region classification.Method: Transform VAD values into spherical coordinates divided into multiple regions, use auxiliary classification to predict spherical regions, incorporate dynamic weighting and style pooling with multi-head self-attention to capture spectral and temporal dynamics.
Result: The approach exceeds baseline methods, confirming the validity of the proposed framework.
Conclusion: The combined training strategy with spherical region classification reinforces structured learning and improves prediction consistency in speech emotion recognition.
Abstract: Speech emotion recognition predicts a speaker’s emotional state from speech signals using discrete labels or continuous dimensions such as arousal, valence, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates that are divided into multiple spherical regions, and an auxiliary classification task predicts which spherical region each point belongs to, guiding the regression process. Additionally, we incorporate a dynamic weighting scheme and a style pooling layer with multi-head self-attention to capture spectral and temporal dynamics, further boosting performance. This combined training strategy reinforces structured learning and improves prediction consistency. Experimental results show that our approach exceeds baseline methods, confirming the validity of the proposed framework.
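A minimal sketch of the coordinate transform behind the auxiliary task: map a VAD point to spherical coordinates, then quantise the angles into region labels. The centring of the VAD cube and the 4x8 angular grid are our assumptions; the paper's exact partition may differ.

```python
import numpy as np

def vad_to_spherical(vad, center=(0.5, 0.5, 0.5)):
    """Map a (valence, arousal, dominance) point to (r, theta, phi)."""
    x, y, z = np.asarray(vad, dtype=float) - np.asarray(center)
    r = np.sqrt(x**2 + y**2 + z**2)
    theta = np.arccos(z / r) if r > 0 else 0.0   # polar angle in [0, pi]
    phi = np.arctan2(y, x)                       # azimuth in (-pi, pi]
    return r, theta, phi

def spherical_region(theta, phi, n_theta=4, n_phi=8):
    """Quantise the angles into one of n_theta * n_phi class labels."""
    t = min(int(theta / np.pi * n_theta), n_theta - 1)
    p = min(int((phi + np.pi) / (2 * np.pi) * n_phi), n_phi - 1)
    return t * n_phi + p

r, theta, phi = vad_to_spherical((0.9, 0.7, 0.4))
print("auxiliary region label:", spherical_region(theta, phi))
```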
[282] Benchmarking Fake Voice Detection in the Fake Voice Generation Arms Race
Xutao Mao, Ke Li, Cameron Baird, Ezra Xuanru Tao, Dan Lin
Main category: cs.SD
TL;DR: The paper introduces the first ecosystem-level benchmark for fake voice detection that systematically evaluates 17 fake voice generators against 8 detectors using one-to-one evaluation, revealing hidden vulnerabilities and generalization gaps in current detection systems.
Details
Motivation: Existing benchmarks aggregate diverse fake voice samples into single datasets, masking method-specific artifacts and obscuring detector performance variations across different generation paradigms, preventing nuanced understanding of true vulnerabilities.Method: Proposed a novel one-to-one evaluation protocol between 17 state-of-the-art fake voice generators and 8 leading detectors, with unified scoring systems to quantify generator evasiveness and detector robustness for fair comparisons.
Result: Modern generators, especially neural audio codecs and flow matching models, consistently evade top detectors. No single detector is universally robust - effectiveness varies dramatically by generator architecture, revealing significant generalization gaps.
Conclusion: This work provides realistic threat landscape assessment and actionable insights for next-generation detection systems, highlighting the need for more robust and generalized detection approaches.
Abstract: The rapid advancement of fake voice generation technology has ignited a race with detection systems, creating an urgent need to secure the audio ecosystem. However, existing benchmarks suffer from a critical limitation: they typically aggregate diverse fake voice samples into a single dataset for evaluation. This practice masks method-specific artifacts and obscures the varying performance of detectors against different generation paradigms, preventing a nuanced understanding of their true vulnerabilities. To address this gap, we introduce the first ecosystem-level benchmark that systematically evaluates the interplay between 17 state-of-the-art fake voice generators and 8 leading detectors through a novel one-to-one evaluation protocol. This fine-grained analysis exposes previously hidden vulnerabilities and sensitivities that are missed by traditional aggregated testing. We also propose unified scoring systems to quantify both the evasiveness of generators and the robustness of detectors, enabling fair and direct comparisons. Our extensive cross-domain evaluation reveals that modern generators, particularly those based on neural audio codecs and flow matching, consistently evade top-tier detectors. We found that no single detector is universally robust; their effectiveness varies dramatically depending on the generator’s architecture, highlighting a significant generalization gap in current defenses. This work provides a more realistic assessment of the threat landscape and offers actionable insights for building the next generation of detection systems.
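The abstract does not spell out the unified scoring systems, but the simplest version of the idea can be read directly off a pairwise detection-rate matrix. A hypothetical sketch (our simplification, not necessarily the paper's exact scoring):

```python
import numpy as np

# Hypothetical pairwise results: rows are generators, columns detectors,
# entries the detection rate of that detector on that generator's fakes.
rng = np.random.default_rng(0)
det_rate = rng.uniform(size=(17, 8))

# One natural pair of one-number summaries:
generator_evasiveness = 1.0 - det_rate.mean(axis=1)  # higher = harder to catch
detector_robustness = det_rate.min(axis=0)           # worst-case generator

print(generator_evasiveness.round(2))
print(detector_robustness.round(2))
```

Scoring detector robustness by its worst-case generator, rather than the mean, reflects the paper's observation that effectiveness varies dramatically with generator architecture.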
[283] MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations
Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao
Main category: cs.SD
TL;DR: MRSAudio is a large-scale multimodal spatial audio dataset with binaural/ambisonic audio, video, motion data, and annotations to advance spatial audio research across diverse real-world scenarios.
Details
Motivation: Most existing multimodal datasets provide only monaural audio, limiting development of spatial audio generation and understanding for immersive technologies like VR/AR.Method: Created MRSAudio dataset with four components (MRSLife, MRSSpeech, MRSMusic, MRSSing) containing synchronized binaural/ambisonic audio, video, motion trajectories, and fine-grained annotations including transcripts, phoneme boundaries, lyrics, scores, and prompts.
Result: Established five foundational tasks showing MRSAudio enables high-quality spatial modeling and supports broad spatial audio research.
Conclusion: MRSAudio addresses limitations of existing datasets and provides comprehensive resources for advancing spatial audio understanding and generation across diverse applications.
Abstract: Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.
[284] Beat Tracking as Object Detection
Jaehoon Ahn, Moon-Ryul Jung
Main category: cs.SD
TL;DR: The paper proposes reframing beat and downbeat tracking as an object detection problem, adapting the FCOS detector from computer vision to 1D audio with a modified backbone and Feature Pyramid Network.
Details
Motivation: Traditional beat tracking models output frame-level activations, but the authors want to model beats and downbeats as temporal objects rather than frame-level predictions.Method: Adapt FCOS object detector to 1D audio by replacing its backbone with WaveBeat’s temporal feature extractor, adding Feature Pyramid Network for multi-scale patterns, and using non-maximum suppression for final predictions.
Result: The approach achieves competitive results on standard music datasets, demonstrating that object detection techniques can effectively model musical beats with minimal adaptation.
Conclusion: Object detection frameworks provide a viable alternative to traditional beat tracking methods, offering simpler and less heuristic approaches compared to methods like Dynamic Bayesian Networks.
Abstract: Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal “objects.” Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat’s temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.
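The NMS step that replaces DBN post-processing is standard greedy interval suppression; in 1D it takes a few lines. A sketch under the usual IoU definition (the threshold value is illustrative):

```python
# Greedy 1-D non-maximum suppression over (start, end) beat intervals.
def nms_1d(intervals, scores, iou_thresh=0.5):
    order = sorted(range(len(intervals)), key=lambda i: scores[i],
                   reverse=True)
    keep = []
    for i in order:
        s1, e1 = intervals[i]
        suppressed = False
        for j in keep:
            s2, e2 = intervals[j]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return keep

# Two overlapping beat candidates near 0.5 s collapse to the stronger one.
beats = [(0.48, 0.56), (0.50, 0.58), (1.00, 1.08)]
print(nms_1d(beats, [0.9, 0.7, 0.8]))   # -> [0, 2]
```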
cs.LG
[285] Decentralizing Multi-Agent Reinforcement Learning with Temporal Causal Information
Jan Corazza, Hadi Partovi Aria, Hyohun Kim, Daniel Neider, Zhe Xu
Main category: cs.LG
TL;DR: This paper studies how high-level symbolic knowledge can help address challenges in Decentralized Multi-Agent Reinforcement Learning (DMARL), including privacy constraints, communication limitations, and performance concerns.
Details
Motivation: Many real-world problems require multiple agents to collaborate for a common goal, but DMARL faces challenges with policy compatibility, privacy, communication, and performance when agents learn independently.Method: The authors extend formal tools for checking compatibility of local policies with team tasks and incorporate high-level symbolic knowledge about temporal evolution of events in the environment.
Result: The extended formal tools make decentralized training with theoretical guarantees usable in more scenarios, and symbolic knowledge about temporal events significantly expedites the learning process in DMARL.
Conclusion: Providing high-level symbolic knowledge to agents effectively addresses unique challenges in DMARL and accelerates the learning process while maintaining theoretical guarantees.
Abstract: Reinforcement learning (RL) algorithms can find an optimal policy for a single agent to accomplish a particular task. However, many real-world problems require multiple agents to collaborate in order to achieve a common goal. For example, a robot executing a task in a warehouse may require the assistance of a drone to retrieve items from high shelves. In Decentralized Multi-Agent RL (DMARL), agents learn independently and then combine their policies at execution time, but often must satisfy constraints on compatibility of local policies to ensure that they can achieve the global task when combined. In this paper, we study how providing high-level symbolic knowledge to agents can help address unique challenges of this setting, such as privacy constraints, communication limitations, and performance concerns. In particular, we extend the formal tools used to check the compatibility of local policies with the team task, making decentralized training with theoretical guarantees usable in more scenarios. Furthermore, we empirically demonstrate that symbolic knowledge about the temporal evolution of events in the environment can significantly expedite the learning process in DMARL.
[286] Extending Load Forecasting from Zonal Aggregates to Individual Nodes for Transmission System Operators
Oskar Triebe, Fletcher Passow, Simon Wittner, Leonie Wagner, Julio Arend, Tao Sun, Chad Zanocco, Marek Miltner, Arezou Ghesmati, Chen-Hao Tsai, Christoph Bergmeir, Ram Rajagopal
Main category: cs.LG
TL;DR: A multi-level forecasting system for power grid operators that improves accuracy and interpretability of both zonal and nodal load forecasts, addressing challenges from sustainable energy developments.
Details
Motivation: Sustainable energy developments increase electric load uncertainty, requiring higher spatial resolution forecasts from zonal aggregates to individual nodes. Nodal loads are harder to forecast accurately and require managing many individual forecasts, creating operational challenges for control room operators.Method: Developed an interpretable and scalable forecasting model using extensive zonal and nodal net load data. The system includes solutions for handling nodal load heterogeneity and volatility, with a fully parallelized single-model forecasting workflow that allows gradual extension from zonal to nodal operations.
Result: Showed accuracy and interpretability improvements for zonal forecasts, and substantial improvements for nodal forecasts. The system enables operators to adjust forecasts with unprecedented confidence and accuracy, and precisely diagnose previously opaque errors.
Conclusion: The multi-level forecasting system successfully addresses the challenges of nodal load forecasting, providing operators with reliable, interpretable forecasts that improve grid reliability management in the face of sustainable energy uncertainties.
Abstract: The reliability of local power grid infrastructure is challenged by sustainable energy developments increasing electric load uncertainty. Transmission System Operators (TSOs) need load forecasts of higher spatial resolution, extending current forecasting operations from zonal aggregates to individual nodes. However, nodal loads are less accurate to forecast and require a large number of individual forecasts, which are hard to manage for the human experts assessing risks in the control room’s daily operations (operator). In collaboration with a TSO, we design a multi-level system that meets the needs of operators for hourly day-ahead load forecasting. Utilizing a uniquely extensive dataset of zonal and nodal net loads, we experimentally evaluate our system components. First, we develop an interpretable and scalable forecasting model that allows for TSOs to gradually extend zonal operations to include nodal forecasts. Second, we evaluate solutions to address the heterogeneity and volatility of nodal load, subject to a trade-off. Third, our system is manageable with a fully parallelized single-model forecasting workflow. Our results show accuracy and interpretability improvements for zonal forecasts, and substantial improvements for nodal forecasts. In practice, our multi-level forecasting system allows operators to adjust forecasts with unprecedented confidence and accuracy, and to diagnose otherwise opaque errors precisely.
[287] TangledFeatures: Robust Feature Selection in Highly Correlated Spaces
Allen Daniel Sunny
Main category: cs.LG
TL;DR: TangledFeatures is a feature selection framework that identifies representative features from correlated predictor groups, reducing redundancy while maintaining explanatory power for more interpretable and stable analysis.
Details
Motivation: Most feature selection methods focus on predictive accuracy but degrade with correlated predictors, creating a need for methods that handle correlated feature spaces effectively.Method: The framework identifies representative features from groups of entangled predictors, selecting a subset that reduces redundancy while retaining explanatory power for downstream models.
Result: Applied to Alanine Dipeptide, TangledFeatures successfully selected structurally meaningful intra-atomic distances that explain variation in backbone torsional angles.
Conclusion: TangledFeatures provides a more interpretable and stable basis for feature selection in correlated feature spaces compared to traditional techniques.
Abstract: Feature selection is a fundamental step in model development, shaping both predictive performance and interpretability. Yet, most widely used methods focus on predictive accuracy, and their performance degrades in the presence of correlated predictors. To address this gap, we introduce TangledFeatures, a framework for feature selection in correlated feature spaces. It identifies representative features from groups of entangled predictors, reducing redundancy while retaining explanatory power. The resulting feature subset can be directly applied in downstream models, offering a more interpretable and stable basis for analysis compared to traditional selection techniques. We demonstrate the effectiveness of TangledFeatures on Alanine Dipeptide, applying it to the prediction of backbone torsional angles and showing that the selected features correspond to structurally meaningful intra-atomic distances that explain variation in these angles.
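The abstract does not give the selection rule, so as a rough illustration of picking one representative per group of entangled predictors, here is a correlation-threshold grouping with a highest-variance representative (our choice of criterion, purely for intuition):

```python
import numpy as np

def select_representatives(X, corr_thresh=0.9):
    """Group features with |correlation| > corr_thresh; keep the
    highest-variance member of each group as its representative."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned, reps = set(range(X.shape[1])), []
    while unassigned:
        i = unassigned.pop()
        group = {i} | {j for j in unassigned if corr[i, j] > corr_thresh}
        unassigned -= group
        reps.append(max(group, key=lambda j: X[:, j].var()))
    return sorted(reps)

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Feature 3 is an almost-exact copy of feature 0 (an entangled pair).
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=200)])
print(select_representatives(X))   # three representatives, not four
```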
[288] ES-C51: Expected Sarsa Based C51 Distributional Reinforcement Learning Algorithm
Rijul Tandon, Peter Vamplew, Cameron Foale
Main category: cs.LG
TL;DR: ES-C51 modifies C51 by replacing greedy Q-learning with Expected Sarsa update using softmax, improving stability and performance when actions have similar expected rewards but different distributions.
Details
Motivation: Standard C51's greedy Bellman update can cause instability when multiple actions have similar expected rewards but different distributions, preventing stable distribution learning.Method: Replaced greedy Q-learning update in C51 with Expected Sarsa update using softmax calculation that combines information from all possible actions rather than relying on a single best action.
Result: ES-C51 outperforms QL-C51 (modified C51 with softmax exploration) across many classic control environments and Atari-10 games.
Conclusion: Using Expected Sarsa update instead of greedy Q-learning in distributional RL improves stability and performance when dealing with actions having similar expected rewards.
Abstract: In most value-based reinforcement learning (RL) algorithms, the agent estimates only the expected reward for each action and selects the action with the highest reward. In contrast, Distributional Reinforcement Learning (DRL) estimates the entire probability distribution of possible rewards, providing richer information about uncertainty and variability. C51 is a popular DRL algorithm for discrete action spaces. It uses a Q-learning approach, where the distribution is learned using a greedy Bellman update. However, this can cause problems if multiple actions at a state have similar expected rewards but different distributions, as the algorithm may not learn a stable distribution. This study presents a modified version of C51 (ES-C51) that replaces the greedy Q-learning update with an Expected Sarsa update, which uses a softmax calculation to combine information from all possible actions at a state rather than relying on a single best action. This reduces instability when actions have similar expected rewards and allows the agent to learn higher-performing policies. This approach is evaluated on classic control environments from Gym and Atari-10 games. For a fair comparison, we modify the standard C51’s exploration strategy from ε-greedy to softmax, which we refer to as QL-C51 (Q-Learning based C51). The results demonstrate that ES-C51 outperforms QL-C51 across many environments.
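The core change is in the bootstrap target. A minimal numpy sketch of the Expected Sarsa target over categorical return distributions: a softmax-weighted mixture of the next-state distributions instead of the single greedy action's distribution. The full C51 update would additionally apply the distributional Bellman projection onto the fixed support, omitted here.

```python
import numpy as np

def softmax(x, tau=1.0):
    z = (x - x.max()) / tau
    e = np.exp(z)
    return e / e.sum()

def expected_sarsa_target(next_dists, support, tau=1.0):
    """next_dists: (n_actions, n_atoms) categorical PMFs over `support`."""
    q_values = next_dists @ support        # expected return per action
    weights = softmax(q_values, tau)
    return weights @ next_dists            # mixture over all actions

support = np.linspace(-10.0, 10.0, 51)
next_dists = np.random.dirichlet(np.ones(51), size=4)   # 4 actions
target = expected_sarsa_target(next_dists, support)
print(target.sum())   # still a valid distribution: sums to 1
```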
[289] Hybrid Autoencoder-Based Framework for Early Fault Detection in Wind Turbines
Rekha R Nair, Tina Babu, Alavikunhu Panthakkan, Balamurugan Balusamy, Wathiq Mansoor
Main category: cs.LG
TL;DR: Novel ensemble deep learning framework for unsupervised anomaly detection in wind turbines using VAEs, LSTM Autoencoders, and Transformers on SCADA data, achieving high AUC-ROC and early fault detection.
Details
Motivation: Wind turbine reliability is critical for renewable energy sector growth, where early fault detection reduces downtime and maintenance costs significantly.Method: Integrates Variational Autoencoders, LSTM Autoencoders, and Transformer architectures with a feature engineering pipeline that extracts temporal, statistical, and frequency-domain indicators from SCADA data. Uses ensemble scoring and adaptive thresholding for unsupervised anomaly detection.
Result: Achieves AUC-ROC of 0.947 and early fault detection up to 48 hours prior to failure on CARE dataset containing 89 years of real-world turbine data across three wind farms.
Conclusion: The approach enables predictive maintenance, reduces turbine failures, and enhances operational efficiency in large-scale wind energy deployments, offering significant societal value.
Abstract: Wind turbine reliability is critical to the growing renewable energy sector, where early fault detection significantly reduces downtime and maintenance costs. This paper introduces a novel ensemble-based deep learning framework for unsupervised anomaly detection in wind turbines. The method integrates Variational Autoencoders (VAE), LSTM Autoencoders, and Transformer architectures, each capturing different temporal and contextual patterns from high-dimensional SCADA data. A unique feature engineering pipeline extracts temporal, statistical, and frequency-domain indicators, which are then processed by the deep models. Ensemble scoring combines model predictions, followed by adaptive thresholding to detect operational anomalies without requiring labeled fault data. Evaluated on the CARE dataset containing 89 years of real-world turbine data across three wind farms, the proposed method achieves an AUC-ROC of 0.947 and early fault detection up to 48 hours prior to failure. This approach offers significant societal value by enabling predictive maintenance, reducing turbine failures, and enhancing operational efficiency in large-scale wind energy deployments.
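As a concrete, simplified view of the ensemble scoring and adaptive thresholding stages: the per-model reconstruction errors below stand in for the VAE / LSTM-AE / Transformer outputs, and the rolling mean-plus-k-sigma rule is our assumption, since the abstract does not specify the threshold.

```python
import numpy as np

def ensemble_anomaly_flags(errors, window=144, k=3.0):
    """errors: dict of equal-length per-model reconstruction-error series."""
    stacked = np.stack([(e - e.mean()) / (e.std() + 1e-8)
                        for e in errors.values()])
    score = stacked.mean(axis=0)               # ensemble anomaly score
    flags = np.zeros(len(score), dtype=bool)
    for t in range(window, len(score)):
        ref = score[t - window:t]
        flags[t] = score[t] > ref.mean() + k * ref.std()  # adaptive threshold
    return score, flags

rng = np.random.default_rng(0)
errs = {m: rng.gamma(2.0, 1.0, 2000) for m in ("vae", "lstm_ae", "tfm")}
errs["vae"][1500:] += 6.0                      # inject a drift/fault
score, flags = ensemble_anomaly_flags(errs)
print("first flagged index:", np.argmax(flags))
```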
[290] An Empirical Study on MC Dropout–Based Uncertainty–Error Correlation in 2D Brain Tumor Segmentation
Saumya B
Main category: cs.LG
TL;DR: MC Dropout uncertainty shows weak correlation with segmentation errors in brain tumor MRI, especially near boundaries, suggesting limited utility for error localization.
Details
Motivation: To evaluate whether MC Dropout-based uncertainty effectively identifies segmentation errors in brain tumor MRI, particularly near tumor boundaries where accuracy is crucial for diagnosis and treatment planning.Method: Used U-Net for 2D brain tumor MRI segmentation with four augmentation settings (none, horizontal flip, rotation, scaling), computed uncertainty from 50 stochastic forward passes, and correlated with pixel-wise errors using Pearson and Spearman coefficients.
Result: Found weak global correlations (r ≈ 0.30-0.38) and negligible boundary correlations (|r| < 0.05). Statistical significance across augmentations (p < 0.001) but lacked practical relevance.
Conclusion: MC Dropout uncertainty provides limited cues for boundary error localization, highlighting the need for alternative or hybrid uncertainty estimation methods in medical image segmentation.
Abstract: Accurate brain tumor segmentation from MRI is vital for diagnosis and treatment planning. Although Monte Carlo (MC) Dropout is widely used to estimate model uncertainty, its effectiveness in identifying segmentation errors – especially near tumor boundaries – remains unclear. This study empirically examines the relationship between MC Dropout–based uncertainty and segmentation error in 2D brain tumor MRI segmentation using a U-Net trained under four augmentation settings: none, horizontal flip, rotation, and scaling. Uncertainty was computed from 50 stochastic forward passes and correlated with pixel-wise errors using Pearson and Spearman coefficients. Results show weak global correlations ($r \approx 0.30$–$0.38$) and negligible boundary correlations ($|r| < 0.05$). Although differences across augmentations were statistically significant ($p < 0.001$), they lacked practical relevance. These findings suggest that MC Dropout uncertainty provides limited cues for boundary error localization, underscoring the need for alternative or hybrid uncertainty estimation methods in medical image segmentation.
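The measurement procedure is easy to reproduce in miniature: keep dropout active at inference, take the per-pixel variance over stochastic passes as uncertainty, and correlate it with the error map. A toy sketch with random data and a small stand-in network (the study uses a U-Net):

```python
import torch
import torch.nn as nn
from scipy.stats import pearsonr

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(0.5),                     # the MC Dropout layer
    nn.Conv2d(8, 1, 3, padding=1),
)
x = torch.randn(1, 1, 64, 64)              # one toy MRI slice
gt = (torch.rand(1, 1, 64, 64) > 0.5).float()

model.train()                              # keeps dropout stochastic
with torch.no_grad():
    probs = torch.stack([torch.sigmoid(model(x)) for _ in range(50)])
mean_pred, uncertainty = probs.mean(0), probs.var(0)

err = (mean_pred.round() != gt).float()    # pixel-wise error map
r, p = pearsonr(uncertainty.flatten().numpy(), err.flatten().numpy())
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```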
[291] AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport
Lingkai Kong, Molei Tao, Yang Liu, Bryan Wang, Jinmiao Fu, Chien-Chih Wang, Huidong Liu
Main category: cs.LG
TL;DR: AlignFlow introduces Semi-Discrete Optimal Transport (SDOT) to enhance Flow-based Generative Models by establishing explicit optimal alignment between noise distribution and data points, improving training efficiency and scalability.
Details
Motivation: Existing OT-based methods in FGMs use mini-batch sampling which limits scalability to large and high-dimensional datasets. There's a need for more efficient and scalable approaches to couple noise and data distributions.Method: Leverages Semi-Discrete Optimal Transport (SDOT) to partition noise space into Laguerre cells, each mapped to a data point. During training, i.i.d. noise samples are paired with data points via the SDOT map, providing explicit optimal alignment.
Result: AlignFlow scales well to large datasets and model architectures with negligible computational overhead. It improves performance of various state-of-the-art FGM algorithms and can be integrated as a plug-and-play component.
Conclusion: AlignFlow provides an effective and scalable solution for enhancing FGMs through SDOT-based alignment, offering guaranteed convergence and improved training efficiency across different model architectures.
Abstract: Flow-based Generative Models (FGMs) effectively transform noise into complex data distributions. Incorporating Optimal Transport (OT) to couple noise and data during FGM training has been shown to improve the straightness of flow trajectories, enabling more effective inference. However, existing OT-based methods estimate the OT plan using (mini-)batches of sampled noise and data points, which limits their scalability to large and high-dimensional datasets in FGMs. This paper introduces AlignFlow, a novel approach that leverages Semi-Discrete Optimal Transport (SDOT) to enhance the training of FGMs by establishing an explicit, optimal alignment between noise distribution and data points with guaranteed convergence. SDOT computes a transport map by partitioning the noise space into Laguerre cells, each mapped to a corresponding data point. During FGM training, i.i.d. noise samples are paired with data points via the SDOT map. AlignFlow scales well to large datasets and model architectures with negligible computational overhead. Experimental results show that AlignFlow improves the performance of a wide range of state-of-the-art FGM algorithms and can be integrated as a plug-and-play component. Code is available at: https://github.com/konglk1203/AlignFlow.
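Once the SDOT weights are known, pairing noise with data is a cheap cell-membership lookup. A sketch of the Laguerre-cell assignment, argmin over i of ||z - x_i||^2 / 2 - w_i: with zero weights it degenerates to nearest neighbour, and the weights are what SDOT optimises so that each cell carries the right probability mass.

```python
import numpy as np

def laguerre_assign(noise, data, weights):
    """Assign each noise sample to the data point whose Laguerre cell
    contains it."""
    d2 = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    return np.argmin(0.5 * d2 - weights[None, :], axis=1)

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 2))     # i.i.d. noise samples
x = rng.normal(size=(10, 2))       # data points
pairing = laguerre_assign(z, x, np.zeros(10))
print(np.bincount(pairing, minlength=10))   # samples per Laguerre cell
```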
[292] IQNN-CS: Interpretable Quantum Neural Network for Credit Scoring
Abdul Samad Khan, Nouhaila Innan, Aeysha Khalique, Muhammad Shafique
Main category: cs.LG
TL;DR: IQNN-CS is an interpretable quantum neural network framework for multiclass credit risk classification that combines variational QNN with post-hoc explanation techniques and introduces Inter-Class Attribution Alignment (ICAA) to quantify attribution divergence across predicted classes.
Details
Motivation: Credit scoring requires transparency and trust due to its high-stakes nature and regulatory scrutiny, but current Quantum Machine Learning (QML) approaches are black-box models that lack interpretability needed for financial services adoption.Method: Developed IQNN-CS framework combining variational quantum neural network with post-hoc explanation techniques for structured data, and introduced ICAA metric to quantify attribution divergence across predicted classes.
Result: IQNN-CS demonstrated stable training dynamics, competitive predictive performance, and enhanced interpretability when evaluated on two real-world credit datasets.
Conclusion: The framework provides a practical path toward transparent and accountable QML models for financial decision-making by addressing interpretability challenges in quantum machine learning.
Abstract: Credit scoring is a high-stakes task in financial services, where model decisions directly impact individuals’ access to credit and are subject to strict regulatory scrutiny. While Quantum Machine Learning (QML) offers new computational capabilities, its black-box nature poses challenges for adoption in domains that demand transparency and trust. In this work, we present IQNN-CS, an interpretable quantum neural network framework designed for multiclass credit risk classification. The architecture combines a variational QNN with a suite of post-hoc explanation techniques tailored for structured data. To address the lack of structured interpretability in QML, we introduce Inter-Class Attribution Alignment (ICAA), a novel metric that quantifies attribution divergence across predicted classes, revealing how the model distinguishes between credit risk categories. Evaluated on two real-world credit datasets, IQNN-CS demonstrates stable training dynamics, competitive predictive performance, and enhanced interpretability. Our results highlight a practical path toward transparent and accountable QML models for financial decision-making.
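The abstract does not define ICAA precisely. One hypothetical instantiation of "attribution divergence across predicted classes" is the mean pairwise cosine distance between per-class attribution vectors, shown below purely to make the idea concrete:

```python
import numpy as np

def icaa(class_attributions):
    """Hypothetical inter-class attribution divergence score: mean
    pairwise cosine distance between per-class attribution vectors."""
    A = np.stack(class_attributions)        # (n_classes, n_features)
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    sim = A @ A.T
    off_diag = sim[~np.eye(len(A), dtype=bool)]
    return float(1.0 - off_diag.mean())     # higher = more class-distinct

rng = np.random.default_rng(0)
print(icaa([rng.random(10) for _ in range(3)]))
```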
[293] Internalizing World Models via Self-Play Finetuning for Agentic RL
Shiqi Chen, Tongyao Zhu, Zian Wang, Jinghan Zhang, Kangrui Wang, Siyang Gao, Teng Xiao, Yee Whye Teh, Junxian He, Manling Li
Main category: cs.LG
TL;DR: SPA framework improves LLM agent performance in out-of-distribution scenarios by incorporating an internal world model through self-play supervised finetuning before policy optimization.
Details
Motivation: LLM agents struggle in OOD scenarios due to difficulty grounding internal knowledge in complex environmental dynamics, leading to brittle exploration and limited generalization.Method: Decompose world model into state representation and transition modeling, then use Self-Play supervised finetuning to learn the world model before policy optimization.
Result: Significant performance improvements: Sokoban success rate increased from 25.6% to 59.8%, FrozenLake score from 22.1% to 70.9% for Qwen2.5-1.5B-Instruct model.
Conclusion: Equipping LLM agents with an internal world model through SPA framework effectively aligns reasoning with environmental dynamics and boosts RL-based agent training performance.
Abstract: Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. Real-world environments are complex and dynamic, governed by task-specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe that Pass@k, the probability that at least one of $k$ sampled trajectories succeeds, drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model-based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision-making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world-modeling baseline and greatly boosts the RL-based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5-1.5B-Instruct model.
[294] Learn to Change the World: Multi-level Reinforcement Learning with Model-Changing Actions
Ziqing Lu, Babak Hassibi, Lifeng Lai, Weiyu Xu
Main category: cs.LG
TL;DR: The paper introduces MCTVMDP, a framework where agents can actively modify their environment’s dynamics through model-changing actions, rather than just passively adapting to a fixed environment.
Details
Motivation: Traditional RL assumes fixed environments, but real-world agents can actively change their environments to increase rewards. This work explores how agents can reconfigure transition processes to optimize performance.Method: Proposes multi-layer configurable time-varying Markov decision process (MCTVMDP) with upper-level model-changing actions that configure the non-stationary transition function of the lower-level MDP.
Result: The framework enables joint optimization of both configuration policies (upper-level) and primitive action policies (lower-level) to maximize long-term rewards.
Conclusion: MCTVMDP provides a formal model for agents that can actively modify their environment dynamics, extending beyond traditional passive adaptation in reinforcement learning.
Abstract: Reinforcement learning usually assumes a given or sometimes even fixed environment in which an agent seeks an optimal policy to maximize its long-term discounted reward. In contrast, we consider agents that are not limited to passive adaptations: they instead have model-changing actions that actively modify the RL model of world dynamics itself. Reconfiguring the underlying transition processes can potentially increase the agents’ rewards. Motivated by this setting, we introduce the multi-layer configurable time-varying Markov decision process (MCTVMDP). In an MCTVMDP, the lower-level MDP has a non-stationary transition function that is configurable through upper-level model-changing actions. The agent’s objective consists of two parts: optimize the configuration policies in the upper-level MDP, and optimize the primitive action policies in the lower-level MDP, to jointly improve its expected long-term reward.
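A structural sketch of the two-level interface described above (class and method names are ours, not the paper's): the upper level chooses which transition function governs the lower-level MDP, and primitive actions then step through the configured dynamics.

```python
import random
from typing import Any, Callable, Dict

class MCTVMDP:
    """Two-level process: model-changing actions configure the
    lower-level transition function; primitive actions step through it."""
    def __init__(self, transition_families: Dict[Any, Callable]):
        self.families = transition_families
        self.config = None

    def configure(self, config):          # upper-level model-changing action
        self.config = config

    def step(self, state, action):        # lower-level primitive action
        return self.families[self.config](state, action)

# Two candidate dynamics for a 1-D random walk; the agent may choose to
# switch the world from "noisy" to "reliable".
env = MCTVMDP({
    "noisy": lambda s, a: s + a + random.choice([-1, 0, 1]),
    "reliable": lambda s, a: s + a,
})
env.configure("reliable")
print(env.step(0, +1))
```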
[295] LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration
Yuyao Zhang, Jinghao Li, Yu-Wing Tai
Main category: cs.LG
TL;DR: LayerCraft is a modular framework using LLMs as agents for structured image generation and editing, enabling spatial composition control and object consistency without retraining T2I models.
Details
Motivation: Existing T2I systems lack intuitive control over spatial composition, object consistency, and multi-step editing, limiting user control and interpretability.Method: Uses LLM agents (ChainArchitect for CoT layout planning and Object Integration Network) to orchestrate structured generation and layered object integration with off-the-shelf T2I models.
Result: Enables decomposition of scenes, reasoning about object placement, and seamless object insertion while preserving identity, context, and style across diverse images.
Conclusion: LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort through modular, interpretable control.
Abstract: Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present $\textbf{LayerCraft}$, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) $\textit{structured generation}$ from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) $\textit{layered object integration}$, allowing users to insert and customize objects – such as characters or props – across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the $\textbf{ChainArchitect}$ for CoT-driven layout planning, and the $\textbf{Object Integration Network (OIN)}$ for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at https://github.com/PeterYYZhang/LayerCraft.
[296] Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models
Samuel Paech, Allen Roush, Judah Goldfeder, Ravid Shwartz-Ziv
Main category: cs.LG
TL;DR: Antislop is a framework that detects and eliminates repetitive phraseology (“slop”) in LLM outputs using three innovations: a backtracking sampler, automated slop profiling, and token-level fine-tuning (FTPO).
Details
Motivation: Widespread LLM adoption has introduced characteristic repetitive phraseology that degrades output quality and makes AI-generated text immediately recognizable.Method: Combines three approaches: (1) Antislop Sampler using backtracking to suppress unwanted strings, (2) automated pipeline profiling model-specific slop against human baselines, (3) Final Token Preference Optimization (FTPO) - a novel fine-tuning method that surgically adjusts logits.
Result: Antislop Sampler suppresses 8,000+ patterns while maintaining quality (vs. token banning unusable at 2,000). FTPO achieves 90% slop reduction while maintaining or improving performance on GSM8K, MMLU, and creative writing tasks.
Conclusion: FTPO effectively reduces slop while preserving model performance, outperforming DPO which suffers significant degradation in writing quality and lexical diversity.
Abstract: Widespread LLM adoption has introduced characteristic repetitive phraseology, termed “slop,” which degrades output quality and makes AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000$\times$ more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results under MIT license: https://github.com/sam-paech/auto-antislop.
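The backtracking sampler is the most mechanical of the three pieces. A toy version of the idea (our simplification; the released sampler operates on token logits and real tokenizers): reject a continuation that completes a banned string, try the next-best candidate, and back up further if a position runs out of options.

```python
def antislop_sample(step, is_banned, max_len=4):
    """step(prefix) -> candidate tokens ranked by model preference;
    is_banned(prefix) -> True if the prefix now ends in a banned string."""
    prefix, tried = [], [set()]          # tried[k]: tokens rejected at k
    while len(prefix) < max_len:
        n = len(prefix)
        cands = [c for c in step(prefix) if c not in tried[n]]
        if not cands:                    # dead end: backtrack one position
            if n == 0:
                break
            tried.pop()
            tried[n - 1].add(prefix.pop())
            continue
        prefix.append(cands[0])
        if is_banned(prefix):            # suppress the slop pattern
            prefix.pop()
            tried[n].add(cands[0])
        else:
            tried.append(set())
    return prefix

# Demo: ban the bigram ("delve", "into"); the sampler routes around it.
step = lambda p: [t for t in ("delve", "into", "explore") if not p or t != p[-1]]
is_banned = lambda p: tuple(p[-2:]) in {("delve", "into")}
print(antislop_sample(step, is_banned))   # ['delve', 'explore', ...]
```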
[297] Physics-informed data-driven machine health monitoring for two-photon lithography
Sixian Jia, Zhiqiao Dong, Chenhui Shao
Main category: cs.LG
TL;DR: This paper presents three physics-informed data-driven methods for monitoring two-photon lithography (TPL) machine health to enable timely maintenance and prevent fabrication quality issues.
Details
Motivation: Current TPL maintenance relies on experience rather than informed monitoring, leading to either untimely maintenance causing downtime and poor quality, or unnecessary maintenance causing inefficiencies.Method: Three methods integrating physics-informed data-driven predictive models for structure dimensions with statistical approaches, designed to handle increasingly complex scenarios with different levels of generalizability.
Result: The approaches achieved high accuracies across all test scenarios using a comprehensive experimental dataset with six process parameter combinations and six structure dimensions under two machine health conditions, demonstrating excellent effectiveness, robustness, and generalizability.
Conclusion: These results represent a significant step toward condition-based maintenance for TPL systems, enabling more informed and timely maintenance decisions.
Abstract: Two-photon lithography (TPL) is a sophisticated additive manufacturing technology for creating three-dimensional (3D) micro- and nano-structures. Maintaining the health of TPL systems is critical for ensuring consistent fabrication quality. Current maintenance practices often rely on experience rather than informed monitoring of machine health, resulting in either untimely maintenance that causes machine downtime and poor-quality fabrication, or unnecessary maintenance that leads to inefficiencies and avoidable downtime. To address this gap, this paper presents three methods for accurate and timely monitoring of TPL machine health. Through integrating physics-informed data-driven predictive models for structure dimensions with statistical approaches, the proposed methods are able to handle increasingly complex scenarios featuring different levels of generalizability. A comprehensive experimental dataset that encompasses six process parameter combinations and six structure dimensions under two machine health conditions was collected to evaluate the effectiveness of the proposed approaches. Across all test scenarios, the approaches are shown to achieve high accuracies, demonstrating excellent effectiveness, robustness, and generalizability. These results represent a significant step toward condition-based maintenance for TPL systems.
[298] Online Correlation Clustering: Simultaneously Optimizing All $\ell_p$-norms
Sami Davies, Benjamin Moseley, Heather Newman
Main category: cs.LG
TL;DR: This paper presents the first online algorithm that simultaneously approximates all ℓ_p-norms for correlation clustering, achieving O(log⁴ n)-competitive ratios for all norms, with improved bounds for ℓ₁ and ℓ∞ norms.
Details
Motivation: To bridge the gap between offline correlation clustering's ability to simultaneously approximate all ℓ_p-norms and the online setting, where such guarantees were previously unknown. The work is motivated by a hardness result showing fundamental separation between ℓ₁ and ℓ∞ objectives in standard online models.Method: Developed a single algorithm for the online-with-a-sample (AOS) model that uses a small constant fraction of input as a sample to produce one clustering that works for all ℓ_p-norms simultaneously.
Result: The algorithm achieves: O(log⁴ n)-competitive for all ℓ_p-norms with high probability, O(log n)-competitive for ℓ∞-norm with high probability, and O(1)-competitive for ℓ₁-norm in expectation. Also proved Ω(n¹/³) lower bound for ℓ∞-norm in standard RO model.
Conclusion: Successfully translated the offline “all-norms” guarantee to online setting using the AOS model, demonstrating that powerful simultaneous approximation is possible online with a small sample, while showing fundamental limitations in standard online models.
Abstract: The $\ell_p$-norm objectives for correlation clustering present a fundamental trade-off between minimizing total disagreements (the $\ell_1$-norm) and ensuring fairness to individual nodes (the $\ell_\infty$-norm). Surprisingly, in the offline setting it is possible to simultaneously approximate all $\ell_p$-norms with a single clustering. Can this powerful guarantee be achieved in an online setting? This paper provides the first affirmative answer. We present a single algorithm for the online-with-a-sample (AOS) model that, given a small constant fraction of the input as a sample, produces one clustering that is simultaneously $O(\log^4 n)$-competitive for all $\ell_p$-norms with high probability, $O(\log n)$-competitive for the $\ell_\infty$-norm with high probability, and $O(1)$-competitive for the $\ell_1$-norm in expectation. This work successfully translates the offline “all-norms” guarantee to the online world. Our setting is motivated by a new hardness result that demonstrates a fundamental separation between these objectives in the standard random-order (RO) online model. Namely, while the $\ell_1$-norm is trivially $O(1)$-approximable in the RO model, we prove that any algorithm in the RO model for the fairness-promoting $\ell_\infty$-norm must have a competitive ratio of at least $\Omega(n^{1/3})$. This highlights the necessity of a different beyond-worst-case model. We complement our algorithm with lower bounds, showing our competitive ratios for the $\ell_1$- and $\ell_\infty$- norms are nearly tight in the AOS model.
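For reference, the objective being approximated is the $p$-norm of the per-node disagreement vector, in the standard formulation used in this line of work:

```latex
% y(v) counts the disagreements incident to node v under clustering C:
% the "+" edges at v that are cut, plus the "-" edges at v kept inside
% v's cluster. l_1 recovers total disagreements; l_infinity the
% worst-off node.
\[
  \operatorname{obj}_p(\mathcal{C})
    = \lVert y \rVert_p
    = \Bigl( \sum_{v \in V} y(v)^p \Bigr)^{1/p},
  \qquad 1 \le p \le \infty .
\]
```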
[299] Operator Flow Matching for Timeseries Forecasting
Yolanne Yi Ran Lee, Kyriakos Flouris
Main category: cs.LG
TL;DR: TempO is a latent flow matching model that uses sparse conditioning and channel folding to efficiently forecast high-dimensional PDE dynamics, outperforming state-of-the-art methods while being parameter- and memory-efficient.
Details
Motivation: Existing autoregressive and diffusion-based approaches for forecasting PDE-governed dynamics suffer from cumulative errors and discretisation artifacts that limit long, physically consistent forecasts.Method: Proposed TempO, a latent flow matching model leveraging sparse conditioning with channel folding to efficiently process 3D spatiotemporal fields using time-conditioned Fourier layers to capture multi-scale modes.
Result: TempO outperforms state-of-the-art baselines across three benchmark PDE datasets, with spectral analysis demonstrating superior recovery of multi-scale dynamics.
Conclusion: Flow matching offers an efficient alternative for deterministic sampling in PDE forecasting, with TempO achieving superior performance while maintaining parameter- and memory-light design compared to attention-based or convolutional regressors.
Abstract: Forecasting high-dimensional, PDE-governed dynamics remains a core challenge for generative modeling. Existing autoregressive and diffusion-based approaches often suffer from cumulative errors and discretisation artifacts that limit long, physically consistent forecasts. Flow matching offers a natural alternative, enabling efficient, deterministic sampling. We prove an upper bound on Fourier Neural Operator (FNO) approximation error and propose TempO, a latent flow matching model leveraging sparse conditioning with channel folding to efficiently process 3D spatiotemporal fields using time-conditioned Fourier layers to capture multi-scale modes with high fidelity. TempO outperforms state-of-the-art baselines across three benchmark PDE datasets, and spectral analysis further demonstrates superior recovery of multi-scale dynamics, while efficiency studies highlight its parameter- and memory-light design compared to attention-based or convolutional regressors.
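For readers new to flow matching, the base objective in its standard conditional form with linear interpolation paths is a simple regression onto constant target velocities; TempO's latent, Fourier-conditioned variant is more elaborate, so treat this only as the generic template:

```python
import torch

def flow_matching_loss(v_theta, x0, x1):
    """Conditional flow matching with linear paths x_t = (1-t) x0 + t x1:
    regress the velocity field onto the constant target x1 - x0."""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    return ((v_theta(xt, t) - (x1 - x0)) ** 2).mean()

# Toy usage: a linear velocity model over 2-D states.
net = torch.nn.Linear(3, 2)
v = lambda x, t: net(torch.cat([x, t], dim=1))
loss = flow_matching_loss(v, torch.randn(16, 2), torch.randn(16, 2))
loss.backward()
print(float(loss))
```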
[300] DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
Main category: cs.LG
TL;DR: DLER is a reinforcement learning training recipe that optimizes reasoning language models to generate concise outputs while maintaining high accuracy, achieving over 70% length reduction with improved performance.
Details
Motivation: Reasoning language models generate unnecessarily long outputs, and maximizing intelligence per token (accuracy relative to response length) remains an open problem that needs addressing.Method: DLER combines batch-wise reward normalization, higher clipping, dynamic sampling, and simple truncation length penalty to address RL optimization challenges like bias in advantage estimation, entropy collapse, and sparse rewards.
Result: DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70% while surpassing all previous baselines in accuracy. DLER-7B generates multiple concise responses with 28% higher accuracy and lower latency compared to DeepSeek-R1-7B.
Conclusion: The method effectively optimizes reasoning language models for conciseness while maintaining or improving accuracy, with additional variants like Difficulty-Aware DLER and update-selective merging for further efficiency gains and data-scarce scenarios.
Abstract: Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token, accuracy relative to response length, remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baselines in accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
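Two of the recipe's ingredients, the truncation length penalty and batch-wise reward normalisation, are simple enough to sketch; the remaining components (higher clipping, dynamic sampling) live inside the RL update itself. Function and argument names below are ours:

```python
import numpy as np

def dler_style_advantages(correct, lengths, max_len, eps=1e-8):
    """Zero out rewards for over-length responses (truncation penalty),
    then normalise rewards across the batch."""
    correct = np.asarray(correct, dtype=float)
    lengths = np.asarray(lengths)
    raw = np.where(lengths <= max_len, correct, 0.0)  # truncation penalty
    return (raw - raw.mean()) / (raw.std() + eps)     # batch-wise norm

adv = dler_style_advantages(correct=[1, 1, 0, 1],
                            lengths=[180, 420, 95, 260], max_len=300)
print(adv.round(2))   # the over-length correct answer is penalised
```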
[301] Navigating the consequences of mechanical ventilation in clinical intensive care settings through an evolutionary game-theoretic framework
David J. Albers, Tell D. Bennett, Jana de Wiljes, Bradford J. Smith, Peter D. Sottile, J. N. Stroh
Main category: cs.LG
TL;DR: This paper develops a framework using evolutionary game theory to analyze mechanical ventilation strategies from clinical data, aiming to optimize and personalize critical care respiratory management.
Details
Motivation: To understand the effects of mechanical ventilation strategies on patient outcomes by analyzing heterogeneous patient-ventilator systems within clinical decision-making environments, enabling hypothesis generation for improved care.Method: Uses evolutionary game theory (EGT) to analyze breath behaviors from clinical data, creating quantitative precursors for deeper analysis through probabilistic and stochastic methods like reinforcement learning. Validated on synthetic data before applying to real ICU data.
Result: Developed a scalable method for analyzing complex patient-ventilator-care systems (J6), providing analytical validation on synthetic data and exposing real-world complexities in ICU data applications.
Conclusion: The EGT-based approach represents a step toward mechanical ventilation optimization and personalization, with potential for developing state transition models that simulate MV decision effects using empirical and game-theoretic elements.
Abstract: Identifying the effects of mechanical ventilation strategies and protocols in critical care requires analyzing data from heterogeneous patient-ventilator systems within the context of the clinical decision-making environment. This research develops a framework to help understand the consequences of mechanical ventilation (MV) and adjunct care decisions on patient outcome from observations of critical care patients receiving MV. Developing an understanding of and improving critical care respiratory management requires the analysis of existing secondary-use clinical data to generate hypotheses about advantageous variations and adaptations of current care. This work introduces a perspective of the joint patient-ventilator-care systems (so-called J6) to develop a scalable method for analyzing data and trajectories of these complex systems. To that end, breath behaviors are analyzed using evolutionary game theory (EGT), which generates the necessary quantitative precursors for deeper analysis through probabilistic and stochastic machinery such as reinforcement learning. This result is one step along the pathway toward MV optimization and personalization. The EGT-based process is analytically validated on synthetic data to reveal potential caveats before proceeding to real-world ICU data applications that expose complexities of the data-generating process J6. The discussion includes potential developments toward a state transition model for simulating the effects of MV decisions using empirical and game-theoretic elements.
[302] A Simple Method for PMF Estimation on Large Supports
Alex Shtoff
Main category: cs.LG
TL;DR: A nonparametric method for estimating multi-modal, heavy-tailed PMFs using data-dependent low-pass filtering via path graph Laplacian eigenvectors.
Details
Motivation: To estimate probability mass functions on large discrete supports that are multi-modal and heavy-tailed, preserving coarse structure while suppressing sampling noise.Method: Treat empirical PMF as signal on line graph, apply data-dependent low-pass filter using eigenvectors of perturbed path graph Laplacian (symmetric tridiagonal matrix), then project and post-process with clipping and re-normalization.
Result: Method preserves coarse structure while suppressing noise, compares favorably to logspline and Gaussian-KDE baselines, is computationally efficient (O(support × subspace dimension)), and includes data-driven dimension selection.
Conclusion: The approach is short to implement, robust across sample sizes, suitable for automated pipelines and exploratory analysis due to reliability and speed, though has known failure modes like abrupt discontinuities.
Abstract: We study nonparametric estimation of a probability mass function (PMF) on a large discrete support, where the PMF is multi-modal and heavy-tailed. The core idea is to treat the empirical PMF as a signal on a line graph and apply a data-dependent low-pass filter. Concretely, we form a symmetric tridiagonal operator, the path graph Laplacian perturbed with a diagonal matrix built from the empirical PMF, then compute the eigenvectors corresponding to the smallest few eigenvalues. Projecting the empirical PMF onto this low-dimensional subspace produces a smooth, multi-modal estimate that preserves coarse structure while suppressing noise. A light post-processing step of clipping and re-normalizing yields a valid PMF. Because we compute the eigenpairs of a symmetric tridiagonal matrix, the computation is reliable and runs in time and memory proportional to the support size times the dimension of the desired low-dimensional subspace. We also provide a practical, data-driven rule for selecting the dimension based on an orthogonal-series risk estimate, so the method “just works” with minimal tuning. On synthetic and real heavy-tailed examples, the approach preserves coarse structure while suppressing sampling noise, and compares favorably to logspline and Gaussian-KDE baselines in the intended regimes. However, it has known failure modes (e.g., abrupt discontinuities). The method is short to implement, robust across sample sizes, and suitable for automated pipelines and exploratory analysis at scale because of its reliability and speed.
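The abstract describes the estimator closely enough to sketch end to end; the sign and scale (`alpha`) of the diagonal perturbation are our assumptions.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def smooth_pmf(counts, dim, alpha=1.0):
    p_emp = counts / counts.sum()
    n = len(p_emp)
    # Path-graph Laplacian: tridiagonal, node degrees on the diagonal.
    diag = np.full(n, 2.0)
    diag[0] = diag[-1] = 1.0
    diag += alpha * p_emp                       # data-dependent perturbation
    off = np.full(n - 1, -1.0)
    # Eigenpairs for the smallest `dim` eigenvalues: cheap and reliable
    # for a symmetric tridiagonal matrix.
    _, vecs = eigh_tridiagonal(diag, off, select="i",
                               select_range=(0, dim - 1))
    p_hat = vecs @ (vecs.T @ p_emp)             # low-pass projection
    p_hat = np.clip(p_hat, 0.0, None)           # clip ...
    return p_hat / p_hat.sum()                  # ... and re-normalise

rng = np.random.default_rng(0)
grid = np.arange(1000)
true = np.exp(-0.5 * ((grid - 200) / 30) ** 2) \
     + np.exp(-0.5 * ((grid - 700) / 80) ** 2)   # two modes
true /= true.sum()
counts = np.bincount(rng.choice(1000, size=5000, p=true),
                     minlength=1000).astype(float)
print(np.abs(smooth_pmf(counts, dim=30) - true).sum())  # L1 error
```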
[303] Predicting the Unpredictable: Reproducible BiLSTM Forecasting of Incident Counts in the Global Terrorism Database (GTD)
Oluwasegun Adegoke
Main category: cs.LG
TL;DR: BiLSTM model outperforms classical and deep learning baselines for short-horizon weekly terrorism incident forecasting using GTD data, with key findings on optimal historical data length, lookback windows, and feature importance.
Details
Motivation: To develop a reproducible pipeline for short-horizon forecasting of weekly terrorism incident counts using the Global Terrorism Database, establishing transparent baseline-beating references for this domain.Method: Built reproducible pipeline with fixed time-based splits, evaluated Bidirectional LSTM (BiLSTM) against classical anchors (seasonal-naive, linear/ARIMA) and deep LSTM-Attention baseline. Conducted ablations on temporal memory, training-history length, spatial grain, lookback size, and feature groups.
Result: BiLSTM achieved RMSE 6.38 on held-out test set, outperforming LSTM-Attention (9.19; +30.6%) and linear lag-regression baseline (+35.4% RMSE gain), with parallel improvements in MAE and MAPE. Models trained on long historical data generalized best, moderate lookback (20-30 weeks) provided strong context, and bidirectional encoding was critical.
Conclusion: The study provides a transparent, baseline-beating reference for GTD incident forecasting, demonstrating BiLSTM’s effectiveness and identifying optimal configurations for short-horizon terrorism forecasting.
Abstract: We study short-horizon forecasting of weekly terrorism incident counts using the Global Terrorism Database (GTD, 1970–2016). We build a reproducible pipeline with fixed time-based splits and evaluate a Bidirectional LSTM (BiLSTM) against strong classical anchors (seasonal-naive, linear/ARIMA) and a deep LSTM-Attention baseline. On the held-out test set, the BiLSTM attains RMSE 6.38, outperforming LSTM-Attention (9.19; +30.6%) and a linear lag-regression baseline (+35.4% RMSE gain), with parallel improvements in MAE and MAPE. Ablations varying temporal memory, training-history length, spatial grain, lookback size, and feature groups show that models trained on long historical data generalize best; a moderate lookback (20–30 weeks) provides strong context; and bidirectional encoding is critical for capturing both build-up and aftermath patterns within the window. Feature-group analysis indicates that short-horizon structure (lagged counts and rolling statistics) contributes most, with geographic and casualty features adding incremental lift. We release code, configs, and compact result tables, and provide a data/ethics statement documenting GTD licensing and research-only use. Overall, the study offers a transparent, baseline-beating reference for GTD incident forecasting.
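For readers unfamiliar with the model class, a minimal PyTorch sketch of a bidirectional LSTM forecaster over a lookback window of weekly counts follows; the hidden size and the 26-week lookback are illustrative, not the paper's tuned configuration (which favors 20-30 weeks).

```python
import torch
import torch.nn as nn

class BiLSTMForecaster(nn.Module):
    """Bidirectional LSTM mapping a lookback window of weekly features
    to the next week's incident count."""
    def __init__(self, n_features=1, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)      # concatenated directions

    def forward(self, x):                         # x: (batch, lookback, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :]).squeeze(-1)

model = BiLSTMForecaster()
x = torch.randn(8, 26, 1)                         # 8 windows, 26-week lookback
loss = nn.MSELoss()(model(x), torch.rand(8))
loss.backward()
```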
[304] Policy Transfer Ensures Fast Learning for Continuous-Time LQR with Entropy Regularization
Xin Guo, Zijiu Lyu
Main category: cs.LG
TL;DR: This paper provides the first theoretical proof of policy transfer for continuous-time RL, showing that optimal policies from one LQR can serve as near-optimal initialization for related LQRs while maintaining convergence rates.
Details
Motivation: To address the inefficiency of training RL agents from scratch on complex tasks by leveraging transfer learning approaches similar to those used in large language models.
Method: Investigates policy transfer in continuous-time linear quadratic regulators (LQRs) with entropy regularization, and introduces a novel policy learning algorithm for continuous-time LQRs.
Result: Proves that optimal policies from source LQRs serve as near-optimal initialization for target LQRs, preserves original convergence rates, and achieves global linear and local super-linear convergence with the new algorithm.
Conclusion: Demonstrates theoretical guarantees and algorithmic benefits of transfer learning in continuous-time RL, extending prior work from discrete to continuous time settings, with additional insights connecting LQRs to diffusion model stability.
Abstract: Reinforcement Learning (RL) enables agents to learn optimal decision-making strategies through interaction with an environment, yet training from scratch on complex tasks can be highly inefficient. Transfer learning (TL), widely successful in large language models (LLMs), offers a promising direction for enhancing RL efficiency by leveraging pre-trained models. This paper investigates policy transfer, a TL approach that initializes learning in a target RL task using a policy from a related source task, in the context of continuous-time linear quadratic regulators (LQRs) with entropy regularization. We provide the first theoretical proof of policy transfer for continuous-time RL, showing that a policy optimal for one LQR serves as a near-optimal initialization for closely related LQRs, while preserving the original algorithm’s convergence rate. Furthermore, we introduce a novel policy learning algorithm for continuous-time LQRs that achieves global linear and local super-linear convergence. Our results demonstrate both theoretical guarantees and algorithmic benefits of transfer learning in continuous-time RL, addressing a gap in existing literature and extending prior work from discrete to continuous time settings. As a byproduct of our analysis, we derive the stability of a class of continuous-time score-based diffusion models via their connection with LQRs.
[305] A simple mean field model of feature learning
Niclas Göring, Chris Mingard, Yoonsoo Nam, Ard Louis
Main category: cs.LG
TL;DR: A mean-field theory for Bayesian neural networks reveals a symmetry breaking phase transition where networks align with target functions at finite width, with self-reinforcing input feature selection being crucial for accurate generalization predictions.
Details
Motivation: Feature learning in neural networks remains poorly understood, particularly how networks adapt their internal representations during training.
Method: Developed a tractable self-consistent mean-field theory for Bayesian posterior of two-layer non-linear networks trained with stochastic gradient Langevin dynamics, incorporating self-reinforcing input feature selection.
Result: At infinite width, theory reduces to kernel ridge regression; at finite width predicts symmetry breaking phase transition where networks align with target functions. Basic MF theory underestimates generalization improvements, but incorporating self-reinforcing feature selection quantitatively matches learning curves.
Conclusion: The mean-field theory provides mechanistic insight into feature learning, with self-reinforcing input feature selection being a key mechanism for accurate generalization predictions.
Abstract: Feature learning (FL), where neural networks adapt their internal representations during training, remains poorly understood. Using methods from statistical physics, we derive a tractable, self-consistent mean-field (MF) theory for the Bayesian posterior of two-layer non-linear networks trained with stochastic gradient Langevin dynamics (SGLD). At infinite width, this theory reduces to kernel ridge regression, but at finite width it predicts a symmetry breaking phase transition where networks abruptly align with target functions. While the basic MF theory provides theoretical insight into the emergence of FL in the finite-width regime, semi-quantitatively predicting the onset of FL with noise or sample size, it substantially underestimates the improvements in generalisation after the transition. We trace this discrepancy to a key mechanism absent from the plain MF description: \textit{self-reinforcing input feature selection}. Incorporating this mechanism into the MF theory allows us to quantitatively match the learning curves of SGLD-trained networks and provides mechanistic insight into FL.
[306] Finding geodesics with the Deep Ritz method
Conor Rowan
Main category: cs.LG
TL;DR: The paper argues that geodesic problems are well-suited for the Deep Ritz method in scientific machine learning, demonstrating this with three numerical examples from path planning, optics, and solid mechanics.
Details
Motivation: Geodesic problems are ubiquitous in physics and engineering but have received little attention in scientific machine learning (SciML) despite their simple geometry, variational structure, and natural nonlinearity.
Method: The authors propose using the Deep Ritz method to solve geodesic problems, leveraging its suitability for problems with variational structure and nonlinearity.
Result: The approach is substantiated with three numerical examples from different domains: path planning, optics, and solid mechanics, showing promising results.
Conclusion: Geodesic problems represent a promising application area for the Deep Ritz method and a fruitful direction for future SciML research, though this work serves as an initial demonstration rather than an exhaustive study.
Abstract: Geodesic problems involve computing trajectories between prescribed initial and final states to minimize a user-defined measure of distance, cost, or energy. They arise throughout physics and engineering – for instance, in determining optimal paths through complex environments, modeling light propagation in refractive media, and the study of spacetime trajectories in control theory and general relativity. Despite their ubiquity, the scientific machine learning (SciML) community has given relatively little attention to investigating its methods in the context of these problems. In this work, we argue that given their simple geometry, variational structure, and natural nonlinearity, geodesic problems are particularly well-suited for the Deep Ritz method. We substantiate this claim with three numerical examples drawn from path planning, optics, and solid mechanics. Our goal is not to provide an exhaustive study of geodesic problems, but rather to identify a promising application of the Deep Ritz method and a fruitful direction for future SciML research.
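A minimal PyTorch sketch of how a Deep Ritz formulation of a geodesic problem can look: the path is parameterized by a small network with the endpoints enforced by construction, and the Dirichlet energy is minimized over Monte Carlo samples of t. The Euclidean integrand is a placeholder; the paper's examples weight it with problem-specific cost or refractive-index fields.

```python
import torch
import torch.nn as nn

# Path x(t) = (1-t)*x0 + t*x1 + t*(1-t)*net(t): the boundary conditions
# x(0) = x0 and x(1) = x1 hold by construction, so only the interior is learned.
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))
x0 = torch.tensor([0.0, 0.0])
x1 = torch.tensor([1.0, 1.0])

def path(t):
    return (1 - t) * x0 + t * x1 + t * (1 - t) * net(t)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    t = torch.rand(128, 1, requires_grad=True)   # Monte Carlo quadrature nodes
    x = path(t)
    # Velocity dx/dt via autograd, one output coordinate at a time.
    v = torch.cat([torch.autograd.grad(x[:, i].sum(), t, create_graph=True)[0]
                   for i in range(2)], dim=1)
    energy = v.pow(2).sum(dim=1).mean()          # Dirichlet energy of the path
    opt.zero_grad()
    energy.backward()
    opt.step()
```

Minimizing the Dirichlet energy rather than the raw length yields constant-speed geodesics and sidesteps the reparameterization invariance of the length functional.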
[307] An Advanced Two-Stage Model with High Sensitivity and Generalizability for Prediction of Hip Fracture Risk Using Multiple Datasets
Shuo Sun, Meiling Zhou, Chen Zhao, Joyce H. Keyak, Nancy E. Lane, Jeffrey D. Deng, Kuan-Jui Su, Hui Shen, Hong-Wen Deng, Kui Zhang, Weihua Zhou
Main category: cs.LG
TL;DR: A sequential two-stage model combining clinical data and DXA imaging improves hip fracture risk prediction, outperforming traditional T-score and FRAX methods with higher sensitivity and fewer missed cases.
Details
Motivation: Current hip fracture risk assessment tools like DXA T-score and FRAX lack sensitivity and often miss high-risk individuals, especially those without prior fractures or with osteopenia.
Method: A two-stage model using data from MrOS, SOF, and UK Biobank: Stage 1 uses clinical/demographic/functional variables for baseline risk, Stage 2 incorporates DXA features for refinement, with internal and external validation.
Result: The model achieved higher sensitivity and reduced missed cases compared to T-score and FRAX, showing consistent performance across different cohorts.
Conclusion: The two-stage framework provides a cost-effective, personalized approach for early hip fracture risk assessment with improved accuracy over existing methods.
Abstract: Hip fractures are a major cause of disability, mortality, and healthcare burden in older adults, underscoring the need for early risk assessment. However, commonly used tools such as the DXA T-score and FRAX often lack sensitivity and miss individuals at high risk, particularly those without prior fractures or with osteopenia. To address this limitation, we propose a sequential two-stage model that integrates clinical and imaging information to improve prediction accuracy. Using data from the Osteoporotic Fractures in Men Study (MrOS), the Study of Osteoporotic Fractures (SOF), and the UK Biobank, Stage 1 (Screening) employs clinical, demographic, and functional variables to estimate baseline risk, while Stage 2 (Imaging) incorporates DXA-derived features for refinement. The model was rigorously validated through internal and external testing, showing consistent performance and adaptability across cohorts. Compared to T-score and FRAX, the two-stage framework achieved higher sensitivity and reduced missed cases, offering a cost-effective and personalized approach for early hip fracture risk assessment. Keywords: Hip Fracture, Two-Stage Model, Risk Prediction, Sensitivity, DXA, FRAX
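As an illustration of the sequential structure only (the paper does not specify its base learners, so logistic regression and gradient boosting are stand-ins, and the toy data replaces the real cohorts):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_clin = rng.normal(size=(500, 6))    # clinical/demographic/functional variables
X_dxa = rng.normal(size=(500, 4))     # DXA-derived imaging features
y = rng.integers(0, 2, size=500)      # hip fracture outcome (toy labels)

# Stage 1 (Screening): baseline risk from clinical variables only.
stage1 = LogisticRegression(max_iter=1000).fit(X_clin, y)
risk1 = stage1.predict_proba(X_clin)[:, 1]

# Stage 2 (Imaging): refine the Stage-1 risk with DXA features.
X_stage2 = np.column_stack([X_clin, X_dxa, risk1])
stage2 = GradientBoostingClassifier(random_state=0).fit(X_stage2, y)
final_risk = stage2.predict_proba(X_stage2)[:, 1]
```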
[308] Automotive Crash Dynamics Modeling Accelerated with Machine Learning
Mohammad Amin Nabian, Sudeep Chavare, Deepak Akhare, Rishikesh Ranade, Ram Cherukuri, Srinivas Tadepalli
Main category: cs.LG
TL;DR: Machine learning surrogate models for crashworthiness assessment using NVIDIA PhysicsNeMo framework, achieving significant computational cost reduction while capturing deformation trends.
Details
Motivation: Traditional finite element simulations for crashworthiness assessment are computationally expensive and time-consuming, creating need for efficient alternatives.
Method: Used MeshGraphNet and Transolver neural network architectures with three transient dynamics strategies (time-conditional, autoregressive, stability-enhanced autoregressive) on Body-in-White crash dataset with 150 LS-DYNA simulations.
Result: Models captured overall deformation trends with reasonable fidelity, achieving orders-of-magnitude computational cost reduction compared to full FE simulations.
Conclusion: Machine learning approaches are feasible for structural crash dynamics, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
Abstract: Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
[309] Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection
Denis Janiak, Jakub Binkowski, Tomasz Kajdanowicz
Main category: cs.LG
TL;DR: This paper analyzes Mahalanobis distance methods for OOD detection, showing they aren’t universally reliable. It defines ideal representation geometry and uses spectral metrics to predict OOD performance, then proposes radially scaled ℓ2 normalization to improve detection by controlling feature space geometry.
Details
Motivation: To understand the impact of representation geometry and normalization on Mahalanobis distance methods for OOD detection, which is critical for reliable deep learning deployment but not fully understood.
Method: Comprehensive empirical study across diverse image foundation models, datasets, and normalization schemes. Analysis of representation geometry using spectral and intrinsic-dimensionality metrics. Proposal of radially scaled ℓ2 normalization with tunable parameter to control feature space geometry.
Result: Mahalanobis-based methods aren’t universally reliable. Spectral and intrinsic-dimensionality metrics can accurately predict OOD performance. Radially scaled ℓ2 normalization significantly improves OOD detection by systematically contracting or expanding representations.
Conclusion: The findings bridge the gap between representation geometry, normalization, and OOD performance, offering new insights for designing more effective and reliable deep learning models.
Abstract: Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. While Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren’t universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model’s OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
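A sketch of one natural form of the radial scaling combined with a single-Gaussian Mahalanobis OOD score (the paper's exact parameterization of the tunable exponent may differ, and in practice the distance is usually computed per class):

```python
import numpy as np

def radial_scale(X, p=1.0, eps=1e-12):
    """Radially scaled l2 normalization: p=1 recovers standard l2
    normalization, p=0 leaves features untouched; other values of p
    contract or expand representations radially."""
    r = np.linalg.norm(X, axis=1, keepdims=True) + eps
    return X / r**p

def mahalanobis_ood_score(train_feats, test_feats, p=1.0):
    """Squared Mahalanobis distance to a Gaussian fit of the radially
    rescaled training features; higher means more likely OOD."""
    Z_tr, Z_te = radial_scale(train_feats, p), radial_scale(test_feats, p)
    mu = Z_tr.mean(axis=0)
    cov = np.cov(Z_tr, rowvar=False) + 1e-6 * np.eye(Z_tr.shape[1])
    prec = np.linalg.inv(cov)
    d = Z_te - mu
    return np.einsum("ij,jk,ik->i", d, prec, d)
```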
[310] ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
Yongchan Kwon, Shang Zhu, Federico Bianchi, Kaitlyn Zhou, James Zou
Main category: cs.LG
TL;DR: The paper introduces ReasonIF, a benchmark to evaluate how well large reasoning models follow user instructions during their reasoning process, finding current models perform poorly (below 25% compliance) and proposing methods to improve instruction fidelity.
Details
Motivation: Current evaluation focuses on final responses, but it's critical for large reasoning models to follow instructions throughout their reasoning process for better controllability, transparency, and to reduce risks like hallucinations and reward hacking.
Method: Created ReasonIF benchmark with six instruction categories (multilingual reasoning, formatting, length control) and evaluated open-source LRMs. Explored two improvement strategies: multi-turn reasoning and Reasoning Instruction Finetuning (RIF) using synthetic data.
Result: Substantial failures in reasoning instruction adherence - highest instruction following score remains below 0.25 (less than 25% compliance). Performance degrades further with increased task difficulty. RIF improved GPT-OSS-20B’s score from 0.11 to 0.27.
Conclusion: Current large reasoning models have poor reasoning instruction following capabilities, but targeted interventions like RIF show measurable improvement, though significant room for enhancement remains.
Abstract: The ability of large language models (LLMs) to follow user instructions is central to their reliability, safety, and usefulness. While prior studies assess instruction adherence in the model’s main responses, we argue that it is also critical for large reasoning models (LRMs) to follow user instructions throughout their reasoning process. Reasoning instruction following makes LRMs more controllable and transparent, while reducing risks of undesirable shortcuts, hallucinations, or reward hacking within reasoning traces. To evaluate this dimension, we introduce ReasonIF, a systematic benchmark for assessing reasoning instruction following. ReasonIF includes six categories of instruction prompts, spanning multilingual reasoning, formatting and length control. Across many open-source LRMs including GPT-OSS, Qwen3, and DeepSeek-R1, we find substantial failures in reasoning instruction adherence: the highest instruction following score (IFS) remains below 0.25, meaning that fewer than 25% of reasoning traces comply with the given instructions. Notably, as task difficulty increases, reasoning instruction following degrades further. We also explore two strategies to enhance reasoning instruction fidelity: (1) multi-turn reasoning and (2) Reasoning Instruction Finetuning (RIF) using synthetic data. RIF improves the IFS of GPT-OSS-20B from 0.11 to 0.27, indicating measurable progress but leaving ample room for improvement.
[311] Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential
Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen
Main category: cs.LG
TL;DR: The paper identifies that a model’s reasoning potential after RLVR depends on its pre-trained ability to distinguish sound from unsound knowledge, measured by the Soundness-Aware Level (SAL) metric.
Details
Motivation: To understand why reinforcement learning with verifiable rewards (RLVR) performance varies dramatically across different base models, and identify the microscopic property responsible.
Method: Formalize reasoning as chains of Horn clauses built from features extracted via cross-layer sparse autoencoders, estimate transition probabilities, categorize rules by semantic soundness levels, and introduce SAL metric using Jensen-Shannon Divergence.
Result: High-potential models are soundness-aware with distinct probability distributions for different soundness levels, while weaker models are soundness-agnostic. SAL predicts post-RLVR reasoning performance with R^2=0.87 across diverse models and scales.
Conclusion: Reasoning potential is tied to intrinsic pre-trained ability to distinguish sound from unsound knowledge, highlighting the critical role of pre-training and offering a practical metric for model selection.
Abstract: Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses (“if-then” rules) built from features extracted from the LLM’s latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules’ soundness levels, becoming highly distinct for “strict” versus “noisy” rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL’s predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model’s reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound ones. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model’s internal mechanisms for selecting/designing stronger base models.
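A toy illustration of computing a SAL-style separation score: in the paper, the two distributions come from transition probabilities of rules at different soundness levels extracted via cross-layer SAEs; here they are replaced by synthetic samples. Note that SciPy returns the Jensen-Shannon distance (the square root of the divergence), so the code squares it.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
# Stand-ins for transition probabilities of "strict" vs. "noisy" rules.
strict_probs = rng.beta(8, 2, 5000)
noisy_probs = rng.beta(2, 8, 5000)

bins = np.linspace(0, 1, 21)
p, _ = np.histogram(strict_probs, bins=bins)
q, _ = np.histogram(noisy_probs, bins=bins)
p = p / p.sum()
q = q / q.sum()

sal = jensenshannon(p, q) ** 2   # JS divergence between soundness levels
print(f"SAL-style separation: {sal:.3f}")
```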
[312] Reflections from Research Roundtables at the Conference on Health, Inference, and Learning (CHIL) 2025
Emily Alsentzer, Marie-Laure Charpignon, Bill Chen, Niharika D’Souza, Jason Fries, Yixing Jiang, Aparajita Kashyap, Chanwoo Kim, Simon Lee, Aishwarya Mandyam, Ashery Christopher Mbilinyi, Nikita Mehandru, Nitish Nagesh, Brighton Nuwagira, Emma Pierson, Arvind Pillai, Akane Sano, Tanveer Syeda-Mahmood, Shashank Yadav, Elias Adhanom, Muhammad Umar Afza, Amelia Archer, Suhana Bedi, Vasiliki Bikia, Trenton Chang, George H. Chen, Winston Chen, Erica Chiang, Edward Choi, Octavia Ciora, Paz Dozie-Nnamah, Shaza Elsharief, Matthew Engelhard, Ali Eshragh, Jean Feng, Josh Fessel, Scott Fleming, Kei Sen Fong, Thomas Frost, Soham Gadgil, Judy Gichoya, Leeor Hershkovich, Sujeong Im, Bhavya Jain, Vincent Jeanselme, Furong Jia, Qixuan Jin, Yuxuan Jin, Daniel Kapash, Geetika Kapoor, Behdokht Kiafar, Matthias Kleiner, Stefan Kraft, Annika Kumar, Daeun Kyung, Zhongyuan Liang, Joanna Lin, Qianchu Liu, Chang Liu, Hongzhou Luan, Chris Lunt, Leopoldo Julían Lechuga López, Matthew B. A. McDermott, Shahriar Noroozizadeh, Connor O’Brien, YongKyung Oh, Mixail Ota, Stephen Pfohl, Meagan Pi, Tanmoy Sarkar Pias, Emma Rocheteau, Avishaan Sethi, Toru Shirakawa, Anita Silver, Neha Simha, Kamile Stankeviciute, Max Sunog, Peter Szolovits, Shengpu Tang, Jialu Tang, Aaron Tierney, John Valdovinos, Byron Wallace, Will Ke Wang, Peter Washington, Jeremy Weiss, Daniel Wolfe, Emily Wong, Hye Sun Yun, Xiaoman Zhang, Xiao Yu Cindy Zhang, Hayoung Jeong, Kaveri A. Thakoor
Main category: cs.LG
TL;DR: The CHIL 2025 conference featured Research Roundtables on key ML-healthcare topics including explainability, fairness, causality, and foundation models.
Details
Motivation: To foster collaborative dialogue and address critical challenges at the intersection of machine learning and healthcare through small-group discussions.
Method: Hosted eight Research Roundtables moderated by senior and junior chairs, focusing on rigorous discussion, exploration of opportunities, and collective ideation.
Result: Eight roundtables were successfully conducted on topics covering explainability, uncertainty/bias/fairness, causality, domain adaptation, foundation models, small medical data learning, multimodal methods, and scalable healthcare solutions.
Conclusion: The Research Roundtables successfully catalyzed collaborative dialogue and collective ideation toward actionable directions in health ML research.
Abstract: The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year’s program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, exploration of emerging opportunities, and collective ideation toward actionable directions in the field. In total, eight roundtables were held by 19 roundtable chairs on topics of “Explainability, Interpretability, and Transparency,” “Uncertainty, Bias, and Fairness,” “Causality,” “Domain Adaptation,” “Foundation Models,” “Learning from Small Medical Data,” “Multimodal Methods,” and “Scalable, Translational Healthcare Solutions.”
[313] Machine Learning for Early Detection of Meningitis: Stacked Ensemble Learning with EHR data
Han Ouyang, Jesse Hamilton, Saeed Amal
Main category: cs.LG
TL;DR: The paper develops an ensemble learning model using Random Forest, LightGBM, and DNN to diagnose meningitis from MIMIC-III data, achieving high AUC scores (0.9637 and 0.9472) on test sets.
Details
Motivation: To create an AI-driven diagnostic tool for meningitis that simulates real-world ER scenarios, enhancing clinical applicability despite current deployment challenges.
Method: Used MIMIC-III database with 214 meningitis and 46,303 non-meningitis patients. Applied data preprocessing, feature selection (gender, high-risk ICD codes), and ensemble learning with three base models aggregated through logistic regression meta-model.
Result: Excellent performance with AUC of 0.9637 on Testing Set 1 and 0.9472 on Testing Set 2, demonstrating high diagnostic accuracy for meningitis.
Conclusion: The ensemble learning approach shows promise for future AI-driven meningitis diagnosis, though direct clinical deployment remains challenging; the study establishes a foundation for such applications.
Abstract: We utilized a cohort of 214 meningitis patients and 46,303 non-meningitis patients from the MIMIC-III database. After extensive data preprocessing, which included ICD-based cohort selection, one-hot encoding of codes, and a two-stage feature selection process (for both the training set and the testing sets), clinically relevant features such as gender and high-risk ICD codes (including subarachnoid hemorrhage, secondary malignant neoplasm of the brain, and generalized epilepsy) are selected. Overall, these clinically reasonable and temporally adherent features provided excellent modeling performance. Three models (Random Forest, LightGBM, and Deep Neural Networks (DNN)) are trained as base models for Ensemble Learning. Base model outputs are aggregated and stacked into a meta-model (Logistic Regression) that uses the base model outputs as input values in training. Ultimately, strong results (AUC of Testing Set 1: 0.9637, AUC of Testing Set 2: 0.9472) are obtained through ensemble learning. We created a challenging condition for diagnosing meningitis, simulating a real-world ER (Emergency Room) scenario to enhance clinical use in real-world applications. While directly deploying a diagnostic tool that clinicians can use is challenging, this paper paves the way for a potential future AI-driven diagnostic approach for meningitis using Ensemble Learning.
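A compact scikit-learn sketch of the stacking setup (HistGradientBoostingClassifier stands in for LightGBM so the example stays dependency-free, and the synthetic data only mimics the cohort's class imbalance):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the cohort: heavy class imbalance, as in MIMIC-III.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.99],
                           random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gbm", HistGradientBoostingClassifier(random_state=0)),
        ("dnn", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                              random_state=0)),
    ],
    final_estimator=LogisticRegression(),   # meta-model over base outputs
    stack_method="predict_proba",
)
stack.fit(X, y)
```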
[314] Integrating Product Coefficients for Improved 3D LiDAR Data Classification (Part II)
Patricia Medina, Rasika Karkare
Main category: cs.LG
TL;DR: This paper extends previous work on 3D LiDAR classification by combining product coefficients with autoencoder representations and KNN classifiers, showing consistent performance improvements over PCA baselines and earlier methods.
Details
Motivation: To enhance 3D LiDAR point-cloud classification by building upon previous work with product coefficients, aiming to achieve better performance through improved feature representations.
Method: Combines product coefficients with autoencoder representations and KNN classifiers, systematically adding product coefficients level by level to analyze their impact on classification performance.
Result: The combination delivers consistent performance gains over PCA-based baselines and previous frameworks. Richer sets of product coefficients systematically improve class separability and overall accuracy.
Conclusion: Hierarchical product-coefficient features combined with autoencoders provide significant value for pushing LiDAR classification performance further.
Abstract: This work extends our previous study on enhancing 3D LiDAR point-cloud classification with product coefficients \cite{medina2025integratingproductcoefficientsimproved}, measure-theoretic descriptors that complement the original spatial Lidar features. Here, we show that combining product coefficients with an autoencoder representation and a KNN classifier delivers consistent performance gains over both PCA-based baselines and our earlier framework. We also investigate the effect of adding product coefficients level by level, revealing a clear trend: richer sets of coefficients systematically improve class separability and overall accuracy. The results highlight the value of combining hierarchical product-coefficient features with autoencoders to push LiDAR classification performance further.
[315] Stress-Aware Learning under KL Drift via Trust-Decayed Mirror Descent
Gabriel Nixon Raj
Main category: cs.LG
TL;DR: The paper proposes entropy-regularized trust-decay for sequential decision-making under distribution drift, achieving O(√T) dynamic regret under KL-drift path length constraints with stress-aware exponential tilting in belief updates and decisions.
Details
Motivation: To address sequential decision-making problems where the underlying data distribution changes over time (distribution drift), requiring robust algorithms that can adapt to uncertainty and maintain performance guarantees.
Method: Entropy-regularized trust-decay with stress-aware exponential tilting in both belief updates and mirror-descent decisions, using Fenchel-dual equivalence on the simplex. Includes parameter-free hedge adaptation, calibrated-stress bounds, and extensions to various settings.
Result: Achieves O(√T) dynamic regret under KL-drift path length S_T = ∑_{t≥2} √(KL(D_t‖D_{t−1})/2), with O(1) per-switch regret. Provides high-probability sensitivity bounds and shows stress-free updates incur Ω(1) tails.
Conclusion: The framework unifies dynamic-regret analysis, distributionally robust objectives, and KL-regularized control within a single stress-adaptive update, providing robust performance guarantees under distribution drift.
Abstract: We study sequential decision-making under distribution drift. We propose entropy-regularized trust-decay, which injects stress-aware exponential tilting into both belief updates and mirror-descent decisions. On the simplex, a Fenchel-dual equivalence shows that belief tilt and decision tilt coincide. We formalize robustness via fragility (worst-case excess risk in a KL ball), belief bandwidth (radius sustaining a target excess), and a decision-space Fragility Index (drift tolerated at $O(\sqrt{T})$ regret). We prove high-probability sensitivity bounds and establish dynamic-regret guarantees of $\tilde{O}(\sqrt{T})$ under KL-drift path length $S_T = \sum_{t\ge2}\sqrt{{\rm KL}(D_t|D_{t-1})/2}$. In particular, trust-decay achieves $O(1)$ per-switch regret, while stress-free updates incur $\Omega(1)$ tails. A parameter-free hedge adapts the tilt to unknown drift, whereas persistent over-tilting yields an $\Omega(\lambda^2 T)$ stationary penalty. We further obtain calibrated-stress bounds and extensions to second-order updates, bandit feedback, outliers, stress variation, distributed optimization, and plug-in KL-drift estimation. The framework unifies dynamic-regret analysis, distributionally robust objectives, and KL-regularized control within a single stress-adaptive update.
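A schematic of the core update on the simplex, compressing the paper's calibrated-stress machinery into a single trust-decay knob: the (1 − λ) factor geometrically shrinks the current belief toward uniform before the usual exponentiated-gradient tilt.

```python
import numpy as np

def trust_decay_step(p, grad, eta=0.1, lam=0.05):
    """One schematic update: decay trust in the current belief by
    geometrically shrinking it toward uniform (the (1 - lam) factor),
    then apply the standard exponentiated-gradient tilt."""
    logits = (1 - lam) * np.log(p) - eta * grad
    w = np.exp(logits - logits.max())        # numerically stable softmax
    return w / w.sum()

p = np.full(5, 0.2)                          # start at the uniform belief
for t in range(100):
    grad = np.random.normal(size=5)          # stand-in loss gradients
    p = trust_decay_step(p, grad)
```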
[316] FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
Tiansheng Hu, Tongyan Hu, Liuyang Bai, Yilun Zhao, Arman Cohan, Chen Zhao
Main category: cs.LG
TL;DR: FinTrust is a comprehensive benchmark for evaluating LLM trustworthiness in finance applications, assessing 11 LLMs across multiple alignment dimensions including safety, fairness, and legal awareness.
Details
Motivation: Applying LLMs in real-world finance is challenging due to high risks and stakes, requiring comprehensive trustworthiness evaluation beyond basic capabilities.
Method: Developed FinTrust benchmark with fine-grained tasks across multiple trustworthiness dimensions based on practical finance contexts, then evaluated 11 LLMs including proprietary and open-source models.
Result: Proprietary models like o4-mini excel in safety tasks, while open-source models like DeepSeek-V3 show advantages in industry-level fairness. All models struggle with fiduciary alignment and disclosure tasks, indicating significant legal awareness gaps.
Conclusion: FinTrust provides a valuable benchmark for evaluating LLM trustworthiness in finance, revealing current limitations in legal compliance and highlighting the need for specialized development in financial applications.
Abstract: Recent LLMs have demonstrated promising ability in solving finance-related problems. However, applying LLMs in real-world finance applications remains challenging due to their high-risk and high-stakes nature. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues based on practical context and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperform in most tasks such as safety while open-source models like DeepSeek-V3 have an advantage in specific areas like industry-level fairness. For challenging tasks like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for LLMs’ trustworthiness evaluation in the finance domain.
[317] Adaptive Individual Uncertainty under Out-Of-Distribution Shift with Expert-Routed Conformal Prediction
Amitesh Badkul, Lei Xie
Main category: cs.LG
TL;DR: TESSERA is a novel uncertainty quantification method that provides reliable per-sample uncertainty with coverage guarantees, adaptive prediction intervals that track absolute error, and maintains competitive performance under distribution shifts in protein-ligand affinity prediction.
Details
Motivation: Current ML methods lack reliable uncertainty quantification, especially for risk-sensitive domains like drug discovery where protein-ligand affinity prediction faces challenges with heterogeneous assay noise, imbalanced chemical space, and distribution shifts.
Method: TESSERA unifies Mixture of Expert (MoE) diversity with conformal calibration to provide per-sample uncertainty with reliable coverage guarantees and adaptive prediction intervals that track absolute error.
Result: TESSERA achieves near-nominal coverage and the best coverage-width trade-off (measured by CWC) while maintaining competitive adaptivity (lowest AUSE) under both i.i.d. and scaffold-based OOD splits in protein-ligand binding affinity prediction.
Conclusion: TESSERA delivers trustworthy, tight, and adaptive uncertainties suitable for selective prediction and downstream decision-making in drug discovery and other applications by combining MoE diversity with conformal calibration.
Abstract: Reliable, informative, and individual uncertainty quantification (UQ) remains missing in current ML practice. This hinders the effective application of AI/ML to risk-sensitive domains. Most methods either fail to provide coverage on new data, inflate intervals so broadly that they are not actionable, or assign uncertainties that do not track actual error, especially under a distribution shift. In high-stakes drug discovery, protein-ligand affinity (PLI) prediction is especially challenging as assay noise is heterogeneous, chemical space is imbalanced and large, and practical evaluations routinely involve distribution shift. In this work, we introduce a novel uncertainty quantification method, Trustworthy Expert Split-conformal with Scaled Estimation for Efficient Reliable Adaptive intervals (TESSERA), that provides per-sample uncertainty with a reliable coverage guarantee and informative, adaptive prediction interval widths that track the absolute error. We evaluate on protein-ligand binding affinity prediction under both independent and identically distributed (i.i.d.) and scaffold-based out-of-distribution (OOD) splits, comparing against strong UQ baselines. TESSERA attains near-nominal coverage and the best coverage-width trade-off as measured by the Coverage-Width Criterion (CWC), while maintaining competitive adaptivity (lowest Area Under the Sparsification Error (AUSE)). Size-Stratified Coverage (SSC) further confirms that intervals are right-sized, widening when data are scarce or noisy and remaining tight when predictions are reliable. By unifying Mixture of Expert (MoE) diversity with conformal calibration, TESSERA delivers trustworthy, tight, and adaptive uncertainties that are well-suited to selective prediction and downstream decision-making in the drug-discovery pipeline and other applications.
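TESSERA builds on the generic normalized split-conformal recipe sketched below; in the full method the per-sample scale sigma comes from mixture-of-experts disagreement, which this toy replaces with given values.

```python
import numpy as np

def scaled_split_conformal(mu, sigma, mu_cal, sigma_cal, y_cal, alpha=0.1):
    """Normalized split conformal: calibrate the (1 - alpha) quantile of
    |y - mu| / sigma on held-out data, then scale it by each test point's
    sigma to get per-sample adaptive interval widths."""
    scores = np.abs(y_cal - mu_cal) / sigma_cal
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level)
    return mu - q * sigma, mu + q * sigma

rng = np.random.default_rng(0)
mu_cal = rng.normal(size=200)
sigma_cal = rng.uniform(0.5, 2.0, 200)      # stand-in for MoE disagreement
y_cal = mu_cal + sigma_cal * rng.normal(size=200)
lo, hi = scaled_split_conformal(rng.normal(size=5), rng.uniform(0.5, 2.0, 5),
                                mu_cal, sigma_cal, y_cal)
```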
[318] Dual-Weighted Reinforcement Learning for Generative Preference Modeling
Shengyu Feng, Yun He, Shuang Ma, Beibin Li, Yuanhao Xiong, Vincent Li, Karishma Mandyam, Julian Katz-Samuels, Shengjie Bi, Licheng Yu, Hejia Zhang, Karthik Abinav Sankararaman, Han Fang, Riham Mansour, Yiming Yang, Manaal Faruqui
Main category: cs.LG
TL;DR: DWRL is a new reinforcement learning framework that integrates chain-of-thought reasoning with preference modeling using dual weights to handle non-verifiable tasks with human preference pairs.
Details
Motivation: Extending RL to non-verifiable tasks with human preference pairs remains challenging and underexplored, while current methods mainly focus on tasks with verifiable answers.
Method: Dual-weighted RL objective that combines instance-wise misalignment weight and group-wise conditional preference score, training generative preference models to generate thoughts and predict human preference scores.
Result: DWRL consistently outperforms both GPM baselines and scalar models across multiple benchmarks and model scales (Llama3 and Qwen2.5), producing coherent and interpretable thoughts.
Conclusion: DWRL serves as a general framework for reasoning-enhanced preference learning that extends beyond verifiable tasks to handle human preference modeling.
Abstract: Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models on tasks with verifiable answers. However, extending RL to more general non-verifiable tasks, typically in the format of human preference pairs, remains both challenging and underexplored. In this work, we propose Dual-Weighted Reinforcement Learning (DWRL), a new framework for preference modeling that integrates CoT reasoning with the Bradley-Terry (BT) model via a dual-weighted RL objective that preserves preference-modeling inductive bias. DWRL approximates the maximum-likelihood objective of the BT model with two complementary weights: an instance-wise misalignment weight, which emphasizes under-trained pairs misaligned with human preference, and a group-wise (self-normalized) conditional preference score, which promotes promising thoughts. In this paper, we apply DWRL to preference modeling by training generative preference models (GPMs) to first generate a thought and then predict the human preference score. Across multiple benchmarks and model scales (Llama3 and Qwen2.5), DWRL consistently outperforms both GPM baselines and scalar models, while producing coherent, interpretable thoughts. In summary, our results position DWRL as a general framework for reasoning-enhanced preference learning beyond verifiable tasks.
[319] Reinforcement Learning with Stochastic Reward Machines
Jan Corazza, Ivan Gavran, Daniel Neider
Main category: cs.LG
TL;DR: Introduces stochastic reward machines to handle noisy rewards in reinforcement learning, with a constraint-solving algorithm that learns minimal models and guarantees optimal policy convergence.
Details
Motivation: Existing reward machine learning algorithms assume noise-free rewards, which is impractical in real-world scenarios with noisy reward functions.
Method: Proposes stochastic reward machines and a constraint-solving algorithm that learns minimal models from RL agent explorations, compatible with existing RM algorithms.
Result: The algorithm outperforms existing methods and naive approaches for handling noisy rewards in two case studies.
Conclusion: Stochastic reward machines effectively handle noisy reward functions in RL, with the proposed algorithm providing convergence guarantees and practical performance improvements.
Abstract: Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent. This algorithm can easily be paired with existing reinforcement learning algorithms for reward machines and guarantees to converge to an optimal policy in the limit. We demonstrate the effectiveness of our algorithm in two case studies and show that it outperforms both existing methods and a naive approach for handling noisy reward functions.
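One plausible minimal encoding of the object being learned, assuming Gaussian reward noise for concreteness (the paper's algorithm learns such machines from agent explorations via constraint solving, which is not shown here):

```python
import random

class StochasticRewardMachine:
    """A finite automaton over high-level events whose transitions emit
    reward *distributions* (Gaussian here, for concreteness) instead of
    fixed reward values."""
    def __init__(self):
        # (state, event) -> (next_state, reward_mean, reward_std)
        self.delta = {
            ("u0", "got_key"): ("u1", 0.0, 0.1),
            ("u1", "opened_door"): ("u2", 1.0, 0.1),
        }
        self.state = "u0"

    def step(self, event):
        nxt, mu, sigma = self.delta.get((self.state, event),
                                        (self.state, 0.0, 0.1))
        self.state = nxt
        return random.gauss(mu, sigma)   # noisy reward sample
```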
[320] Spatiotemporal Transformers for Predicting Avian Disease Risk from Migration Trajectories
Dingya Feng, Dingyuan Xue
Main category: cs.LG
TL;DR: Transformer-based framework for predicting disease risk at migratory bird trajectory endpoints using GPS tracking, outbreak records, and geospatial data.
Details
Motivation: Accurate forecasting of avian disease outbreaks is critical for wildlife conservation and public health.
Method: Integrates multi-source datasets (GPS tracking, outbreak records, geospatial context) using H3 hierarchical geospatial encoding and Transformer architecture to learn spatiotemporal dependencies from bird movement sequences.
Result: Achieved accuracy of 0.9821, AUC of 0.9803, average precision of 0.9299, and F1-score of 0.8836 on test set.
Conclusion: Transformer architectures show strong potential for early-warning systems in avian disease surveillance, enabling timely intervention and prevention strategies.
Abstract: Accurate forecasting of avian disease outbreaks is critical for wildlife conservation and public health. This study presents a Transformer-based framework for predicting the disease risk at the terminal locations of migratory bird trajectories. We integrate multi-source datasets, including GPS tracking data from Movebank, outbreak records from the World Organisation for Animal Health (WOAH), and geospatial context from GADM and Natural Earth. The raw coordinates are processed using H3 hierarchical geospatial encoding to capture spatial patterns. The model learns spatiotemporal dependencies from bird movement sequences to estimate endpoint disease risk. Evaluation on a held-out test set demonstrates strong predictive performance, achieving an accuracy of 0.9821, area under the ROC curve (AUC) of 0.9803, average precision (AP) of 0.9299, and an F1-score of 0.8836 at the optimal threshold. These results highlight the potential of Transformer architectures to support early-warning systems for avian disease surveillance, enabling timely intervention and prevention strategies.
[321] DRO-InstructZero: Distributionally Robust Prompt Optimization for Large Language Models
Yangyang Li
Main category: cs.LG
TL;DR: DRO-InstructZero uses robust Bayesian optimization to create prompts that maintain performance under distribution shifts, improving reliability across tasks like formality rewriting and code debugging.
Details
Motivation: Existing prompt search methods degrade under distribution shift and adversarial evaluation because they optimize for a single evaluation distribution, leading to prompts that don't transfer well.
Method: Formulates zero-shot prompt optimization as robust Bayesian optimization using f-divergence balls to define ambiguity sets around evaluation distributions, with a robust acquisition rule maximizing worst-case expected utility.
Result: Achieved significant improvements: formality rewriting accuracy increased from 61.3% to 85-90% (+25-30 points), code debugging gained ~25 points under domain shift, while maintaining >96% on stable tasks like cause-and-effect.
Conclusion: DRO-InstructZero connects distributionally robust optimization with prompt learning, providing a plug-and-play approach for reliable prompt alignment under real-world uncertainty.
Abstract: Large language models are highly sensitive to prompt wording. However, popular automatic prompt search methods, including InstructZero, often degrade under distribution shift and adversarial evaluation because they optimize expected performance under a single evaluation distribution. Consequently, prompts that work in one setting frequently fail to transfer. To address this, DRO-InstructZero formulates zero-shot prompt optimization as robust Bayesian optimization. Specifically, an f-divergence ball defines an ambiguity set around the evaluation distribution, and a robust acquisition rule maximizes worst-case expected utility while retaining the query efficiency of Bayesian search. Therefore, the search explicitly targets reliability under distribution shift rather than average behavior alone. Experiments follow the instruction-induction protocol with matched query budgets across formality rewriting, code debugging, and translation. For example, on BIG-Bench informative-to-formal rewriting, accuracy improves from 61.3 +/- 0.7% to approximately 85-90%, yielding an absolute gain of about 25-30 points. Moreover, auto-debugging shows about +25-point gains under domain shift. Meanwhile, stable tasks such as cause-and-effect remain above 96%, indicating no loss on in-distribution cases. Furthermore, improvements are consistent across divergence choices and decoding temperatures. Overall, DRO-InstructZero connects distributionally robust optimization with prompt learning, offering a plug-and-play and general approach for reliable, transferable prompt alignment under real-world uncertainty.
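For the KL-divergence instance of the ambiguity set, the worst-case expected loss admits a standard convex dual that a robust acquisition rule can use to score candidate prompts; this is an illustrative special case, as the paper works with general f-divergence balls inside Bayesian optimization.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_worst_case(losses, rho):
    """Worst-case expected loss over a KL ball of radius rho around the
    empirical evaluation distribution, via the dual
        inf_{lam > 0}  lam * log E[exp(loss / lam)] + lam * rho."""
    losses = np.asarray(losses, dtype=float)
    m = losses.max()

    def dual(lam):
        # lam * log-mean-exp(loss / lam), computed stably
        return lam * np.log(np.mean(np.exp((losses - m) / lam))) + m + lam * rho

    res = minimize_scalar(dual, bounds=(1e-3, 1e3), method="bounded")
    return res.fun

per_example_losses = np.random.rand(50)   # one prompt's losses on the eval set
print(kl_worst_case(per_example_losses, rho=0.1))
```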
[322] Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
Main category: cs.LG
TL;DR: The paper introduces a weight-decay scaling rule for AdamW that enables zero-shot transfer of learning rates and weight decay across different model widths, extending μP beyond the near-init regime by controlling optimizer-governed steady-state scales.
Details
Motivation: Modern scale-invariant architectures quickly enter an optimizer-governed steady state where normalization layers create backward scale sensitivity, making the effective learning rate width-dependent and degrading μP transfer capabilities.
Method: The authors propose a weight-decay scaling rule λ₂ ∝ √d for matrix-like parameters combined with vector-like parameters trained at constant learning rate and zero weight decay. This is based on empirical observations of singular-value spectrum scaling and maintains sublayer gain invariance across widths.
Result: The proposed method enables zero-shot transfer of both learning rate and weight decay from proxy to target widths, eliminating the need for per-width hyperparameter sweeps. Validation on LLaMA-style Transformers and synthetic settings confirms the effectiveness.
Conclusion: The work extends μP beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, providing a practical recipe for width-robust hyperparameter transfer under AdamW.
Abstract: Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($\mu$P) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading $\mu$P transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\sqrt{\eta/\lambda}$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\sqrt{\eta/\lambda}\cdot d^{0.75}$. Combining this observation with the $\mu$P learning-rate rule $\eta_2\propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $\lambda_2\propto \sqrt{d}$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $\eta_1=\Theta_d(1)$ and $\lambda_1=0$, this yields \emph{zero-shot} transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend $\mu$P beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
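The transfer rules themselves are one-liners; a sketch of applying them when moving from a tuned proxy width to a target width (the constants are illustrative):

```python
def scale_hparams(eta2_proxy, lam2_proxy, d_proxy, d_target):
    """Zero-shot AdamW transfer for matrix-like parameters:
    eta_2 scales as 1/d (the muP rule) and lambda_2 as sqrt(d).
    Vector-like parameters keep a constant eta_1 with lambda_1 = 0."""
    ratio = d_target / d_proxy
    return eta2_proxy / ratio, lam2_proxy * ratio**0.5

# Tuned at proxy width 256, transferred to target width 4096 (toy numbers).
eta2, lam2 = scale_hparams(eta2_proxy=3e-3, lam2_proxy=0.1,
                           d_proxy=256, d_target=4096)
print(eta2, lam2)
```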
[323] Causal Time Series Modeling of Supraglacial Lake Evolution in Greenland under Distribution Shift
Emam Hossain, Muhammad Hasan Ferdous, Devon Dunmire, Aneesh Subramanian, Md Osman Gani
Main category: cs.LG
TL;DR: RIC-TSC framework embeds causal discovery into time-series classification for Earth observation, achieving 12.59% higher accuracy than correlation-based methods under distribution shifts.
Details
Motivation: Current spatiotemporal Earth observation models rely on correlational features that fail to transfer across heterogeneous domains, while causal modeling offers principled foundation for stable, invariant relationships.
Method: Propose RIC-TSC framework using Joint PCMCI+ for causal discovery with multi-modal satellite data (Sentinel-1, Sentinel-2, Landsat-8, CARRA), estimating causal graphs globally and per basin, then feeding validated predictors to lightweight classifiers.
Result: On 1000 manually labeled lakes from contrasting melt seasons (2018-2019), causal models achieved up to 12.59% higher accuracy than correlation-based baselines under out-of-distribution evaluation.
Conclusion: Causal discovery is not only a feature selection method but also enables generalizable and mechanistically grounded models of dynamic Earth surface processes.
Abstract: Causal modeling offers a principled foundation for uncovering stable, invariant relationships in time-series data, thereby improving robustness and generalization under distribution shifts. Yet its potential is underutilized in spatiotemporal Earth observation, where models often depend on purely correlational features that fail to transfer across heterogeneous domains. We propose RIC-TSC, a regionally-informed causal time-series classification framework that embeds lag-aware causal discovery directly into sequence modeling, enabling both predictive accuracy and scientific interpretability. Using multi-modal satellite and reanalysis data-including Sentinel-1 microwave backscatter, Sentinel-2 and Landsat-8 optical reflectance, and CARRA meteorological variables-we leverage Joint PCMCI+ (J-PCMCI+) to identify region-specific and invariant predictors of supraglacial lake evolution in Greenland. Causal graphs are estimated globally and per basin, with validated predictors and their time lags supplied to lightweight classifiers. On a balanced benchmark of 1000 manually labeled lakes from two contrasting melt seasons (2018-2019), causal models achieve up to 12.59% higher accuracy than correlation-based baselines under out-of-distribution evaluation. These results show that causal discovery is not only a means of feature selection but also a pathway to generalizable and mechanistically grounded models of dynamic Earth surface processes.
[324] Semi-Supervised Regression with Heteroscedastic Pseudo-Labels
Xueqing Sun, Renzhen Wang, Quanziang Wang, Yichen Wu, Xixi Jia, Deyu Meng
Main category: cs.LG
TL;DR: Proposes an uncertainty-aware pseudo-labeling framework for semi-supervised regression that dynamically adjusts pseudo-label influence through bi-level optimization to handle continuous outputs with heteroscedastic noise.
Details
Motivation: Pseudo-labeling in semi-supervised regression is challenging due to continuous outputs with heteroscedastic noise, making it difficult to assess pseudo-label reliability, which can lead to error accumulation and overfitting to incorrect labels.
Method: Uncertainty-aware pseudo-labeling framework using bi-level optimization that jointly minimizes empirical risk over all data while optimizing uncertainty estimates to enhance generalization on labeled data.
Result: Extensive experiments on benchmark SSR datasets demonstrate superior robustness and performance compared to existing methods.
Conclusion: The proposed uncertainty-aware pseudo-labeling framework effectively mitigates the impact of unreliable pseudo-labels in semi-supervised regression through dynamic adjustment of pseudo-label influence.
Abstract: Pseudo-labeling is a commonly used paradigm in semi-supervised learning, yet its application to semi-supervised regression (SSR) remains relatively under-explored. Unlike classification, where pseudo-labels are discrete and confidence-based filtering is effective, SSR involves continuous outputs with heteroscedastic noise, making it challenging to assess pseudo-label reliability. As a result, naive pseudo-labeling can lead to error accumulation and overfitting to incorrect labels. To address this, we propose an uncertainty-aware pseudo-labeling framework that dynamically adjusts pseudo-label influence from a bi-level optimization perspective. By jointly minimizing empirical risk over all data and optimizing uncertainty estimates to enhance generalization on labeled data, our method effectively mitigates the impact of unreliable pseudo-labels. We provide theoretical insights and extensive experiments to validate our approach across various benchmark SSR datasets, and the results demonstrate superior robustness and performance compared to existing methods. Our code is available at https://github.com/sxq/Heteroscedastic-Pseudo-Labels.
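Schematically, the bi-level program has the following shape (a generic form; the paper's exact heteroscedastic weighting of pseudo-labels may differ):

```latex
% Inner problem: fit theta with uncertainty-weighted pseudo-labels.
\theta^{*}(\phi) = \arg\min_{\theta}
  \sum_{i \in \mathcal{L}} \ell\!\left(f_{\theta}(x_i), y_i\right)
  + \sum_{j \in \mathcal{U}} w_j(\phi)\, \ell\!\left(f_{\theta}(x_j), \tilde{y}_j\right)
% Outer problem: tune the uncertainty estimates phi on labeled data.
\phi^{*} = \arg\min_{\phi}
  \sum_{i \in \mathcal{L}} \ell\!\left(f_{\theta^{*}(\phi)}(x_i), y_i\right)
```

Here $w_j(\phi)$ down-weights pseudo-labels $\tilde{y}_j$ with high estimated uncertainty, so unreliable targets exert less influence on the inner fit.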
[325] Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition
Fan Liu, Jindong Han, Tengfei Lyu, Weijia Zhang, Zhe-Rui Yang, Lu Dai, Cancheng Liu, Hao Liu
Main category: cs.LG
TL;DR: Foundation models are transforming scientific research through a three-stage evolution from enhancement to autonomous discovery, potentially redefining how science is conducted.
Details
Motivation: To explore whether foundation models are merely enhancing existing scientific methodologies or fundamentally redefining the way science is conducted, addressing the transformative potential of FMs in scientific research.Method: Introduces a three-stage framework: (1) Meta-Scientific Integration - FMs enhance traditional workflows; (2) Hybrid Human-AI Co-Creation - FMs become active collaborators; (3) Autonomous Scientific Discovery - FMs operate as independent agents with minimal human intervention.
Result: The paper reviews current applications and emerging capabilities of FMs across scientific paradigms, identifying their transformative role in accelerating tasks like hypothesis generation, experimental design, and result interpretation.
Conclusion: Foundation models are catalyzing a transition toward a new scientific paradigm, requiring the scientific community to understand their transformative role and reflect on the future of scientific discovery.
Abstract: Foundation models (FMs), such as GPT-4 and AlphaFold, are reshaping the landscape of scientific research. Beyond accelerating tasks such as hypothesis generation, experimental design, and result interpretation, they prompt a more fundamental question: Are FMs merely enhancing existing scientific methodologies, or are they redefining the way science is conducted? In this paper, we argue that FMs are catalyzing a transition toward a new scientific paradigm. We introduce a three-stage framework to describe this evolution: (1) Meta-Scientific Integration, where FMs enhance workflows within traditional paradigms; (2) Hybrid Human-AI Co-Creation, where FMs become active collaborators in problem formulation, reasoning, and discovery; and (3) Autonomous Scientific Discovery, where FMs operate as independent agents capable of generating new scientific knowledge with minimal human intervention. Through this lens, we review current applications and emerging capabilities of FMs across existing scientific paradigms. We further identify risks and future directions for FM-enabled scientific discovery. This position paper aims to support the scientific community in understanding the transformative role of FMs and to foster reflection on the future of scientific discovery. Our project is available at https://github.com/usail-hkust/Awesome-Foundation-Models-for-Scientific-Discovery.
[326] Small Ensemble-based Data Assimilation: A Machine Learning-Enhanced Data Assimilation Method with Limited Ensemble Size
Zhilin Li, Zhou Yao, Xianglong Li, Zeng Liu, Zhaokuan Lu, Shanlin Xu, Seungnam Kim, Guangyao Wang
Main category: cs.LG
TL;DR: A novel machine learning-based data assimilation method that combines ensemble Kalman filter with neural networks to improve accuracy without significant computational cost increase.
Details
Motivation: Address the trade-off between analysis accuracy and computational efficiency in ensemble-based data assimilation methods, where larger ensemble sizes for higher accuracy lead to greater computational costs.Method: Use a small-ensemble EnKF to generate preliminary analysis states, then employ a fully connected neural network to learn and predict correction terms, mitigating the performance degradation caused by the limited ensemble size.
Result: Achieves higher accuracy than traditional EnKF with same ensemble size in Lorenz systems and nonlinear ocean wave field simulations, with negligible additional computational cost.
Conclusion: The EnKF-FCNN method effectively improves data assimilation performance and is adaptable to diverse applications through coupling with different models and alternative ensemble-based DA methods.
Abstract: Ensemble-based data assimilation (DA) methods have become increasingly popular due to their inherent ability to address nonlinear dynamic problems. However, these methods often face a trade-off between analysis accuracy and computational efficiency, as larger ensemble sizes required for higher accuracy also lead to greater computational cost. In this study, we propose a novel machine learning-based data assimilation approach that combines the traditional ensemble Kalman filter (EnKF) with a fully connected neural network (FCNN). Specifically, our method uses a relatively small ensemble size to generate preliminary yet suboptimal analysis states via EnKF. An FCNN is then employed to learn and predict correction terms for these states, thereby mitigating the performance degradation induced by the limited ensemble size. We evaluate the performance of our proposed EnKF-FCNN method through numerical experiments involving Lorenz systems and nonlinear ocean wave field simulations. The results consistently demonstrate that the new method achieves higher accuracy than traditional EnKF with the same ensemble size, while incurring negligible additional computational cost. Moreover, the EnKF-FCNN method is adaptable to diverse applications through coupling with different models and the use of alternative ensemble-based DA methods.
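A toy sketch of the two stages described above: a stochastic small-ensemble EnKF analysis step, followed by a fully connected network that predicts a correction term. Shapes, the network size, and the missing training loop are all illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def enkf_analysis(ens, y_obs, H, R, rng):
    """One stochastic-EnKF update; ens has shape (state_dim, n_members)."""
    A = ens - ens.mean(axis=1, keepdims=True)
    Pf = A @ A.T / (ens.shape[1] - 1)                      # forecast covariance
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)         # Kalman gain
    y_pert = y_obs[:, None] + rng.multivariate_normal(
        np.zeros(len(y_obs)), R, size=ens.shape[1]).T      # perturbed obs
    return ens + K @ (y_pert - H @ ens)

rng = np.random.default_rng(0)
state_dim, n_members = 3, 5                                # deliberately small ensemble
ens = rng.normal(size=(state_dim, n_members))
H, R = np.eye(state_dim), 0.1 * np.eye(state_dim)
analysis = enkf_analysis(ens, rng.normal(size=state_dim), H, R, rng)

# FCNN that learns correction terms for the suboptimal analysis mean;
# in the paper it would be trained against high-quality reference states.
corrector = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, state_dim))
mean = torch.tensor(analysis.mean(axis=1), dtype=torch.float32)
corrected = mean + corrector(mean)
```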
[327] Identifying internal patterns in (1+1)-dimensional directed percolation using neural networks
Danil Parkhomenko, Pavel Ovchinnikov, Konstantin Soldatov, Vitalii Kapitan, Gennady Y. Chitov
Main category: cs.LG
TL;DR: Neural network method for automatic detection of phase transitions and classification of hidden percolation patterns in (1+1)-dimensional replication processes
Details
Motivation: To develop an automated approach for detecting phase transitions and classifying hidden percolation patterns without manual feature extraction.Method: Combination of CNN, TCN and GRU networks trained directly on raw configurations
Result: The network successfully reproduces the phase diagram and assigns phase labels to configurations
Conclusion: Deep architectures can effectively extract hierarchical structures from raw numerical experiment data
Abstract: In this paper we present a neural network-based method for the automatic detection of phase transitions and classification of hidden percolation patterns in a (1+1)-dimensional replication process. The proposed network model is based on the combination of CNN, TCN and GRU networks, which are trained directly on raw configurations without any manual feature extraction. The network reproduces the phase diagram and assigns phase labels to configurations. It shows that deep architectures are capable of extracting hierarchical structures from the raw data of numerical experiments.
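A simplified sketch of a phase classifier over raw (1+1)-dimensional configurations: a 1-D convolution extracts local spatial structure and a GRU aggregates it along the time direction. The paper's full CNN+TCN+GRU combination and training setup are not reproduced; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class PhaseClassifier(nn.Module):
    def __init__(self, hidden=32, n_phases=2):
        super().__init__()
        self.conv = nn.Conv1d(1, 16, kernel_size=3, padding=1)
        self.gru = nn.GRU(input_size=16, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_phases)

    def forward(self, configs):                     # (batch, time, width)
        b, t, w = configs.shape
        x = self.conv(configs.reshape(b * t, 1, w))  # local spatial features
        x = x.mean(dim=2).reshape(b, t, -1)          # pool space, keep time axis
        _, h = self.gru(x)                           # aggregate over time
        return self.head(h[-1])                      # phase logits

logits = PhaseClassifier()(torch.rand(4, 50, 64))    # 4 raw toy configurations
```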
[328] DFCA: Decentralized Federated Clustering Algorithm
Jonas Kirch, Sebastian Becker, Tiago Koketsu Rodrigues, Stefan Harmeling
Main category: cs.LG
TL;DR: DFCA is a fully decentralized clustered federated learning algorithm that eliminates the need for a central server, enabling clients to collaboratively train cluster-specific models through sequential running average aggregation from neighbors.
Details
Motivation: Existing clustered FL methods like IFCA rely on central servers, creating bottlenecks and single points of failure, limiting their applicability in realistic decentralized settings.Method: DFCA uses sequential running average to aggregate models from neighbors as updates arrive, providing communication-efficient alternative to batch aggregation while maintaining clustering performance in a fully decentralized setting.
Result: Experiments show DFCA outperforms other decentralized algorithms and performs comparably to centralized IFCA, even under sparse connectivity.
Conclusion: DFCA demonstrates robustness and practicality for dynamic real-world decentralized networks by eliminating central coordination while maintaining clustering performance.
Abstract: Clustered Federated Learning has emerged as an effective approach for handling heterogeneous data across clients by partitioning them into clusters with similar or identical data distributions. However, most existing methods, including the Iterative Federated Clustering Algorithm (IFCA), rely on a central server to coordinate model updates, which creates a bottleneck and a single point of failure, limiting their applicability in more realistic decentralized learning settings. In this work, we introduce DFCA, a fully decentralized clustered FL algorithm that enables clients to collaboratively train cluster-specific models without central coordination. DFCA uses a sequential running average to aggregate models from neighbors as updates arrive, providing a communication-efficient alternative to batch aggregation while maintaining clustering performance. Our experiments on various datasets demonstrate that DFCA outperforms other decentralized algorithms and performs comparably to centralized IFCA, even under sparse connectivity, highlighting its robustness and practicality for dynamic real-world decentralized networks.
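A sketch of the sequential aggregation described above: each client keeps a running average of same-cluster neighbor models, updating as each model arrives instead of waiting for a synchronized batch. The structure is an assumption distilled from the description, not the authors' implementation.

```python
import torch

class RunningModelAverage:
    def __init__(self, own_state):
        self.avg = {k: v.clone() for k, v in own_state.items()}
        self.count = 1

    def add(self, neighbor_state):
        """Incorporate one neighbor model as soon as its update arrives."""
        self.count += 1
        for k, v in neighbor_state.items():
            self.avg[k] += (v - self.avg[k]) / self.count   # running mean

model = torch.nn.Linear(10, 2)
agg = RunningModelAverage(model.state_dict())
agg.add(torch.nn.Linear(10, 2).state_dict())    # a same-cluster neighbor's model
model.load_state_dict(agg.avg)
```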
[329] On the Generalization Properties of Learning the Random Feature Models with Learnable Activation Functions
Zailin Ma, Jiansheng Yang, Yaodong Yang
Main category: cs.LG
TL;DR: This paper provides sharp generalization bounds for Random Feature models with Learnable Activation Functions (RFLAF), showing that data-dependent weighted sampling significantly reduces the required number of features compared to plain sampling.
Details
Motivation: To improve the generalization properties and reduce the computational complexity of kernel methods by developing sharper bounds on the required number of features for RFLAF models.Method: Applied data-dependent sampling schemes (leverage weighted sampling) for generating features, proposed an algorithm to find approximate kernels, and analyzed both plain sampling and weighted sampling schemes.
Result: Weighted sampling improved bounds from Ω(1/ε²) to Õ((1/ε)^{1/t}) for MSE loss (and Ω(1) for finite-rank Gram matrices), and from Ω(1/ε²) to Õ((1/ε²)^{1/t}) for Lipschitz loss. Empirical results showed weighted RFLAF achieves same performance with significantly fewer features.
Conclusion: Data-dependent weighted sampling provides substantial improvements in feature efficiency for RFLAF models, with theoretical bounds and empirical evidence supporting the effectiveness of this approach.
Abstract: This paper studies the generalization properties of a recently proposed kernel method, the Random Feature models with Learnable Activation Functions (RFLAF). By applying a data-dependent sampling scheme for generating features, we provide by far the sharpest bounds on the required number of features for learning RFLAF in both the regression and classification tasks. We provide a unified theorem that describes the complexity of the feature number $s$, and discuss the results for the plain sampling scheme and the data-dependent leverage weighted scheme. Through weighted sampling, the bound on $s$ in the MSE loss case is improved from $\Omega(1/\epsilon^2)$ to $\tilde{\Omega}((1/\epsilon)^{1/t})$ in general $(t\geq 1)$, and even to $\Omega(1)$ when the Gram matrix has a finite rank. For the Lipschitz loss case, the bound is improved from $\Omega(1/\epsilon^2)$ to $\tilde{\Omega}((1/\epsilon^2)^{1/t})$. To learn the weighted RFLAF, we also propose an algorithm to find an approximate kernel and then apply the leverage weighted sampling. Empirical results show that the weighted RFLAF achieves the same performance with significantly fewer features compared to the plainly sampled RFLAF, validating our theories and the effectiveness of this method.
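A sketch of the weighted random-feature mechanism the bounds concern: frequencies drawn from a proposal q are importance-corrected by sqrt(p/q) so the kernel estimate stays unbiased, and leverage-weighted sampling corresponds to a particular data-dependent q. RFLAF's learnable activations are omitted and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 2, 500                                   # input dim, number of features

# Proposal q: a wider Gaussian than the RBF spectral density p = N(0, I).
scale_q = 1.5
W = rng.normal(scale=scale_q, size=(s, d))
log_p = -0.5 * (W ** 2).sum(1) - 0.5 * d * np.log(2 * np.pi)
log_q = (-0.5 * (W ** 2).sum(1) / scale_q ** 2
         - 0.5 * d * np.log(2 * np.pi * scale_q ** 2))
weights = np.sqrt(np.exp(log_p - log_q))        # importance correction sqrt(p/q)
phases = rng.uniform(0, 2 * np.pi, size=s)

def features(x):
    return np.sqrt(2.0 / s) * weights * np.cos(W @ x + phases)

x, y = rng.normal(size=d), rng.normal(size=d)
print("approx kernel:", features(x) @ features(y))
print("exact kernel: ", np.exp(-0.5 * np.sum((x - y) ** 2)))
```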
[330] Backdoor or Manipulation? Graph Mixture of Experts Can Defend Against Various Graph Adversarial Attacks
Yuyuan Feng, Bin Ma, Enyan Dai
Main category: cs.LG
TL;DR: Proposes a unified defense framework using Mixture of Experts (MoE) architecture to protect graph neural networks against multiple adversarial attacks including backdoor, edge manipulation, and node injection attacks.
Details
Motivation: Existing GNN defenses focus on single attack types, lacking a unified approach to handle multiple threats simultaneously. There's a need for comprehensive protection against various adversarial attacks on graph neural networks.Method: Uses Mixture of Experts architecture with MI-based logic diversity loss to encourage experts to focus on different neighborhood structures, and a robustness-aware router that identifies perturbation patterns and routes perturbed nodes to robust experts.
Result: Extensive experiments show the method consistently achieves superior robustness against multiple graph adversarial attacks across various adversarial settings.
Conclusion: The proposed MoE-based framework provides an effective unified defense solution that outperforms existing methods in protecting GNNs against diverse adversarial threats.
Abstract: Extensive research has highlighted the vulnerability of graph neural networks (GNNs) to adversarial attacks, including manipulation, node injection, and the recently emerging threat of backdoor attacks. However, existing defenses typically focus on a single type of attack, lacking a unified approach to simultaneously defend against multiple threats. In this work, we leverage the flexibility of the Mixture of Experts (MoE) architecture to design a scalable and unified framework for defending against backdoor, edge manipulation, and node injection attacks. Specifically, we propose an MI-based logic diversity loss to encourage individual experts to focus on distinct neighborhood structures in their decision processes, thus ensuring a sufficient subset of experts remains unaffected under perturbations in local structures. Moreover, we introduce a robustness-aware router that identifies perturbation patterns and adaptively routes perturbed nodes to corresponding robust experts. Extensive experiments conducted under various adversarial settings demonstrate that our method consistently achieves superior robustness against multiple graph adversarial attacks.
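A minimal mixture-of-experts sketch for node-level predictions: a router produces per-node gating weights over experts. The MI-based diversity loss and the robustness-aware routing of the paper are not reproduced; this only shows the structural skeleton, with illustrative sizes.

```python
import torch
import torch.nn as nn

class NodeMoE(nn.Module):
    def __init__(self, in_dim=16, n_experts=4, n_classes=3):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, n_classes))
            for _ in range(n_experts)])
        self.router = nn.Linear(in_dim, n_experts)

    def forward(self, h):                                  # h: (n_nodes, in_dim)
        gate = torch.softmax(self.router(h), dim=-1)       # (n, E) routing weights
        out = torch.stack([e(h) for e in self.experts], 1) # (n, E, C)
        return (gate.unsqueeze(-1) * out).sum(1)           # gated combination

logits = NodeMoE()(torch.randn(10, 16))
```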
[331] Sequence Modeling with Spectral Mean Flows
Jinwoo Kim, Max Beier, Petar Bevanda, Nayun Kim, Seunghoon Hong
Main category: cs.LG
TL;DR: A novel sequence modeling approach based on operator theory that embeds sequence distributions as tensors in Hilbert spaces and uses MMD gradient flow for generation, with spectral decomposition for scalability.
Details
Motivation: To address the challenge of representing and learning highly nonlinear probabilistic state dynamics in sequence modeling by leveraging operator theory, which offers an appealing but overlooked perspective on dynamics as linear maps in Hilbert spaces.Method: Proposes spectral mean flows: (1) uses spectral decomposition of linear operators to create scalable tensor network decomposition of sequence mean embeddings, (2) extends MMD gradient flows to time-dependent Hilbert spaces and connects them to flow matching via continuity equation for simulation-free learning.
Result: Demonstrates competitive results on various time-series modeling datasets, overcoming challenges with large tensors and slow sampling convergence.
Conclusion: The operator-theoretic approach combined with spectral decomposition and MMD gradient flows provides an effective framework for sequence modeling with improved scalability and sampling efficiency.
Abstract: A key question in sequence modeling with neural networks is how to represent and learn highly nonlinear and probabilistic state dynamics. Operator theory views such dynamics as linear maps on Hilbert spaces containing mean embedding vectors of distributions, offering an appealing but currently overlooked perspective. We propose a new approach to sequence modeling based on an operator-theoretic view of a hidden Markov model (HMM). Instead of materializing stochastic recurrence, we embed the full sequence distribution as a tensor in the product Hilbert space. A generative process is then defined as maximum mean discrepancy (MMD) gradient flow in the space of sequences. To overcome challenges with large tensors and slow sampling convergence, we introduce spectral mean flows, a novel tractable algorithm integrating two core concepts. First, we propose a new neural architecture by leveraging spectral decomposition of linear operators to derive a scalable tensor network decomposition of sequence mean embeddings. Second, we extend MMD gradient flows to time-dependent Hilbert spaces and connect them to flow matching via the continuity equation, enabling simulation-free learning and faster sampling. We demonstrate competitive results on a range of time-series modeling datasets. Code is available at https://github.com/jw9730/spectral-mean-flow.
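A toy MMD gradient flow, to make the generative process concrete: particles descend the empirical MMD^2 toward a fixed target sample under an RBF kernel. The paper's spectral tensor-network parameterization and time-dependent Hilbert spaces are omitted; step size and kernel width are arbitrary toy choices.

```python
import torch

def rbf(a, b, sigma=2.0):
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))

def mmd2(x, y):
    return rbf(x, x).mean() - 2 * rbf(x, y).mean() + rbf(y, y).mean()

Y = torch.randn(256, 2) + torch.tensor([3.0, 0.0])   # target samples
X = torch.randn(256, 2, requires_grad=True)          # particle positions

opt = torch.optim.SGD([X], lr=100.0)
for _ in range(300):                                 # discretized gradient flow
    opt.zero_grad()
    mmd2(X, Y).backward()
    opt.step()
print("particle mean:", X.detach().mean(0))          # drifts toward (3, 0)
```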
[332] Towards Robust Zero-Shot Reinforcement Learning
Kexin Zheng, Lauriane Teyssier, Yinan Zheng, Yu Luo, Xiayuan Zhan
Main category: cs.LG
TL;DR: BREEZE is an improved zero-shot RL framework that enhances Forward-Backward representations with behavioral regularization, diffusion-based policy extraction, and attention-based architectures to address expressivity limitations and OOD issues.
Details
Motivation: Existing zero-shot RL methods like Forward-Backward representations suffer from limited expressivity and extrapolation errors from out-of-distribution actions during offline learning, leading to biased representations and suboptimal performance.Method: BREEZE introduces behavioral regularization for stable in-sample policy learning, uses task-conditioned diffusion models for multimodal policy extraction, and employs expressive attention-based architectures for representation modeling.
Result: Extensive experiments on ExORL and D4RL Kitchen show BREEZE achieves best or near-best performance with superior robustness compared to prior offline zero-shot RL methods.
Conclusion: BREEZE successfully addresses the limitations of existing zero-shot RL methods by enhancing learning stability, policy extraction capability, and representation quality through its integrated framework.
Abstract: The recent development of zero-shot reinforcement learning (RL) has opened a new avenue for learning pre-trained generalist policies that can adapt to arbitrary new tasks in a zero-shot manner. While the popular Forward-Backward representations (FB) and related methods have shown promise in zero-shot RL, we empirically found that their modeling lacks expressivity and that extrapolation errors caused by out-of-distribution (OOD) actions during offline learning sometimes lead to biased representations, ultimately resulting in suboptimal performance. To address these issues, we propose Behavior-REgularizEd Zero-shot RL with Expressivity enhancement (BREEZE), an upgraded FB-based framework that simultaneously enhances learning stability, policy extraction capability, and representation learning quality. BREEZE introduces behavioral regularization in zero-shot RL policy learning, transforming policy optimization into a stable in-sample learning paradigm. Additionally, BREEZE extracts the policy using a task-conditioned diffusion model, enabling the generation of high-quality and multimodal action distributions in zero-shot RL settings. Moreover, BREEZE employs expressive attention-based architectures for representation modeling to capture the complex relationships between environmental dynamics. Extensive experiments on ExORL and D4RL Kitchen demonstrate that BREEZE achieves the best or near-the-best performance while exhibiting superior robustness compared to prior offline zero-shot RL methods. The official implementation is available at: https://github.com/Whiterrrrr/BREEZE.
[333] Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning
Mingyang Sun, Pengxiang Ding, Weinan Zhang, Donglin Wang
Main category: cs.LG
TL;DR: SWFP framework enables stable online fine-tuning of flow policies by discretizing flow matching into stepwise transformations aligned with JKO optimal transport principles, overcoming distributional shift in behavior cloning.
Details
Motivation: Behavior cloning with flow policies is vulnerable to distributional shift, and standard RL methods struggle to fine-tune these models due to iterative inference processes and existing workaround limitations.Method: Discretizes flow matching inference via fixed-step Euler scheme aligned with variational JKO principle, decomposing global flow into incremental transformations between proximate distributions with entropic regularization.
Result: SWFP achieves enhanced stability, efficiency, and superior adaptation performance across diverse robotic control benchmarks with simpler training, reduced computational costs, and provable stability.
Conclusion: The stepwise decomposition approach provides an effective framework for fine-tuning pre-trained flow policies with theoretical guarantees and practical advantages for robotic control applications.
Abstract: While behavior cloning with flow/diffusion policies excels at learning complex skills from demonstrations, it remains vulnerable to distributional shift, and standard RL methods struggle to fine-tune these models due to their iterative inference process and the limitations of existing workarounds. In this work, we introduce the Stepwise Flow Policy (SWFP) framework, founded on the key insight that discretizing the flow matching inference process via a fixed-step Euler scheme inherently aligns it with the variational Jordan-Kinderlehrer-Otto (JKO) principle from optimal transport. SWFP decomposes the global flow into a sequence of small, incremental transformations between proximate distributions. Each step corresponds to a JKO update, regularizing policy changes to stay near the previous iterate and ensuring stable online adaptation with entropic regularization. This decomposition yields an efficient algorithm that fine-tunes pre-trained flows via a cascade of small flow blocks, offering significant advantages: simpler/faster training of sub-models, reduced computational/memory costs, and provable stability grounded in Wasserstein trust regions. Comprehensive experiments demonstrate SWFP’s enhanced stability, efficiency, and superior adaptation performance across diverse robotic control benchmarks.
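A sketch of the fixed-step Euler view that underlies SWFP: action sampling unrolls a velocity field in K small steps, so each step can be treated (and fine-tuned) as its own block. The JKO-style proximal regularization and the RL objective are omitted; the velocity network is an illustrative stand-in.

```python
import torch
import torch.nn as nn

K = 8                                             # number of Euler steps
state_dim, act_dim = 4, 2
velocity = nn.Sequential(
    nn.Linear(state_dim + act_dim + 1, 64), nn.ReLU(), nn.Linear(64, act_dim))

def sample_action(state):                         # state: (batch, state_dim)
    a = torch.randn(state.shape[0], act_dim)      # a_0 ~ N(0, I)
    for k in range(K):
        t = torch.full((state.shape[0], 1), k / K)
        a = a + (1.0 / K) * velocity(torch.cat([state, a, t], dim=-1))
    return a                                      # each step is one small block

actions = sample_action(torch.randn(16, state_dim))
```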
[334] Geometric Mixture Models for Electrolyte Conductivity Prediction
Anyi Li, Jiacheng Cen, Songyou Li, Mingze Li, Yang Yu, Wenbing Huang
Main category: cs.LG
TL;DR: GeoMix is a geometry-aware framework for predicting ionic conductivity in electrolyte systems that addresses challenges in standardized benchmarks and geometric modeling of mixture systems.
Details
Motivation: Current research faces two fundamental challenges: lack of high-quality standardized benchmarks and inadequate modeling of geometric structure and intermolecular interactions in mixture systems.Method: Reorganize CALiSol and DiffMix datasets with geometric graph representations, then propose GeoMix framework with Geometric Interaction Network (GIN) for equivariant intermolecular geometric message passing that preserves Set-SE(3) equivariance.
Result: GeoMix consistently outperforms diverse baselines (MLPs, GNNs, and geometric GNNs) across both datasets, validating the importance of cross-molecular geometric interactions and equivariant message passing.
Conclusion: This work establishes new benchmarks for electrolyte research and provides a general geometric learning framework that advances modeling of mixture systems in energy materials, pharmaceutical development, and beyond.
Abstract: Accurate prediction of ionic conductivity in electrolyte systems is crucial for advancing numerous scientific and technological applications. While significant progress has been made, current research faces two fundamental challenges: (1) the lack of high-quality standardized benchmarks, and (2) inadequate modeling of geometric structure and intermolecular interactions in mixture systems. To address these limitations, we first reorganize and enhance the CALiSol and DiffMix electrolyte datasets by incorporating geometric graph representations of molecules. We then propose GeoMix, a novel geometry-aware framework that preserves Set-SE(3) equivariance-an essential but challenging property for mixture systems. At the heart of GeoMix lies the Geometric Interaction Network (GIN), an equivariant module specifically designed for intermolecular geometric message passing. Comprehensive experiments demonstrate that GeoMix consistently outperforms diverse baselines (including MLPs, GNNs, and geometric GNNs) across both datasets, validating the importance of cross-molecular geometric interactions and equivariant message passing for accurate property prediction. This work not only establishes new benchmarks for electrolyte research but also provides a general geometric learning framework that advances modeling of mixture systems in energy materials, pharmaceutical development, and beyond.
[335] Online Kernel Dynamic Mode Decomposition for Streaming Time Series Forecasting with Adaptive Windowing
Christopher Salazar, Krithika Manohar, Ashis G. Banerjee
Main category: cs.LG
TL;DR: WORK-DMD is an online forecasting method that combines Random Fourier Features with Dynamic Mode Decomposition to handle non-stationary streaming data with fixed computational cost and competitive accuracy.
Details
Motivation: Address challenges in real-time forecasting: handling non-stationary dynamics, operating under computational constraints, and adapting rapidly without catastrophic forgetting, overcoming trade-offs between accuracy, adaptability, and efficiency.Method: Combines Random Fourier Features with online Dynamic Mode Decomposition using explicit feature mapping, employs Sherman-Morrison updates within rolling windows for continuous adaptation from current data only.
Result: Achieves higher accuracy than state-of-the-art online forecasting methods across benchmark datasets, requires only single pass through data, shows strong performance in short-term forecasting with minimal data requirements.
Conclusion: Combining kernel evaluations with adaptive matrix updates achieves strong predictive performance with minimal data, offering practical alternative to deep learning for streaming forecasting applications.
Abstract: Real-time forecasting from streaming data poses critical challenges: handling non-stationary dynamics, operating under strict computational limits, and adapting rapidly without catastrophic forgetting. However, many existing approaches face trade-offs between accuracy, adaptability, and efficiency, particularly when deployed in constrained computing environments. We introduce WORK-DMD (Windowed Online Random Kernel Dynamic Mode Decomposition), a method that combines Random Fourier Features with online Dynamic Mode Decomposition to capture nonlinear dynamics through explicit feature mapping, while preserving fixed computational cost and competitive predictive accuracy across evolving data. WORK-DMD employs Sherman-Morrison updates within rolling windows, enabling continuous adaptation to evolving dynamics from only current data, eliminating the need for lengthy training or large storage requirements for historical data. Experiments on benchmark datasets across several domains show that WORK-DMD achieves higher accuracy than several state-of-the-art online forecasting methods, while requiring only a single pass through the data and demonstrating particularly strong performance in short-term forecasting. Our results show that combining kernel evaluations with adaptive matrix updates achieves strong predictive performance with minimal data requirements. This sample efficiency offers a practical alternative to deep learning for streaming forecasting applications.
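A sketch of the two WORK-DMD building blocks on toy data: states are lifted with random Fourier features, and an online DMD operator is refreshed with rank-one Sherman-Morrison updates of the inverse Gram matrix. The rolling-window downdating of old samples is omitted; dimensions and the toy dynamics are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 2, 50                                       # state dim, feature dim
W, b = rng.normal(size=(D, d)), rng.uniform(0, 2 * np.pi, D)
phi = lambda x: np.sqrt(2.0 / D) * np.cos(W @ x + b)   # random Fourier lift

Q = np.zeros((D, D))                               # running sum of y x^T
P = 1e3 * np.eye(D)                                # ~ (X X^T)^{-1}, large init

def update(x_feat, y_feat):
    """Sherman-Morrison rank-one update after observing one (x, y) pair."""
    global Q, P
    Q += np.outer(y_feat, x_feat)
    Px = P @ x_feat
    P -= np.outer(Px, Px) / (1.0 + x_feat @ Px)    # (A + xx^T)^{-1} update

x = np.array([1.0, 0.0])
for _ in range(200):                               # toy rotating dynamics
    y = np.array([[0.99, -0.1], [0.1, 0.99]]) @ x
    update(phi(x), phi(y))
    x = y

K = Q @ P                                          # current DMD operator
pred = K @ phi(x)                                  # one-step forecast in feature space
```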
[336] ParaFormer: Shallow Parallel Transformers with Progressive Approximation
Wei Wang, Xiao-Yong Wei, Qing Li
Main category: cs.LG
TL;DR: ParaFormer is a shallow Transformer architecture that achieves high performance through parallel branches instead of deep sequential layers, enabling faster training, inference, and model compression while maintaining competitive performance.
Details
Motivation: Address challenges of deep Transformers including long training times, high inference latency, and impracticality on resource-constrained devices by moving away from the 'deeper is better' philosophy.Method: Formulates Transformers as function approximators in closed-form, organizes layers into parallel branches with progressive approximation that ensures each new branch reduces loss from preceding branches, enabling inter-layer collaboration without sequential constraints.
Result: Outperforms standard Transformers like ViT, supports up to 15.07x model compression, enables 3.30x faster deployment than FairScale on multi-GPU systems, and facilitates adaptive continuous learning through model expansion.
Conclusion: The closed-form formulation based on Universal Approximation Theorem explains the ‘depth belief’ and opens new avenues for designing efficient Transformer architectures through parallel rather than sequential designs.
Abstract: The widespread ‘deeper is better’ philosophy has driven the creation of architectures like ResNet and Transformer, which achieve high performance by stacking numerous layers. However, increasing model depth comes with challenges such as longer training times, higher inference latency, and impracticality on resource-constrained devices. To address these issues, we propose ParaFormer, a shallow Transformer architecture designed for true parallelism in both structure and computation. By formulating standard Transformers as function approximators in closed-form, our theoretical analysis shows that their performance relies on inter-layer collaboration for progressive approximation, rather than depth itself. While deep Transformers enforce this collaboration through sequential designs, we demonstrate that such collaboration is not inherently tied to sequential structures. ParaFormer removes the sequential constraint by organizing layers into parallel branches, enforcing inter-layer collaboration algorithmically. Specifically, we implement progressive approximation, ensuring that each new branch further reduces the loss from preceding branches, enabling faster convergence. Extensive experiments validate ParaFormer’s effectiveness, outperforming standard Transformers like ViT. Moreover, ParaFormer supports up to 15.07x model compression and facilitates model expansion for adaptive continuous learning. Experimental results on multi-GPU deployment demonstrate that ParaFormer is 3.30x faster than widely used parallelism solutions such as FairScale. These advancements stem from our closed-form formulation of Transformers based on the Universal Approximation Theorem, which not only explains the ‘depth belief’ but also opens new avenues for designing efficient Transformer architectures. Source code: https://(open-upon-acceptance)
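A simplified, boosting-flavored reading of progressive approximation: the model output is the sum of branch outputs, and each new branch is fitted to the residual left by the previous ones, so branches can run in parallel at inference. The actual transformer branches and training algorithm of the paper are not reproduced.

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(2 * x)                                   # toy regression target

branches, partial = [], torch.zeros_like(y)
for k in range(4):                                     # parallel branches
    branch = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(branch.parameters(), lr=1e-2)
    residual = (y - partial).detach()                  # loss left by earlier branches
    for _ in range(300):                               # fit this branch to the residual
        opt.zero_grad()
        ((branch(x) - residual) ** 2).mean().backward()
        opt.step()
    partial = partial + branch(x).detach()
    branches.append(branch)

print("final MSE:", ((partial - y) ** 2).mean().item())
```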
[337] Safe, Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models
Shashank Gupta
Main category: cs.LG
TL;DR: This dissertation develops safe, sample-efficient reinforcement learning methods for ranking systems and text-to-image diffusion models, with theoretical guarantees and practical algorithms including exposure-based bounds, optimal baselines, and the LOOP algorithm.
Details
Motivation: To address the need for safe deployment of RL methods in real-world applications like ranking/recommendation systems and generative models, ensuring they don't underperform existing policies while maintaining efficiency and robustness.Method: Uses contextual-bandit RL framework with three main approaches: 1) exposure-based generalization bounds and counterfactual risk minimization for ranking systems, 2) unified baseline-correction framework with optimal baseline for off-policy learning, 3) LOOP algorithm combining PPO and REINFORCE for text-to-image diffusion models.
Result: Developed theoretical guarantees against underperformance, optimal baselines that minimize variance, and LOOP algorithm that achieves PPO-level efficiency with better text alignment in generations.
Conclusion: The dissertation provides comprehensive RL frameworks that ensure safety, efficiency, and robustness across different application domains, with both theoretical foundations and practical algorithms that outperform existing methods.
Abstract: This dissertation investigates how reinforcement learning (RL) methods can be designed to be safe, sample-efficient, and robust. Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains - ranking and recommendation, and text-to-image diffusion models. The first part of the thesis develops theory and algorithms for safe deployment in ranking systems. An exposure-based generalisation bound is derived, leading to a counterfactual risk-minimisation objective whose solution is guaranteed not to underperform the logging policy, even with sparse feedback. This guarantee is extended to doubly robust estimators, enabling safety even under adversarial or misspecified user models and offering practitioners explicit control over permissible utility loss. The second part turns to single-action bandits, where various off-policy estimators are unified within a baseline-correction framework. A closed-form optimal baseline is proposed and shown to minimise both evaluation and policy-gradient variance, thereby improving off-policy learning reliability. The final part examines the trade-offs between efficiency and effectiveness in generative RL. A systematic study of PPO and REINFORCE motivates the Leave-One-Out PPO (LOOP) algorithm, which combines multiple diffusion trajectories with a REINFORCE-style baseline inside PPO’s clipped objective. LOOP achieves PPO-level sample efficiency while producing generations that align more faithfully with textual attributes.
[338] A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, Xiaoxing Ma
Main category: cs.LG
TL;DR: RPC is a hybrid test-time scaling method that combines self-consistency and perplexity approaches to improve LLM reasoning performance while reducing sampling costs by 50%.
Details
Motivation: Current sampling-based test-time scaling methods lack theoretical foundations, with self-consistency suffering from high estimation error and perplexity having substantial modeling error and potential degradation.Method: RPC uses two components: Perplexity Consistency (combining self-consistency and perplexity to boost convergence rate) and Reasoning Pruning (eliminating low-probability reasoning paths).
Result: RPC achieves reasoning performance comparable to self-consistency while enhancing confidence reliability and reducing sampling costs by 50% across seven benchmark datasets.
Conclusion: RPC provides a theoretically-grounded hybrid approach that effectively addresses limitations of existing test-time scaling methods and demonstrates strong potential for reducing reasoning errors in LLMs.
Abstract: Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by adding computational resources. A prevalent approach within the field is sampling-based test-time scaling methods, which enhance reasoning by generating multiple reasoning paths for a given input during inference. However, despite its practical success, the theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on the framework, we analyze two dominant paradigms: self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error while perplexity exhibits substantial modeling error and possible degradation of the estimation error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has a strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while not only enhancing confidence reliability but also reducing sampling costs by 50%. The code and resources are available at https://wnjxyk.github.io/RPC.
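A sketch of the two RPC ingredients on hypothetical samples: prune low-probability reasoning paths, then score each distinct answer by the total sequence probability of the paths that reach it (perplexity-weighted self-consistency). The sample format and the pruning threshold are illustrative assumptions, not the paper's exact estimator.

```python
import math
from collections import defaultdict

# (final_answer, total_log_probability_of_the_sampled_path) -- toy values
samples = [("42", -5.1), ("42", -5.9), ("17", -4.8), ("17", -12.0), ("9", -13.5)]

# Reasoning pruning: drop paths far below the best path's probability.
best = max(lp for _, lp in samples)
kept = [(a, lp) for a, lp in samples if lp > best - 5.0]

# Perplexity consistency: aggregate path probabilities per answer.
scores = defaultdict(float)
for answer, lp in kept:
    scores[answer] += math.exp(lp)

prediction = max(scores, key=scores.get)
print(prediction, dict(scores))
```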
[339] Particle Dynamics for Latent-Variable Energy-Based Models
Shiqin Tang, Shuxin Zhuang, Rong Feng, Runsheng Yu, Hongzong Li, Youzhi Zhang
Main category: cs.LG
TL;DR: This paper presents a novel training method for latent-variable energy-based models by reformulating maximum-likelihood training as a saddle problem over distributions, using coupled Wasserstein gradient flows without requiring discriminators or auxiliary networks.
Details
Motivation: To develop an expressive generative modeling approach that captures hidden structure in data through latent-variable energy-based models, while avoiding the need for discriminators or auxiliary networks.Method: Recast maximum-likelihood training as a saddle problem over distributions on latent and joint manifolds, using coupled Wasserstein gradient flows with alternating overdamped Langevin updates for joint negative pool and conditional latent particles with stochastic parameter ascent.
Result: Proved existence and convergence under standard assumptions with decay rates in KL divergence and Wasserstein-2 distance, achieving competitive performance on numerical approximations of physical systems compared to comparable approaches.
Conclusion: The saddle-point view provides an ELBO strictly tighter than bounds from restricted amortized posteriors, offering an effective training method for latent-variable energy-based models that captures hidden structure without requiring additional networks.
Abstract: Latent-variable energy-based models (LVEBMs) assign a single normalized energy to joint pairs of observed data and latent variables, offering expressive generative modeling while capturing hidden structure. We recast maximum-likelihood training as a saddle problem over distributions on the latent and joint manifolds and view the inner updates as coupled Wasserstein gradient flows. The resulting algorithm alternates overdamped Langevin updates for a joint negative pool and for conditional latent particles with stochastic parameter ascent, requiring no discriminator or auxiliary networks. We prove existence and convergence under standard smoothness and dissipativity assumptions, with decay rates in KL divergence and Wasserstein-2 distance. The saddle-point view further yields an ELBO strictly tighter than bounds obtained with restricted amortized posteriors. Our method is evaluated on numerical approximations of physical systems and performs competitively against comparable approaches.
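A toy overdamped Langevin sketch for the negative pool: particles follow the negative energy gradient plus Gaussian noise. The coupled conditional latent particles and the stochastic parameter ascent of the full algorithm are omitted; the quadratic energy is an illustrative stand-in for a learned joint energy.

```python
import torch

def energy(z):                                     # stand-in for a learned E(x, z)
    return 0.5 * (z ** 2).sum(dim=-1)

particles = torch.randn(512, 2) * 3.0
eta = 0.05
for _ in range(500):                               # overdamped Langevin updates
    particles.requires_grad_(True)
    grad = torch.autograd.grad(energy(particles).sum(), particles)[0]
    particles = (particles - eta * grad
                 + (2 * eta) ** 0.5 * torch.randn_like(particles)).detach()
print("sample std:", particles.std(0))             # ~1 for this Gaussian energy
```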
[340] Expediting Reinforcement Learning by Incorporating Knowledge About Temporal Causality in the Environment
Jan Corazza, Hadi Partovi Aria, Daniel Neider, Zhe Xu
Main category: cs.LG
TL;DR: This paper proposes a method to incorporate causal information via Temporal Logic-based Causal Diagrams into probabilistic reward machines to improve RL for sparse-reward tasks and enable better transfer learning.
Details
Motivation: RL struggles with sparse rewards and complex temporal dependencies. Probabilistic reward machines help but are hard to design manually, limiting the use of causal knowledge and transfer to new domains.Method: Incorporates causal information using Temporal Logic-based Causal Diagrams into the reward formalism to structure probabilistic reward machines.
Result: The method expedites policy learning, aids task specification transfer to new environments, and has theoretical convergence guarantees to optimal policy.
Conclusion: The proposed approach effectively leverages causal knowledge to improve RL performance in sparse-reward settings and enables better transfer learning across domains.
Abstract: Reinforcement learning (RL) algorithms struggle with learning optimal policies for tasks where reward feedback is sparse and depends on a complex sequence of events in the environment. Probabilistic reward machines (PRMs) are finite-state formalisms that can capture temporal dependencies in the reward signal, along with nondeterministic task outcomes. While special RL algorithms can exploit this finite-state structure to expedite learning, PRMs remain difficult to modify and design by hand. This hinders the already difficult tasks of utilizing high-level causal knowledge about the environment, and transferring the reward formalism into a new domain with a different causal structure. This paper proposes a novel method to incorporate causal information in the form of Temporal Logic-based Causal Diagrams into the reward formalism, thereby expediting policy learning and aiding the transfer of task specifications to new environments. Furthermore, we provide a theoretical result about convergence to optimal policy for our method, and demonstrate its strengths empirically.
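To make the finite-state reward formalism concrete, a minimal (deterministic) reward-machine-style sketch follows: reward is emitted only after the events "key" then "door" occur in order. The probabilistic transitions and the Temporal Logic-based Causal Diagrams that the paper integrates are not reproduced, and the events are invented for illustration.

```python
class RewardMachine:
    def __init__(self):
        self.state = "start"

    def step(self, event):
        if self.state == "start" and event == "key":
            self.state = "has_key"
        elif self.state == "has_key" and event == "door":
            self.state = "done"
            return 1.0                    # sparse reward, temporally conditioned
        return 0.0

rm = RewardMachine()
print([rm.step(e) for e in ["door", "key", "door"]])   # [0.0, 0.0, 1.0]
```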
[341] Learning to Answer from Correct Demonstrations
Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma, Nathan Srebro
Main category: cs.LG
TL;DR: The paper proposes a new approach for learning from correct demonstrations in contextual bandits, where multiple answers can be correct. Instead of assuming low-complexity policy classes, it assumes low-cardinality reward classes and shows that maximum likelihood estimation can fail in this setting.
Details
Motivation: To address the problem of learning from demonstrations where multiple correct answers exist, moving beyond traditional maximum likelihood estimation which assumes low-complexity policy classes.Method: Formalizes the problem as offline imitation learning in contextual bandits, proposes an alternative approach that relies on low-cardinality reward classes rather than low-complexity policy classes, and develops a method with logarithmic sample complexity in reward class cardinality.
Result: Shows that likelihood maximization methods can fail in the proposed setting, and presents a novel approach that achieves sample complexity logarithmic in the cardinality of the reward class.
Conclusion: Motivates looking beyond likelihood maximization when learning from correct demonstrations, especially in settings with multiple acceptable answers and low-cardinality reward classes.
Abstract: We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead devise an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from correct demonstrations.
[342] Adversary-Free Counterfactual Prediction via Information-Regularized Representations
Shiqin Tang, Rong Feng, Shuxin Zhuang, Hongzong Li, Youzhi Zhang
Main category: cs.LG
TL;DR: Proposes an information-theoretic approach for counterfactual prediction under assignment bias, using mutual information minimization to remove treatment-covariate dependence without adversarial training.
Details
Motivation: Address assignment bias in counterfactual prediction where treatment assignment depends on covariates, avoiding unstable adversarial training methods.Method: Learn stochastic representation Z predictive of outcomes while minimizing I(Z;T) using variational objective that upper-bounds information term, coupled with supervised decoder. Extends to dynamic settings.
Result: Outperforms state-of-the-art balancing, reweighting, and adversarial baselines on controlled simulations and clinical dataset across likelihood, counterfactual error, and policy evaluation metrics.
Conclusion: The approach provides stable, provably motivated training that avoids adversarial training instabilities while achieving favorable performance in counterfactual prediction.
Abstract: We study counterfactual prediction under assignment bias and propose a mathematically grounded, information-theoretic approach that removes treatment-covariate dependence without adversarial training. Starting from a bound that links the counterfactual-factual risk gap to mutual information, we learn a stochastic representation Z that is predictive of outcomes while minimizing I(Z; T). We derive a tractable variational objective that upper-bounds the information term and couples it with a supervised decoder, yielding a stable, provably motivated training criterion. The framework extends naturally to dynamic settings by applying the information penalty to sequential representations at each decision time. We evaluate the method on controlled numerical simulations and a real-world clinical dataset, comparing against recent state-of-the-art balancing, reweighting, and adversarial baselines. Across metrics of likelihood, counterfactual error, and policy evaluation, our approach performs favorably while avoiding the training instabilities and tuning burden of adversarial schemes.
[343] OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning
Woo-Jin Ahn, Sang-Ryul Baek, Yong-Jun Lee, Hyun-Duck Choi, Myo-Taeg Lim
Main category: cs.LG
TL;DR: OffSim is a model-based offline inverse reinforcement learning framework that learns environmental dynamics and reward functions from expert trajectories, enabling policy training without real environment interaction.
Details
Motivation: Traditional RL requires time-consuming simulator development and manual reward function design. OffSim addresses this by learning both dynamics and rewards directly from expert data.Method: Jointly optimizes high-entropy transition model and IRL-based reward function from state-action trajectories. OffSim+ extends this with marginal reward for multi-dataset settings.
Result: Extensive MuJoCo experiments show substantial performance gains over existing offline IRL methods, demonstrating efficacy and robustness.
Conclusion: OffSim provides an effective framework for learning environmental dynamics and rewards from expert data, enabling offline policy training without real environment interaction.
Abstract: Reinforcement learning algorithms typically utilize an interactive simulator (i.e., environment) with a predefined reward function for policy training. Developing such simulators and manually defining reward functions, however, is often time-consuming and labor-intensive. To address this, we propose an Offline Simulator (OffSim), a novel model-based offline inverse reinforcement learning (IRL) framework, to emulate environmental dynamics and reward structure directly from expert-generated state-action trajectories. OffSim jointly optimizes a high-entropy transition model and an IRL-based reward function to enhance exploration and improve the generalizability of the learned reward. Leveraging these learned components, OffSim can subsequently train a policy offline without further interaction with the real environment. Additionally, we introduce OffSim$^+$, an extension that incorporates a marginal reward for multi-dataset settings to enhance exploration. Extensive MuJoCo experiments demonstrate that OffSim achieves substantial performance gains over existing offline IRL methods, confirming its efficacy and robustness.
[344] The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling
Shijia Kang, Muhan Zhang
Main category: cs.LG
TL;DR: SESA introduces sequential sampling to address exploration limitations and entropy collapse in RL-trained LLMs, improving diversity and performance through structured solution generation.
Details
Motivation: Traditional RL methods for LLMs suffer from limited exploration and entropy collapse, where models exploit narrow solution sets, reducing sampling diversity and preventing further performance improvements.Method: Proposes SESA framework that generates diverse solution sketches sequentially before expanding them into full reasoning paths, conditioning each new output on previous ones to promote diversity.
Result: SESA consistently outperforms traditional RL methods in path diversity and collapse recovery. On three agent benchmarks, it achieves success rate improvements of +0.25, +0.42, and +0.07 absolute (up to 211% relative improvement over baseline RL).
Conclusion: SESA provides a structured approach to exploration that enhances reasoning diversity and effectiveness in RL-trained LLMs, paving the way for more robust performance improvements.
Abstract: Reinforcement learning (RL) has been pivotal in enhancing the reasoning capabilities of large language models (LLMs), but it often suffers from limited exploration and entropy collapse, where models exploit a narrow set of solutions, leading to a loss of sampling diversity and subsequently preventing RL from further improving performance. This issue is exacerbated in parallel sampling methods, where multiple outputs are drawn from the same distribution, potentially causing the model to converge to similar solutions. We propose SESA, a novel SEquential SAmpling framework that mitigates this challenge by generating diverse solution sketches sequentially before expanding them into full reasoning paths. This approach ensures broader exploration by conditioning each new output on previous ones, promoting diversity throughout the process and preventing policy collapse. Our experiments on a synthetic task show that sequential sampling consistently outperforms traditional RL methods in terms of path diversity and recovery from collapse. Further evaluations on real-world tasks demonstrate that SESA improves both the exploration of valid strategies and the overall performance of LLMs. On three agent benchmarks, SESA lifts success rates by $+0.25$, $+0.42$, and $+0.07$ absolute over the base model (up to an additional $211\%$ relative improvement over baseline RL), underscoring its exploration advantage. This work introduces a structured approach to exploration, paving the way for more effective and diverse reasoning in RL-trained LLMs. Our code is released at https://github.com/MuLabPKU/sesa.
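A sketch of the sequential-sampling loop: each new solution sketch is generated conditioned on the sketches produced so far, then expanded into a full reasoning path. `generate` is a hypothetical stand-in for an LLM call, not an API from the paper's code.

```python
def generate(prompt: str) -> str:           # placeholder LLM call (hypothetical)
    return f"<output conditioned on {prompt[-30:]!r}>"

def sequential_sample(question: str, n_sketches: int = 4):
    sketches, paths = [], []
    for _ in range(n_sketches):
        context = question + "\nPrevious sketches (propose something different):\n"
        context += "\n".join(sketches)
        sketch = generate(context)           # conditioned on earlier outputs
        sketches.append(sketch)
        paths.append(generate(f"Expand into a full solution: {sketch}"))
    return paths

print(sequential_sample("Prove that 2+2=4."))
```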
[345] Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity
Naoki Yoshida, Satoshi Hayakawa, Yuhta Takida, Toshimitsu Uesaka, Hiromi Wakaki, Yuki Mitsufuji
Main category: cs.LG
TL;DR: KME-CLIP enhances CLIP by using kernel methods to better approximate pointwise mutual information (PMI) between modalities, improving performance on retrieval and classification tasks.
Details
Motivation: Current CLIP implementations don't fully utilize the linear structure of PMI, which theory shows should be the optimal similarity metric between paired modalities.Method: Proposes KME-CLIP that leverages the inner product in a reproducing kernel Hilbert space to approximate PMI more accurately.
Result: The method theoretically approximates PMI with arbitrary accuracy and empirically outperforms standard CLIP on several retrieval and classification tasks.
Conclusion: Kernel-based similarity computation better captures the theoretical PMI structure, leading to improved multi-modal contrastive learning performance.
Abstract: In this study, we propose an enhancement to the similarity computation mechanism in multi-modal contrastive pretraining frameworks such as CLIP. Prior theoretical research has demonstrated that the optimal similarity metrics between paired modalities should correspond to the pointwise mutual information (PMI) between the two modalities. However, the current implementations of CLIP and its variants fail to fully utilize the underlying linear structure of PMI. We therefore propose KME-CLIP, which leverages this structure through the inner product in a reproducing kernel Hilbert space. We theoretically prove that our method can approximate PMI with arbitrary accuracy and empirically demonstrate that our approach overall outperforms the standard CLIP formulation across several retrieval and classification tasks.
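A heavily hedged sketch of the kernel-mean-embedding direction: instead of a single dot product between modality embeddings, similarity is computed as an RKHS inner product between kernel mean embeddings of embedding sets. This is a simplified illustration of the idea, not the paper's exact estimator.

```python
import torch

def rkhs_similarity(img_feats, txt_feats, sigma=1.0):
    """<mu_img, mu_txt>_H with an RBF kernel; feats: (n, d) and (m, d)."""
    d2 = torch.cdist(img_feats, txt_feats) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2)).mean()

img = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)
print(rkhs_similarity(img, txt), (img @ txt.T).mean())  # kernel vs. linear score
```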
[346] Language Models are Injective and Hence Invertible
Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà
Main category: cs.LG
TL;DR: Transformer language models are injective (lossless) - different inputs always produce different representations, enabling exact input reconstruction from hidden activations.
Details
Motivation: Challenge the common assumption that transformer components like non-linear activations and normalization make models non-injective and prevent exact input recovery.Method: Mathematical proof of injectivity at initialization and preservation during training, empirical collision tests on six state-of-the-art models, and development of SipIt algorithm for exact input reconstruction.
Result: No collisions found in billions of tests, SipIt algorithm successfully reconstructs exact input text from hidden activations with linear-time guarantees.
Conclusion: Injectivity is a fundamental property of transformer language models with implications for transparency, interpretability, and safe deployment.
Abstract: Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model’s representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
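A toy illustration of invertibility: recover an input sequence from hidden activations by scanning, at each position, for the token whose next activation matches. This naive search only conveys the idea behind exact reconstruction; it is not the SipIt algorithm, and the tiny causal "model" is a stand-in.

```python
import torch

torch.manual_seed(0)
vocab, dim = 100, 16
emb = torch.randn(vocab, dim)
mix = torch.randn(dim, dim)

def hidden_states(tokens):
    """A minimal causal map: h_t depends on tokens up to position t."""
    h, hs = torch.zeros(dim), []
    for t in tokens:
        h = torch.tanh(h @ mix + emb[t])
        hs.append(h)
    return hs

true_tokens = [5, 42, 7, 99]
target = hidden_states(true_tokens)

recovered, h = [], torch.zeros(dim)
for tgt in target:
    for cand in range(vocab):                 # scan the vocabulary at this position
        if torch.allclose(torch.tanh(h @ mix + emb[cand]), tgt):
            recovered.append(cand)
            h = tgt
            break
print(recovered == true_tokens)               # True: the map is injective here
```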
[347] Revisiting Knowledge Distillation: The Hidden Role of Dataset Size
Giulia Lanzillotta, Felix Sarnthein, Gil Kur, Thomas Hofmann, Bobby He
Main category: cs.LG
TL;DR: Knowledge distillation becomes more effective in low-data regimes, disproving the label smoothing hypothesis and supporting dark knowledge theory.
Details
Motivation: To understand how knowledge distillation works by studying its relationship with dataset size, as previous research focused mainly on model size and generalization.Method: Conducted extensive experiments across various datasets, tasks, and neural architectures to analyze distillation effects across different dataset sizes.
Result: Found that distillation’s effectiveness is amplified in low-data regimes (data efficiency), disproving label smoothing hypothesis and supporting dark knowledge.
Conclusion: Dataset size is a fundamental but overlooked variable in understanding distillation mechanisms, with significant implications for low-data scenarios.
Abstract: The concept of knowledge distillation (KD) describes the training of a student model from a teacher model and is a widely adopted technique in deep learning. However, it is still not clear how and why distillation works. Previous studies focus on two central aspects of distillation: model size, and generalisation. In this work we study distillation in a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks and neural architectures, demonstrating that the effect of distillation is not only preserved but amplified in low-data regimes. We call this newly discovered property the data efficiency of distillation. Equipped with this new perspective, we test the predictive power of existing theories of KD as we vary the dataset size. Our results disprove the hypothesis that distillation can be understood as label smoothing, and provide further evidence in support of the dark knowledge hypothesis. Finally, we analyse the impact of modelling factors such as the objective, scale and relative number of samples on the observed phenomenon. Ultimately, this work reveals that the dataset size may be a fundamental but overlooked variable in the mechanisms underpinning distillation.
[348] Compressive Modeling and Visualization of Multivariate Scientific Data using Implicit Neural Representation
Abhay Kumar Dwivedi, Shanu Saklani, Soumya Dutta
Main category: cs.LG
TL;DR: Compressed neural representations for multivariate datasets using a single network with parameter sharing achieve state-of-the-art data compression and superior visualization quality.
Details
Motivation: The growing use of Deep Neural Networks in scientific visualization tasks and recent successes with implicit neural representations for spatiotemporal volume visualization and super-resolution inspired the development of compressed representations for multivariate datasets with many variables.Method: Uses a single network to learn representations for all data variables simultaneously through parameter sharing, enabling efficient compression of multivariate datasets containing tens to hundreds of variables.
Result: Achieves state-of-the-art data compression with superior performance in reconstructed data quality, rendering and visualization quality, preservation of dependency information among variables, and storage efficiency.
Conclusion: The approach successfully develops compressed neural representations for multivariate datasets that outperform existing methods across multiple metrics including data quality, visualization quality, and storage efficiency.
Abstract: The extensive adoption of Deep Neural Networks has led to their increased utilization in challenging scientific visualization tasks. Recent advancements in building compressed data models using implicit neural representations have shown promising results for tasks like spatiotemporal volume visualization and super-resolution. Inspired by these successes, we develop compressed neural representations for multivariate datasets containing tens to hundreds of variables. Our approach utilizes a single network to learn representations for all data variables simultaneously through parameter sharing. This allows us to achieve state-of-the-art data compression. Through comprehensive evaluations, we demonstrate superior performance in terms of reconstructed data quality, rendering and visualization quality, preservation of dependency information among variables, and storage efficiency.
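A minimal sketch of one plausible reading of the parameter-sharing design: a single coordinate-based MLP emits all variables at once, so the hidden layers are shared and only the output head grows with the number of variables (layer sizes here are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class MultivarINR(nn.Module):
    def __init__(self, n_vars, hidden=256, depth=4):
        super().__init__()
        layers, d = [], 3                    # (x, y, z) coordinate input
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.SiLU()]
            d = hidden
        layers += [nn.Linear(d, n_vars)]     # one output per variable
        self.net = nn.Sequential(*layers)

    def forward(self, coords):               # coords: (N, 3) in [-1, 1]^3
        return self.net(coords)              # (N, n_vars)

model = MultivarINR(n_vars=20)
pred = model(torch.rand(1024, 3) * 2 - 1)    # all 20 variables at once
```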
[349] Doubly Robust Estimation of Causal Effects in Strategic Equilibrium Systems
Sibo Xiao
Main category: cs.LG
TL;DR: SDR is a novel causal inference framework combining strategic equilibrium modeling with doubly robust estimation to handle endogenous treatment from strategic agent behavior.
Details
Motivation: Address endogenous treatment assignment caused by strategic agent behavior in causal inference, where traditional methods fail to account for strategic responses to interventions.Method: Integrates strategic equilibrium modeling with doubly robust estimation, maintaining double robustness while incorporating strategic considerations under strategic unconfoundedness.
Result: Achieves 7.6%-29.3% bias reduction over baseline methods across varying strategic strengths, with robust scalability as agent populations increase.
Conclusion: SDR provides a principled approach for reliable causal inference in strategic environments where agents respond strategically to interventions.
Abstract: We introduce the Strategic Doubly Robust (SDR) estimator, a novel framework that integrates strategic equilibrium modeling with doubly robust estimation for causal inference in strategic environments. SDR addresses endogenous treatment assignment arising from strategic agent behavior, maintaining double robustness while incorporating strategic considerations. Theoretical analysis confirms SDR’s consistency and asymptotic normality under strategic unconfoundedness. Empirical evaluations demonstrate SDR’s superior performance over baseline methods, achieving 7.6%-29.3% bias reduction across varying strategic strengths and maintaining robust scalability with agent populations. The framework provides a principled approach for reliable causal inference when agents respond strategically to interventions.
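For context, the classical augmented-IPW (doubly robust) ATE estimator that SDR builds on looks like this; the sketch shows only the standard form, not the strategic-equilibrium component:

```python
import numpy as np

def aipw_ate(y, t, e_hat, mu1_hat, mu0_hat):
    """y: outcomes, t: binary treatment, e_hat: propensity scores,
    mu1_hat / mu0_hat: outcome-model predictions under t=1 / t=0.
    Consistent if either the propensity or the outcome model is correct."""
    psi = (mu1_hat - mu0_hat
           + t * (y - mu1_hat) / e_hat
           - (1 - t) * (y - mu0_hat) / (1 - e_hat))
    return psi.mean()
```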
[350] On the Neural Feature Ansatz for Deep Neural Networks
Edward Tansley, Estelle Massart, Coralia Cartis
Main category: cs.LG
TL;DR: The paper extends the Neural Feature Ansatz (NFA) to multi-layer linear networks, showing depth-dependent exponent α=1/L, proves asymptotic NFA for unbalanced initialization with weight decay, and provides counterexamples for nonlinear networks.
Details
Motivation: To understand feature learning in deep neural networks by extending the Neural Feature Ansatz beyond two-layer linear networks and investigating its depth dependency and limitations.Method: Theoretical analysis using gradient flow dynamics with balanced/unbalanced weight initialization, mathematical proofs for linear networks, counterexamples for nonlinear architectures, and numerical validation across various optimization settings.
Result: Proved NFA holds for L≥2 layer linear networks with exponent α=1/L, showed asymptotic NFA for unbalanced initialization with weight decay, and demonstrated NFA fails for some nonlinear networks despite good training performance.
Conclusion: The NFA exhibits depth dependency in linear networks but has limitations for nonlinear architectures, highlighting the complexity of feature learning in deep neural networks.
Abstract: Understanding feature learning is an important open question in establishing a mathematical foundation for deep neural networks. The Neural Feature Ansatz (NFA) states that after training, the Gram matrix of the first-layer weights of a deep neural network is proportional to some power $\alpha>0$ of the average gradient outer product (AGOP) of this network with respect to its inputs. Assuming gradient flow dynamics with balanced weight initialization, the NFA was proven to hold throughout training for two-layer linear networks with exponent $\alpha = 1/2$ (Radhakrishnan et al., 2024). We extend this result to networks with $L \geq 2$ layers, showing that the NFA holds with exponent $\alpha = 1/L$, thus demonstrating a depth dependency of the NFA. Furthermore, we prove that for unbalanced initialization, the NFA holds asymptotically through training if weight decay is applied. We also provide counterexamples showing that the NFA does not hold for some network architectures with nonlinear activations, even when these networks fit arbitrarily well the training data. We thoroughly validate our theoretical results through numerical experiments across a variety of optimization algorithms, weight decay rates and initialization schemes.
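Restating the ansatz from the abstract in display form, with $W_1$ the first-layer weight matrix, $f$ the network, and the expectation taken over inputs:

$$ W_1^\top W_1 \;\propto\; \Big( \mathbb{E}_x\big[\nabla_x f(x)\, \nabla_x f(x)^\top\big] \Big)^{\alpha}, \qquad \alpha = \tfrac{1}{L} \;\text{for an $L$-layer linear network.} $$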
[351] CQD-SHAP: Explainable Complex Query Answering via Shapley Values
Parsa Abbasi, Stefan Heindorf
Main category: cs.LG
TL;DR: CQD-SHAP is a framework that explains complex query answering by computing the contribution of each query part using Shapley values, addressing interpretability issues in neural and neurosymbolic methods.
Details
Motivation: Current neural and neurosymbolic complex query answering methods are black-box models that lack interpretability, making it difficult for users to understand which parts of the query are most important for the answer.Method: The proposed CQD-SHAP framework uses Shapley values from cooperative game theory to compute the contribution of each query part to answer ranking, satisfying all fundamental Shapley axioms.
Result: Automated evaluation shows CQD-SHAP is effective for most query types in terms of necessary and sufficient explanations, outperforming various baselines.
Conclusion: CQD-SHAP provides interpretable explanations for complex query answering by quantifying the importance of different query parts, enhancing trust in neural predictors that infer new knowledge from incomplete knowledge graphs.
Abstract: Complex query answering (CQA) goes beyond the well-studied link prediction task by addressing more sophisticated queries that require multi-hop reasoning over incomplete knowledge graphs (KGs). Research on neural and neurosymbolic CQA methods is still an emerging field. Almost all of these methods can be regarded as black-box models, which may raise concerns about user trust. Although neurosymbolic approaches like CQD are slightly more interpretable, allowing intermediate results to be tracked, the importance of different parts of the query remains unexplained. In this paper, we propose CQD-SHAP, a novel framework that computes the contribution of each query part to the ranking of a specific answer. This contribution explains the value of leveraging a neural predictor that can infer new knowledge from an incomplete KG, rather than a symbolic approach relying solely on existing facts in the KG. CQD-SHAP is formulated based on Shapley values from cooperative game theory and satisfies all the fundamental Shapley axioms. Automated evaluation of these explanations in terms of necessary and sufficient explanations, and comparisons with various baselines, shows the effectiveness of this approach for most query types.
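Because a complex query decomposes into only a handful of parts, exact Shapley values can be computed by direct enumeration. The sketch below is the textbook computation; the value function v (in CQD-SHAP, derived from how a query part affects the answer's rank) is left as a placeholder:

```python
from itertools import combinations
from math import factorial

def shapley(n, v):
    """n: number of players (query parts); v: callable on frozensets of players."""
    phi = [0.0] * n
    for i in range(n):
        for r in range(n):
            for S in combinations([p for p in range(n) if p != i], r):
                S = frozenset(S)
                # weight of coalition S in the Shapley average
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (v(S | {i}) - v(S))  # marginal contribution of i
    return phi

# Toy value function: worth grows quadratically with coalition size
print(shapley(3, lambda S: float(len(S)) ** 2))   # symmetric players -> equal shares
```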
[352] Attn-JGNN: Attention Enhanced Join-Graph Neural Networks
Jixin Zhang, Yong Lai
Main category: cs.LG
TL;DR: Attn-JGNN is an attention-enhanced join-graph neural network model that improves #SAT solving accuracy by combining tree decomposition with attention mechanisms for better probabilistic inference.
Details
Motivation: To improve the accuracy of #SAT problem solving by enhancing existing neural network approaches with attention mechanisms that focus on key variables and reduce redundant calculations.Method: Uses tree decomposition to encode CNF formulas into join-graphs, performs iterative message passing, applies attention mechanisms within and between clusters, and learns partition functions to approximate model counts.
Result: Attn-JGNN achieves better solving accuracy than other neural network methods for #SAT problems.
Conclusion: The attention-enhanced join-graph neural network approach effectively improves #SAT solving by focusing computational resources on critical variables and clusters during probabilistic inference.
Abstract: We propose an Attention Enhanced Join-Graph Neural Networks (Attn-JGNN) model for solving #SAT problems, which significantly improves solving accuracy. Inspired by the Iterative Join Graph Propagation (IJGP) algorithm, Attn-JGNN uses tree decomposition to encode the CNF formula into a join-graph, then performs iterative message passing on the join-graph, and finally approximates the model count by learning partition functions. To further improve accuracy, we apply attention mechanisms within and between clusters of the join-graph, which makes Attn-JGNN pay more attention to the key variables and clusters in probabilistic inference and reduces redundant calculation. Finally, our experiments show that our Attn-JGNN model achieves better results than other neural network methods.
[353] GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device
Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen
Main category: cs.LG
TL;DR: GRATING is a training-free inference system that accelerates semantic top-K selection by exploiting sequence-level sparsity and progressive cluster pruning, achieving up to 89% latency reduction and 94.9% memory savings without precision loss.
Details
Motivation: Semantic top-K selection with cross-encoder rerankers dominates latency and memory budgets on edge hardware for on-device AI services like retrieval-augmented generation and personalized recommendation.Method: Monolithic forwarding with progressive cluster pruning that leverages sequence-level sparsity where relative rankings stabilize early in intermediate layers. Uses dual-layer sliding window and chunked execution to overlap I/O with computation.
Result: GRATING reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks, and 11.6%-51.0% latency reduction and 18.6%-77.8% memory savings in real-world applications across rerankers from 0.6B to 8B parameters.
Conclusion: GRATING demonstrates substantial improvements in efficiency and deployability for on-device AI applications by exploiting relative ranking properties and sequence-level sparsity without requiring model retraining.
Abstract: Semantic top-K selection with cross-encoder rerankers underpins on-device AI services, such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings stabilize early in intermediate layers, allowing pruning opportunities prior to completing full inference. Building on this insight, we propose monolithic forwarding and develop a training-free inference system, GRATING. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via a dual-layer sliding window and chunked execution. We evaluate GRATING against state-of-the-art baselines on rerankers from 0.6B to 8B parameters across Apple M2 and RTX 5070. GRATING consistently reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks, without any loss in precision. Across three real-world on-device AI applications, GRATING lowers latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, demonstrating substantial improvements in efficiency and deployability.
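A schematic of the progressive-pruning idea as we read it (the partial_score API is hypothetical, not GRATING's actual interface): candidates are partially scored layer by layer, and low-ranked ones are dropped once their relative order has stabilized, so deeper layers process fewer sequences:

```python
def progressive_topk(candidates, partial_score, n_layers, k, keep=0.5):
    """partial_score(c, layer) is assumed to return a ranking-stable
    partial estimate of candidate c's relevance after `layer` layers."""
    alive = list(candidates)
    for layer in range(n_layers):
        scored = sorted(alive, key=lambda c: partial_score(c, layer), reverse=True)
        n_keep = max(k, int(len(scored) * keep))
        alive = scored[:n_keep]          # prune low-ranked candidates early
        if len(alive) == k:
            break                        # ranking settled; skip remaining layers
    return alive[:k]
```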
[354] Decentralized Parameter-Free Online Learning
Tomas Ortega, Hamid Jafarkhani
Main category: cs.LG
TL;DR: First parameter-free decentralized online learning algorithms with sublinear network regret, connecting multi-agent coin-betting and decentralized learning via gossip steps.
Details
Motivation: To develop decentralized online learning algorithms that achieve sublinear regret without requiring hyperparameter tuning, addressing the need for practical distributed learning systems.Method: Introduces a novel betting function formulation for coin-betting that simplifies multi-agent regret analysis, combined with gossip steps for decentralized communication.
Result: Achieves sublinear network regret bounds, validated through experiments on synthetic and real datasets.
Conclusion: The proposed parameter-free decentralized algorithms are applicable to distributed sensing, decentralized optimization, and collaborative machine learning applications.
Abstract: We propose the first parameter-free decentralized online learning algorithms with network regret guarantees, which achieve sublinear regret without requiring hyperparameter tuning. This family of algorithms connects multi-agent coin-betting and decentralized online learning via gossip steps. To enable our decentralized analysis, we introduce a novel “betting function” formulation for coin-betting that simplifies the multi-agent regret analysis. Our analysis shows sublinear network regret bounds and is validated through experiments on synthetic and real datasets. This family of algorithms is applicable to distributed sensing, decentralized optimization, and collaborative ML applications.
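For background, the one-dimensional Krichevsky-Trofimov coin-betting learner is the classical parameter-free building block that the paper's multi-agent betting-function formulation generalizes (a sketch; the coins g_t in [-1, 1] typically arise as negated subgradients):

```python
def kt_bettor(coins):
    """No learning rate anywhere: the bet fraction adapts to past outcomes."""
    wealth, sum_g, iterates = 1.0, 0.0, []
    for t, g in enumerate(coins, start=1):
        beta = sum_g / t              # KT betting fraction from past coins
        w = beta * wealth             # current prediction (the bet)
        iterates.append(w)
        wealth += g * w               # wealth update after observing coin g
        sum_g += g
    return iterates
```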
[355] CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning
Yung-Chen Tang, Pin-Yu Chen, Andrea Cavallaro
Main category: cs.LG
TL;DR: CarBoN is a test-time calibration framework that improves reasoning efficiency by adaptively guiding LLM generation toward high-reward paths using input-specific temperature and shift parameters, achieving up to 4× fewer rollouts for same accuracy.
Details
Motivation: Current test-time scaling methods like Best-of-N sampling show diminishing returns as N increases, leading to inefficient computation during inference for reasoning tasks.Method: Two-phase approach: first explores solution space, then learns calibration via input-specific temperature T and additive shift vector δ to guide generation toward reliable reasoning paths.
Result: Experiments on MATH-500 and AIME-2024 show CarBoN improves efficiency with up to 4× fewer rollouts to reach same accuracy, often achieving higher accuracy under fixed budgets.
Conclusion: The framework effectively balances output diversity and correctness through complementary roles of T and δ, and generalizes to step-level sampling strategies like beam search.
Abstract: Allocating more computation during inference time (test-time scaling) improves language model performance, especially for reasoning tasks. However, popular methods like Best-of-$N$ sampling often show diminishing returns as $N$ increases. To address this inefficiency, we introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound of expected reward under finite sampling, all without large language model (LLM) retraining. Within this framework, we propose CarBoN (Calibrated Best-of-$N$), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature $T$ and additive shift vector $\delta$, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to $4\times$ fewer rollouts to reach the same accuracy, while often achieving higher accuracy under fixed budgets. We also analyze the complementary roles of $T$ and $\delta$ in balancing output diversity and correctness, and demonstrate that the framework also generalizes to step-level sampling strategies such as beam search. For more information, please refer to our project page at huggingface.co/spaces/TrustSafeAI/Test-Time-Calibration.
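The calibration itself is lightweight at inference time. A sketch of the calibrated sampling step, assuming the input-specific T and delta have already been learned in the exploration phase:

```python
import torch

def calibrated_sample(logits, T, delta):
    """Rescale next-token logits by temperature T and shift by delta
    before sampling, steering generation toward higher-reward paths."""
    probs = torch.softmax(logits / T + delta, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```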
[356] Deep Neural ODE Operator Networks for PDEs
Ziqian Li, Kang Liu, Yongcun Song, Hangrui Yue, Enrique Zuazua
Main category: cs.LG
TL;DR: NODE-ONet is a neural ODE operator network that incorporates PDE physics into operator learning, improving temporal dynamics modeling and generalization beyond training time frames.
Details
Motivation: Existing operator learning approaches overlook domain knowledge in PDEs, leading to challenges in capturing temporal dynamics and poor generalization beyond training time frames.Method: Encoder-decoder architecture with three components: spatial encoder, neural ODE for latent temporal dynamics, and decoder for physical space reconstruction. Uses physics-encoded neural ODEs to incorporate PDE-specific properties.
Result: Numerical experiments on nonlinear diffusion-reaction and Navier-Stokes equations show high accuracy, computational efficiency, and prediction capabilities beyond training time frames.
Conclusion: The framework offers flexibility with diverse encoders/decoders, generalizes across related PDE families, and serves as a scalable, physics-encoded tool for scientific machine learning.
Abstract: Operator learning has emerged as a promising paradigm for developing efficient surrogate models to solve partial differential equations (PDEs). However, existing approaches often overlook the domain knowledge inherent in the underlying PDEs and hence suffer from challenges in capturing temporal dynamics and generalization issues beyond training time frames. This paper introduces a deep neural ordinary differential equation (ODE) operator network framework, termed NODE-ONet, to alleviate these limitations. The framework adopts an encoder-decoder architecture comprising three core components: an encoder that spatially discretizes input functions, a neural ODE capturing latent temporal dynamics, and a decoder reconstructing solutions in physical spaces. Theoretically, error analysis for the encoder-decoder architecture is investigated. Computationally, we propose novel physics-encoded neural ODEs to incorporate PDE-specific physical properties. Such well-designed neural ODEs significantly reduce the framework’s complexity while enhancing numerical efficiency, robustness, applicability, and generalization capacity. Numerical experiments on nonlinear diffusion-reaction and Navier-Stokes equations demonstrate high accuracy, computational efficiency, and prediction capabilities beyond training time frames. Additionally, the framework’s flexibility to accommodate diverse encoders/decoders and its ability to generalize across related PDE families further underscore its potential as a scalable, physics-encoded tool for scientific machine learning.
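A minimal sketch of the encoder / latent neural ODE / decoder pipeline (shapes and layer sizes are illustrative, and a fixed-step Euler integrator stands in for a proper ODE solver):

```python
import torch
import torch.nn as nn

class NodeONet(nn.Module):
    def __init__(self, n_grid=64, latent=32):
        super().__init__()
        self.enc = nn.Linear(n_grid, latent)   # spatial discretization -> latent
        self.f = nn.Sequential(nn.Linear(latent, 64), nn.Tanh(),
                               nn.Linear(64, latent))
        self.dec = nn.Linear(latent, n_grid)   # latent -> physical space

    def forward(self, u0, t_steps, dt=0.01):
        z, outs = self.enc(u0), []
        for _ in range(t_steps):
            z = z + dt * self.f(z)             # Euler step of dz/dt = f(z)
            outs.append(self.dec(z))
        return torch.stack(outs, dim=1)        # (batch, t_steps, n_grid)

model = NodeONet()
traj = model(torch.rand(8, 64), t_steps=50)    # rollout can exceed training horizon
```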
[357] Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization
Yefan Zeng, Shengyu Duan, Rishad Shafik, Alex Yakovlev
Main category: cs.LG
TL;DR: The paper presents an optimized Tsetlin Machine implementation using bitwise operations and early exit mechanisms to achieve up to 96.71% reduction in inference time on ARM processors while maintaining code density.
Details
Motivation: To leverage the Tsetlin Machine's logic-driven operations for high-speed inference on resource-constrained devices like CPUs, taking advantage of natural parallel execution capabilities on modern CPU architectures.Method: Proposed an efficient TM implementation using instruction-level bitwise operations for compact model representation, introduced an early exit mechanism to avoid unnecessary computations, and developed a literal reorder strategy applied during post-training to maximize early exits through statistical analysis of literals and Tsetlin Automata actions.
Result: Experimental results using gem5 simulator with ARM processor showed up to 96.71% reduction in inference time compared to conventional integer-based TM implementations while maintaining comparable code density.
Conclusion: The proposed optimizations significantly accelerate Tsetlin Machine inference on CPU architectures through bitwise operations and intelligent early exit strategies, making TM more practical for resource-constrained devices.
Abstract: The Tsetlin Machine (TM) offers high-speed inference on resource-constrained devices such as CPUs. Its logic-driven operations naturally lend themselves to parallel execution on modern CPU architectures. Motivated by this, we propose an efficient software implementation of the TM by leveraging instruction-level bitwise operations for compact model representation and accelerated processing. To further improve inference speed, we introduce an early exit mechanism, which exploits the TM’s AND-based clause evaluation to avoid unnecessary computations. Building upon this, we propose a literal Reorder strategy designed to maximize the likelihood of early exits. This strategy is applied during a post-training, pre-inference stage through statistical analysis of all literals and the corresponding actions of their associated Tsetlin Automata (TA), introducing negligible runtime overhead. Experimental results using the gem5 simulator with an ARM processor show that our optimized implementation reduces inference time by up to 96.71% compared to the conventional integer-based TM implementations while maintaining comparable code density.
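The early exit rests on the fact that a clause is a bitwise AND over its included literals: the first machine word that fails its include mask settles the clause as unsatisfied. A sketch, assuming literals and include masks are packed into integers:

```python
def clause_fires(literal_words, include_masks):
    """literal_words: input literals packed into machine-word ints;
    include_masks: which literals the clause includes, same packing.
    The clause is satisfied iff every included literal is 1."""
    for lits, mask in zip(literal_words, include_masks):
        if (lits & mask) != mask:   # some included literal in this word is 0
            return False            # early exit: the AND can no longer be true
    return True
```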
[358] WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables
Lino Gerlach, Liv Våge, Thore Gerlach, Elliott Kauffman
Main category: cs.LG
TL;DR: WARP-LUTs is a novel gradient-based method that efficiently learns combinations of logic gates with fewer parameters, achieving faster convergence than DLGNs while maintaining comparable accuracy on CIFAR-10.
Details
Motivation: Current multiplication-free models like DLGNs suffer from high computational cost during training and poor generalization to logic blocks with more inputs, despite their impressive hardware-aware design.Method: WARP-LUTs use Walsh-Assisted Relaxation for Probabilistic Look-Up Tables - a gradient-based framework that learns optimal combinations of logic gates with substantially fewer trainable parameters.
Result: WARP-LUTs achieve significantly faster convergence on CIFAR-10 compared to DLGNs while maintaining comparable accuracy.
Conclusion: The approach shows potential for extension to higher-input logic blocks, enabling extremely efficient deployment on modern FPGAs for real-time science applications.
Abstract: Fast and efficient machine learning is of growing interest to the scientific community and has spurred significant research into novel model architectures and hardware-aware design. Recent hard- and software co-design approaches have demonstrated impressive results with entirely multiplication-free models. Differentiable Logic Gate Networks (DLGNs), for instance, provide a gradient-based framework for learning optimal combinations of low-level logic gates, setting state-of-the-art trade-offs between accuracy, resource usage, and latency. However, these models suffer from high computational cost during training and do not generalize well to logic blocks with more inputs. In this work, we introduce Walsh-Assisted Relaxation for Probabilistic Look-Up Tables (WARP-LUTs) - a novel gradient-based method that efficiently learns combinations of logic gates with substantially fewer trainable parameters. We demonstrate that WARP-LUTs achieve significantly faster convergence on CIFAR-10 compared to DLGNs, while maintaining comparable accuracy. Furthermore, our approach suggests potential for extension to higher-input logic blocks, motivating future research on extremely efficient deployment on modern FPGAs and its real-time science applications.
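For orientation, a plain probabilistic (differentiable) look-up table, the object that WARP-LUTs parameterize more economically via the Walsh basis, can be sketched as follows (illustrative only; the Walsh-assisted parameterization itself is not shown):

```python
import torch
import torch.nn as nn

class SoftLUT(nn.Module):
    """A k-input LUT with 2^k learnable truth-table entries; relaxed
    inputs in [0, 1] index the table softly, so gradients flow."""
    def __init__(self, k=4):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(2 ** k))  # one logit per row
        self.k = k

    def forward(self, x):                    # x: (..., k) relaxed bits
        table = torch.sigmoid(self.theta)
        out = torch.zeros(x.shape[:-1])
        for row in range(2 ** self.k):
            bits = [(row >> i) & 1 for i in range(self.k)]
            p = torch.ones(x.shape[:-1])     # probability input equals this row
            for i, b in enumerate(bits):
                p = p * (x[..., i] if b else 1 - x[..., i])
            out = out + p * table[row]
        return out                           # expected LUT output
```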
[359] KS-Net: Multi-layer network model for determining the rotor type from motor parameters in interior PMSMs
Kivanc Dogan, Ahmet Orhan
Main category: cs.LG
TL;DR: This study uses machine learning to classify IPMSM rotor shapes (2D, V, Nabla types) using electromagnetic parameters as a fast alternative to traditional FEM analysis.
Details
Motivation: Traditional finite element method (FEM) for rotor shape analysis in IPMSMs is computationally expensive, creating need for faster, cost-effective alternatives.Method: Used custom deep learning model KS-Net and compared with classical ML algorithms (Cubic SVM, Quadratic SVM, Fine KNN, Cosine KNN, Fine Tree) on 9,000 samples using 10-fold cross-validation.
Result: Cubic SVM and Quadratic SVM achieved 100% accuracy, KS-Net achieved 99.98% accuracy with only 2 misclassifications, demonstrating competitive performance with classical methods.
Conclusion: Machine learning approaches can accurately predict IPMSM rotor shapes, providing fast and cost-effective alternatives to FEM-based analyses for motor design acceleration and automated rotor identification.
Abstract: The demand for high efficiency and precise control in electric drive systems has led to the widespread adoption of Interior Permanent Magnet Synchronous Motors (IPMSMs). The performance of these motors is significantly influenced by rotor geometry. Traditionally, rotor shape analysis has been conducted using the finite element method (FEM), which involves high computational costs. This study aims to classify the rotor shape (2D type, V type, Nabla type) of IPMSMs using electromagnetic parameters through machine learning-based methods and to demonstrate the applicability of this approach as an alternative to classical methods. In this context, a custom deep learning model, KS-Net, was comparatively evaluated against Cubic SVM, Quadratic SVM, Fine KNN, Cosine KNN, and Fine Tree algorithms. The balanced dataset, consisting of 9,000 samples, was tested using 10-fold cross-validation, and performance metrics such as accuracy, precision, recall, and F1-score were employed. The results indicate that the Cubic SVM and Quadratic SVM algorithms classified all samples flawlessly, achieving 100% accuracy, while the KS-Net model achieved 99.98% accuracy with only two misclassifications, demonstrating competitiveness with classical methods. This study shows that the rotor shape of IPMSMs can be predicted with high accuracy using data-driven approaches, offering a fast and cost-effective alternative to FEM-based analyses. The findings provide a solid foundation for accelerating motor design processes, developing automated rotor identification systems, and enabling data-driven fault diagnosis in engineering applications.
[360] Constrained Adversarial Perturbation
Virendra Nishad, Bhaskar Mukhoty, Hilal AlQuabeh, Sandeep K. Shukla, Sayak Ray Chowdhury
Main category: cs.LG
TL;DR: CAP is a novel method for generating universal adversarial perturbations that respect domain-specific feature constraints, achieving higher attack success rates and faster runtime compared to existing approaches.
Details
Motivation: Existing UAP methods ignore domain-specific constraints that govern feature relationships, making adversarial examples implausible or easily detectable in real-world applications like finance and network systems.Method: Formulated an augmented Lagrangian min-max optimization problem to enforce multiple complex constraints, and proposed CAP algorithm using gradient-based alternating optimization strategy.
Result: CAP achieved higher attack success rates while significantly reducing runtime across finance, IT networks, and cyber-physical systems, and also performed well for individual adversarial perturbations.
Conclusion: CAP enables effective universal adversarial attacks in constrained feature spaces and provides a principled way to learn feature constraints from data for broad applicability across structured domains.
Abstract: Deep neural networks have achieved remarkable success in a wide range of classification tasks. However, they remain highly susceptible to adversarial examples - inputs that are subtly perturbed to induce misclassification while appearing unchanged to humans. Among various attack strategies, Universal Adversarial Perturbations (UAPs) have emerged as a powerful tool for both stress testing model robustness and facilitating scalable adversarial training. Despite their effectiveness, most existing UAP methods neglect domain-specific constraints that govern feature relationships. Violating such constraints, such as debt-to-income ratios in credit scoring or packet flow invariants in network communication, can render adversarial examples implausible or easily detectable, thereby limiting their real-world applicability. In this work, we advance universal adversarial attacks to constrained feature spaces by formulating an augmented-Lagrangian-based min-max optimization problem that enforces multiple, potentially complex constraints of varying importance. We propose Constrained Adversarial Perturbation (CAP), an efficient algorithm that solves this problem using a gradient-based alternating optimization strategy. We evaluate CAP across diverse domains including finance, IT networks, and cyber-physical systems, and demonstrate that it achieves higher attack success rates while significantly reducing runtime compared to existing baselines. Our approach also generalizes seamlessly to individual adversarial perturbations, where we observe similar strong performance gains. Finally, we introduce a principled procedure for learning feature constraints directly from data, enabling broad applicability across domains with structured input spaces.
[361] ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
Alex Gu, Bartosz Piotrowski, Fabian Gloeckle, Kaiyu Yang, Aram H. Markosyan
Main category: cs.LG
TL;DR: ProofOptimizer is a language model trained to simplify excessively long Lean proofs generated by neural theorem provers, reducing proof length by 49-87% across benchmarks while maintaining correctness.
Details
Motivation: Neural theorem provers generate mechanically verified but excessively long proofs that are difficult for humans to comprehend, limiting mathematical insight. Proof simplification is a critical bottleneck with scarce training data.Method: ProofOptimizer is trained via expert iteration and reinforcement learning using Lean to verify simplifications. It operates within an iterative proof-shortening workflow that progressively reduces proof length.
Result: ProofOptimizer substantially compresses proofs: 87% reduction on miniF2F, 57% on PutnamBench, and 49% on Seed-Prover’s IMO 2025 proofs. Simplified proofs check faster in Lean and improve downstream prover performance when reused as training data.
Conclusion: ProofOptimizer successfully addresses the proof simplification bottleneck for neural theorem proving, producing more concise, human-comprehensible proofs while maintaining correctness and even improving downstream performance.
Abstract: Neural theorem proving has advanced rapidly in the past year, reaching IMO gold-medalist capabilities and producing formal proofs that span thousands of lines. Although such proofs are mechanically verified by formal systems like Lean, their excessive length renders them difficult for humans to comprehend and limits their usefulness for mathematical insight. Proof simplification is therefore a critical bottleneck. Yet, training data for this task is scarce, and existing methods – mainly agentic scaffolding with off-the-shelf LLMs – struggle with the extremely long proofs generated by RL-trained provers. We introduce ProofOptimizer, the first language model trained to simplify Lean proofs without requiring additional human supervision. ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal. At inference time, it operates within an iterative proof-shortening workflow, progressively reducing proof length. Experiments show that ProofOptimizer substantially compresses proofs generated by state-of-the-art RL-trained provers on standard benchmarks, reducing proof length by 87% on miniF2F, 57% on PutnamBench, and 49% on Seed-Prover’s IMO 2025 proofs. Beyond conciseness, the simplified proofs check faster in Lean and further improve downstream prover performance when reused as training data for supervised finetuning.
[362] ProSh: Probabilistic Shielding for Model-free Reinforcement Learning
Edwin Hamel-De le Court, Gaspard Ohlmann, Francesco Belardinelli
Main category: cs.LG
TL;DR: ProSh is a model-free safe RL algorithm that uses risk augmentation and shielding to ensure safety under cost constraints while preserving optimality in deterministic environments.
Details
Motivation: Safety is critical for deploying RL systems in real-world applications, requiring formal guarantees about system safety under cost constraints.Method: Augments Constrained MDP state space with risk budget and applies a shield to policy distribution using learned cost critic to ensure sampled actions remain safe in expectation.
Result: Provides tight upper-bound on expected cost depending on backup-critic accuracy, and guarantees safety during training under practical assumptions.
Conclusion: ProSh enables safe reinforcement learning with formal safety guarantees while maintaining optimal performance in deterministic environments.
Abstract: Safety is a major concern in reinforcement learning (RL): we aim at developing RL systems that not only perform optimally, but are also safe to deploy by providing formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the Constrained MDP state space with a risk budget and enforces safety by applying a shield to the agent’s policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge we have acquired about the environment. We provide a tight upper-bound on the cost in expectation, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.
[363] RLAF: Reinforcement Learning from Automaton Feedback
Mahyar Alinejad, Alvaro Velasquez, Yue Wang, George Atia
Main category: cs.LG
TL;DR: A novel RL approach using automaton-based preferences instead of explicit reward functions, with static and dynamic methods for policy optimization, outperforming traditional reward engineering in handling non-Markovian rewards.
Details
Motivation: Traditional RL struggles with complex, history-dependent reward structures, requiring manual reward engineering which is difficult and time-consuming.Method: Uses deterministic finite automaton (DFA) to generate preferences over trajectories, learning reward functions automatically. Offers static approach (direct policy optimization) and dynamic approach (iterative reward and policy refinement).
Result: Outperforms traditional reward engineering and automaton-based baselines in discrete and continuous environments with temporal dependencies. Provides convergence guarantee for near-optimal policies.
Conclusion: Automaton-based preferences offer scalable, efficient, human-independent alternative for handling non-Markovian rewards in RL, with proven convergence guarantees.
Abstract: Reinforcement Learning (RL) in environments with complex, history-dependent reward structures poses significant challenges for traditional methods. In this work, we introduce a novel approach that leverages automaton-based feedback to guide the learning process, replacing explicit reward functions with preferences derived from a deterministic finite automaton (DFA). Unlike conventional approaches that use automata for direct reward specification, our method employs the structure of the DFA to generate preferences over trajectories that are used to learn a reward function, eliminating the need for manual reward engineering. Our framework introduces a static approach that uses the learned reward function directly for policy optimization and a dynamic approach that involves continuous refining of the reward function and policy through iterative updates until convergence. Our experiments in both discrete and continuous environments demonstrate that our approach enables the RL agent to learn effective policies for tasks with temporal dependencies, outperforming traditional reward engineering and automaton-based baselines such as reward machines and LTL-guided methods. Our results highlight the advantages of automaton-based preferences in handling non-Markovian rewards, offering a scalable, efficient, and human-independent alternative to traditional reward modeling. We also provide a convergence guarantee showing that under standard assumptions our automaton-guided preference-based framework learns a policy that is near-optimal with respect to the true non-Markovian objective.
[364] A Comprehensive Evaluation of Graph Neural Networks and Physics Informed Learning for Surrogate Modelling of Finite Element Analysis
Nayan Kumar Singh
Main category: cs.LG
TL;DR: This paper evaluates GNNs and 3D U-Nets as FEA surrogates for parametric I-beams, showing GNNs outperform U-Nets, with MPNN and Graph Transformers achieving highest accuracy. A PINN framework with curriculum learning stabilizes training and improves generalization.
Details
Motivation: FEA is computationally expensive for design optimization, and deep learning models can provide efficient alternatives, but selecting the right architecture that accurately emulates FEA is challenging.Method: Comprehensive evaluation of GNNs and 3D U-Nets with Physics-Informed Neural Network (PINN) framework governed by Navier-Cauchy equations. Uses curriculum learning strategy with pretraining on data followed by physics-informed fine-tuning.
Result: GNNs fundamentally outperform U-Nets. MPNN and Graph Transformers achieved highest accuracy (3.5% and 2.6% relative L2 error). PINN improved generalization by up to 11.3%. Graph Transformer is most accurate but 37.5% slower than MPNN PINN.
Conclusion: PINN-enhanced MPNN provides the most practical solution with good compromise between predictive performance, model size, and inference speed.
Abstract: Although Finite Element Analysis (FEA) is an integral part of the product design lifecycle, the analysis is computationally expensive, making it unsuitable for many design optimization problems. Deep learning models can be a great solution; however, selecting an architecture that emulates FEA with high accuracy is a challenge. This paper presents a comprehensive evaluation of graph neural networks (GNNs) and 3D U-Nets as surrogates for FEA of parametric I-beams. We introduce a Physics-Informed Neural Network (PINN) framework, governed by the Navier-Cauchy equations, to enforce physical laws. Crucially, we demonstrate that a curriculum learning strategy, pretraining on data followed by physics-informed fine-tuning, is essential for stabilizing training. Our results show that GNNs fundamentally outperform the U-Nets. Even the worst performer among the GNNs, the GCN framework, achieved a relative L2 error of 8.7%, while the best U-Net variant, a U-Net with an attention mechanism trained on high-resolution data, achieved 13.0%. Among the graph-based architectures, the Message Passing Neural Network (MPNN) and the Graph Transformer achieved the highest accuracy, with relative L2 errors of 3.5% and 2.6% respectively. The inclusion of fundamental physical laws (PINN) significantly improved generalization, reducing error by up to 11.3% on high-signal tasks. While the Graph Transformer is the most accurate model, it is 37.5% slower at inference than the second-best model, MPNN PINN. The PINN-enhanced MPNN (MPNN PINN) therefore provides the most practical solution, offering a good compromise between predictive performance, model size, and inference speed.
[365] AB-UPT for Automotive and Aerospace Applications
Benedikt Alkin, Richard Kurle, Louis Serrano, Dennis Just, Johannes Brandstetter
Main category: cs.LG
TL;DR: AB-UPT neural surrogates achieve near-perfect aerodynamic force predictions in seconds using simple geometry representations, requiring orders of magnitude less compute than traditional CFD solvers.
Details
Motivation: To demonstrate AB-UPT's capabilities for automotive and aircraft CFD simulations with high-quality datasets and state-of-the-art neural surrogates.Method: Used AB-UPT (Anchored-Branched Universal Physics Transformers) with two new datasets (SHIFT-SUV and SHIFT-Wing) generated via Luminary Cloud platform, comparing against transformer-based baselines.
Result: AB-UPT shows strong performance across both datasets, achieving near-perfect prediction of integrated aerodynamic forces within seconds from simple geometry representations, trainable in one day on a single GPU.
Conclusion: AB-UPT enables industry-scale applications by providing fast, accurate CFD simulations with significantly reduced computational requirements compared to traditional numerical solvers.
Abstract: The recently proposed Anchored-Branched Universal Physics Transformers (AB-UPT) shows strong capabilities to replicate automotive computational fluid dynamics simulations requiring orders of magnitude less compute than traditional numerical solvers. In this technical report, we add two new datasets to the body of empirically evaluated use-cases of AB-UPT, combining high-quality data generation with state-of-the-art neural surrogates. Both datasets were generated with the Luminary Cloud platform, containing automotives (SHIFT-SUV) and aircraft (SHIFT-Wing). We start by detailing the data generation. Next, we show favorable performances of AB-UPT against previous state-of-the-art transformer-based baselines on both datasets, followed by extensive qualitative and quantitative evaluations of our best AB-UPT model. AB-UPT shows strong performances across the board. Notably, it obtains near-perfect prediction of integrated aerodynamic forces within seconds from a simple isotropically tessellated geometry representation and is trainable within a day on a single GPU, paving the way for industry-scale applications.
[366] SAMix: Calibrated and Accurate Continual Learning via Sphere-Adaptive Mixup and Neural Collapse
Trung-Anh Dang, Vincent Nguyen, Ngoc-Son Vu, Christel Vrain
Main category: cs.LG
TL;DR: SAMix is a novel adaptive mixup method that improves both performance and calibration in continual learning by leveraging neural collapse properties for better feature alignment and regularization.
Details
Motivation: Most continual learning methods focus on accuracy and forgetting but overlook network calibration, which is crucial for reliable predictions. Neural collapse shows benefits in continual learning, but few works address calibration improvement.Method: Proposed Sphere-Adaptive Mixup (SAMix) - an adaptive mixup strategy tailored for neural collapse-based methods that adapts mixing to geometric properties of feature spaces under neural collapse.
Result: SAMix significantly boosts performance, surpassing state-of-the-art methods in continual learning while improving model calibration. It enhances both across-task accuracy and prediction reliability.
Conclusion: SAMix represents a promising advancement for robust continual learning systems by improving both performance metrics and the broader reliability of predictions through better calibration.
Abstract: While most continual learning methods focus on mitigating forgetting and improving accuracy, they often overlook the critical aspect of network calibration, despite its importance. Neural collapse, a phenomenon where last-layer features collapse to their class means, has demonstrated advantages in continual learning by reducing feature-classifier misalignment. Few works aim to improve the calibration of continual models for more reliable predictions. Our work goes a step further by proposing a novel method that not only enhances calibration but also improves performance by reducing overconfidence, mitigating forgetting, and increasing accuracy. We introduce Sphere-Adaptive Mixup (SAMix), an adaptive mixup strategy tailored for neural collapse-based methods. SAMix adapts the mixing process to the geometric properties of feature spaces under neural collapse, ensuring more robust regularization and alignment. Experiments show that SAMix significantly boosts performance, surpassing SOTA methods in continual learning while also improving model calibration. SAMix enhances both across-task accuracy and the broader reliability of predictions, making it a promising advancement for robust continual learning systems.
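For reference, vanilla mixup, which SAMix adapts to the spherical feature geometry induced by neural collapse (only the standard baseline is shown; SAMix's sphere-adaptive mixing is not reproduced here):

```python
import torch

def mixup(x, y_onehot, alpha=0.2):
    """Convex combination of a batch with a shuffled copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mix, y_mix
```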
[367] Chronos-2: From Univariate to Universal Forecasting
Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, Mononito Goswami, Shubham Kapoor, Danielle C. Maddix, Pablo Guerron, Tony Hu, Junming Yin, Nick Erickson, Prateek Mutalik Desai, Hao Wang, Huzefa Rangwala, George Karypis, Yuyang Wang, Michael Bohlke-Schneider
Main category: cs.LG
TL;DR: Chronos-2 is a pretrained time series forecasting model that handles univariate, multivariate, and covariate-informed forecasting tasks in a zero-shot manner using group attention for efficient in-context learning.
Details
Motivation: Existing pretrained time series models focus mainly on univariate forecasting, limiting their real-world applicability where multivariate data and covariates are crucial.Method: Uses group attention mechanism for in-context learning across multiple time series within groups, trained on synthetic datasets that impose diverse multivariate structures on univariate series.
Result: Achieves state-of-the-art performance across three benchmarks (fev-bench, GIFT-Eval, Chronos Benchmark II), with substantial improvements in multivariate and covariate-informed forecasting, consistently outperforming baselines.
Conclusion: Chronos-2’s in-context learning capabilities establish it as a general-purpose forecasting model that can be used directly in real-world forecasting pipelines without task-specific training.
Abstract: Pretrained time series models have enabled inference-only forecasting systems that produce accurate predictions without task-specific training. However, existing approaches largely focus on univariate forecasting, limiting their applicability in real-world scenarios where multivariate data and covariates play a crucial role. We present Chronos-2, a pretrained model capable of handling univariate, multivariate, and covariate-informed forecasting tasks in a zero-shot manner. Chronos-2 employs a group attention mechanism that facilitates in-context learning (ICL) through efficient information sharing across multiple time series within a group, which may represent sets of related series, variates of a multivariate series, or targets and covariates in a forecasting task. These general capabilities are achieved through training on synthetic datasets that impose diverse multivariate structures on univariate series. Chronos-2 delivers state-of-the-art performance across three comprehensive benchmarks: fev-bench, GIFT-Eval, and Chronos Benchmark II. On fev-bench, which emphasizes multivariate and covariate-informed forecasting, Chronos-2’s universal ICL capabilities lead to substantial improvements over existing models. On tasks involving covariates, it consistently outperforms baselines by a wide margin. Case studies in the energy and retail domains further highlight its practical advantages. The in-context learning capabilities of Chronos-2 establish it as a general-purpose forecasting model that can be used “as is” in real-world forecasting pipelines.
[368] Poultry Farm Intelligence: An Integrated Multi-Sensor AI Platform for Enhanced Welfare and Productivity
Pieris Panagi, Savvas Karatsiolis, Kyriacos Mosphilis, Nicholas Hadjisavvas, Andreas Kamilaris, Nicolas Nicolaou, Efstathios Stavrakis, Vassilis Vassiliades
Main category: cs.LG
TL;DR: PoultryFI is a modular AI platform that integrates six modules for continuous poultry farm monitoring, including camera optimization, audio-visual welfare tracking, analytics, egg counting, forecasting, and recommendations.
Details
Motivation: Small and medium-sized poultry farms lack affordable integrated tools for continuous monitoring and rely on manual inspections, creating a need for cost-effective automated solutions.Method: Uses evolutionary algorithms for camera placement optimization, synchronized audio-visual data analysis, edge vision models for egg counting, forecasting models for production prediction, and integrates weather data for recommendations.
Result: Field trials show 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting up to 10 days ahead.
Conclusion: PoultryFI bridges the gap between pilot tools and scalable farm-wide intelligence, enabling proactive welfare and profitability management.
Abstract: Poultry farming faces increasing pressure to meet productivity targets while ensuring animal welfare and environmental compliance. Yet many small and medium-sized farms lack affordable, integrated tools for continuous monitoring and decision-making, relying instead on manual, reactive inspections. This paper presents Poultry Farm Intelligence (PoultryFI) - a modular, cost-effective platform that integrates six AI-powered modules: Camera Placement Optimizer, Audio-Visual Monitoring, Analytics & Alerting, Real-Time Egg Counting, Production & Profitability Forecasting, and a Recommendation Module. Camera layouts are first optimized offline using evolutionary algorithms for full poultry house coverage with minimal hardware. The Audio-Visual Monitoring module extracts welfare indicators from synchronized video, audio, and feeding data. Analytics & Alerting produces daily summaries and real-time notifications, while Real-Time Egg Counting uses an edge vision model to automate production tracking. Forecasting models predict egg yield and feed consumption up to 10 days in advance, and the Recommendation Module integrates forecasts with weather data to guide environmental and operational adjustments. This is among the first systems to combine low-cost sensing, edge analytics, and prescriptive AI to continuously monitor flocks, predict production, and optimize performance. Field trials demonstrate 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting. PoultryFI bridges the gap between isolated pilot tools and scalable, farm-wide intelligence, empowering producers to proactively safeguard welfare and profitability.
[369] Cavity Duplexer Tuning with 1d Resnet-like Neural Networks
Anton Raskovalov
Main category: cs.LG
TL;DR: Machine learning method for tuning cavity duplexers with many adjustment screws using supervised learning and neural networks with external control.
Details
Motivation: To efficiently tune cavity duplexers with large numbers of adjustment screws, avoiding the limitations of conventional reinforcement learning approaches.Method: Supervised learning setup with neural network architecture featuring 1D ResNet-like backbone and processing of S-parameter characteristics (curve shape, peak positions, amplitudes), combined with external control algorithm.
Result: The system can bring the duplexer to a nearly tuned state within 4-5 rotations per screw.
Conclusion: The proposed supervised learning approach with specialized neural network architecture and external control provides effective tuning of complex cavity duplexers.
Abstract: This paper presents a machine learning method for tuning a cavity duplexer with a large number of adjustment screws. After testing, we rejected the conventional reinforcement learning approach and reformulated the task in a supervised learning setup. The suggested neural network architecture includes a 1D ResNet-like backbone and processing of additional information about the S-parameters, such as the curve shape and the positions and amplitudes of its peaks. Combined with an external control algorithm, this neural network can bring the duplexer to a nearly tuned state within 4-5 rotations per screw.
[370] SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients
Dominik Kallusky, Vinay Rao, Vishal Nandavanam, Hao-Jun Michael Shi
Main category: cs.LG
TL;DR: The paper introduces SNOO (Step-K Nesterov Outer Optimizer), a Lookahead variant that applies Nesterov momentum to pseudo-gradients, achieving 1.5-2.5× compute gains in non-distributed training up to 1e23 FLOPs.
Details
Motivation: To understand why DiLoCo optimizer works effectively in non-distributed settings and develop a more efficient optimization technique for large language models.Method: Proposed SNOO, which applies Nesterov momentum to the pseudo-gradient in a two-loop Lookahead framework, maintaining fast and slow weight sets with minimal overhead.
Result: SNOO achieves compute factor gains of 1.5-2.5× in non-distributed training, with improvements scaling with model size, and works with various inner optimizers like AdamW and Muon.
Conclusion: SNOO is a practical enhancement for LLM training that provides significant compute efficiency gains while maintaining compatibility with existing optimization methods.
Abstract: The rapid development of large language models (LLMs) has driven the demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner optimizer steps on the fast weights produce a trajectory - the pseudo-gradient - that is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, claiming to even outperform AdamW in a non-distributed setup. In this paper, we empirically show that DiLoCo’s surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute factor gains of 1.5 - 2.5$\times$ in a non-distributed setting up to a scale of 1e23 training FLOPs, with improvements that increase with model size. Because of its minimal compute and memory overhead and compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.
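An assumed reading of the outer update, sketched below: after K inner optimizer steps on the fast weights, the displacement of the fast weights from the slow weights serves as a pseudo-gradient to which Nesterov momentum is applied (the names and exact update form are our paraphrase, not the authors' code):

```python
import torch

@torch.no_grad()
def snoo_outer_step(fast_params, slow_params, momentum, mu=0.9, outer_lr=1.0):
    """One outer update after K inner steps on the fast weights.
    momentum = [torch.zeros_like(p) for p in slow_params] initially."""
    for p_fast, p_slow, m in zip(fast_params, slow_params, momentum):
        pseudo_grad = p_slow - p_fast                   # K-step displacement
        m.mul_(mu).add_(pseudo_grad)                    # momentum accumulation
        p_slow.sub_(outer_lr * (pseudo_grad + mu * m))  # Nesterov-style step
        p_fast.copy_(p_slow)                            # resync fast weights
```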
[371] FIDDLE: Reinforcement Learning for Quantum Fidelity Enhancement
Hoang M. Ngo, Tamer Kahveci, My T. Thai
Main category: cs.LG
TL;DR: FIDDLE is a learning framework that directly maximizes quantum circuit process fidelity during routing using Gaussian Process surrogate modeling and reinforcement learning, outperforming traditional indirect methods.
Details
Motivation: Current quantum devices suffer from noise that reduces reliability, and existing transpilation methods use indirect metrics like circuit depth rather than directly optimizing process fidelity.
Method: FIDDLE framework with two modules: Gaussian Process-based surrogate model for fidelity estimation with limited samples, and reinforcement learning module for routing optimization.
Result: FIDDLE provides better fidelity estimation than existing techniques and significantly improves process fidelity across various noise models compared to state-of-the-art methods.
Conclusion: Directly optimizing process fidelity during routing is more effective than traditional indirect approaches, and FIDDLE demonstrates superior performance in improving quantum circuit reliability.
Abstract: Quantum computing has the potential to revolutionize fields like quantum optimization and quantum machine learning. However, current quantum devices are hindered by noise, reducing their reliability. A key challenge in gate-based quantum computing is improving the reliability of quantum circuits, measured by process fidelity, during the transpilation process, particularly in the routing stage. In this paper, we address the Fidelity Maximization in Routing Stage (FMRS) problem by introducing FIDDLE, a novel learning framework comprising two modules: a Gaussian Process-based surrogate model to estimate process fidelity with limited training samples and a reinforcement learning module to optimize routing. Our approach is the first to directly maximize process fidelity, outperforming traditional methods that rely on indirect metrics such as circuit depth or gate count. We rigorously evaluate FIDDLE by comparing it with state-of-the-art fidelity estimation techniques and routing optimization methods. The results demonstrate that our proposed surrogate model is able to provide a better estimation on the process fidelity compared to existing learning techniques, and our end-to-end framework significantly improves the process fidelity of quantum circuits across various noise models.
[372] Self-Certifying Primal-Dual Optimization Proxies for Large-Scale Batch Economic Dispatch
Michael Klamkin, Mathieu Tanneau, Pascal Van Hentenryck
Main category: cs.LG
TL;DR: A hybrid solver combining optimization proxies with classical solvers to guarantee optimality gaps while achieving significant speedups.
Details
Motivation: Optimization proxies achieve high average performance but have unreliable worst-case gaps, making them untrustworthy for practical deployment.
Method: Proposes a hybrid solver using duality theory to bound optimality gaps, with fallback to classical solvers when certification fails, plus combined primal-dual proxy training.
Result: Achieves over 1000x speedup compared to parallelized simplex solver while guaranteeing maximum 2% optimality gap on large-scale transmission systems.
Conclusion: The hybrid approach enables trustworthy deployments with interpretable speed-optimality tradeoffs based on user-defined thresholds.
Abstract: Recent research has shown that optimization proxies can be trained to high fidelity, achieving average optimality gaps under 1% for large-scale problems. However, worst-case analyses show that there exist in-distribution queries that result in orders of magnitude higher optimality gap, making it difficult to trust the predictions in practice. This paper aims at striking a balance between classical solvers and optimization proxies in order to enable trustworthy deployments with interpretable speed-optimality tradeoffs based on a user-defined optimality threshold. To this end, the paper proposes a hybrid solver that leverages duality theory to efficiently bound the optimality gap of predictions, falling back to a classical solver for queries where optimality cannot be certified. To improve the achieved speedup of the hybrid solver, the paper proposes an alternative training procedure that combines the primal and dual proxy training. Experiments on large-scale transmission systems show that the hybrid solver is highly scalable. The proposed hybrid solver achieves speedups of over 1000x compared to a parallelized simplex-based solver while guaranteeing a maximum optimality gap of 2%.
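The certify-or-fallback logic follows from weak duality and fits in a few lines. In the sketch below, the `query`/proxy/solver interfaces are hypothetical stand-ins, and a minimization problem with positive optimum is assumed.

```python
def certified_solve(query, primal_proxy, dual_proxy, exact_solver, max_gap=0.02):
    """Certify-or-fallback sketch for a minimization problem (names illustrative).

    By weak duality, any dual-feasible objective lower-bounds the optimum, so
    (primal - dual) / |dual| upper-bounds the true optimality gap of the
    proxy's (assumed feasible) primal prediction.
    """
    x = primal_proxy(query)                          # fast learned prediction
    primal_obj = query.objective(x)
    dual_obj = query.dual_objective(dual_proxy(query))

    gap_bound = (primal_obj - dual_obj) / abs(dual_obj)
    if gap_bound <= max_gap:
        return x                                     # certified within threshold
    return exact_solver(query)                       # fall back to classical solver
```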
[373] Transfer Orthology Networks
Vikash Singh
Main category: cs.LG
TL;DR: TRON is a neural network architecture for cross-species transfer learning that uses orthologous relationships as a bipartite graph to guide knowledge transfer between species.
Details
Motivation: To enable effective cross-species transfer learning by leveraging biological orthology relationships, allowing better utilization of available transcriptomic data across different species.
Method: Prepend a learned species conversion layer (masked by orthology bipartite graph) to a pre-trained feedforward network, learning linear transformations to map gene expression from source to target species.
Result: The architecture enables efficient knowledge transfer and provides interpretable weights that can reveal functional orthology insights.
Conclusion: TRON offers a biologically grounded and interpretable approach to cross-species transfer learning, with experimental validation currently in progress.
Abstract: We present Transfer Orthology Networks (TRON), a novel neural network architecture designed for cross-species transfer learning. TRON leverages orthologous relationships, represented as a bipartite graph between species, to guide knowledge transfer. Specifically, we prepend a learned species conversion layer, whose weights are masked by the biadjacency matrix of this bipartite graph, to a pre-trained feedforward neural network that predicts a phenotype from gene expression data in a source species. This allows for efficient transfer of knowledge to a target species by learning a linear transformation that maps gene expression from the source to the target species’ gene space. The learned weights of this conversion layer offer a potential avenue for interpreting functional orthology, providing insights into how genes across species contribute to the phenotype of interest. TRON offers a biologically grounded and interpretable approach to cross-species transfer learning, paving the way for more effective utilization of available transcriptomic data. We are in the process of collecting cross-species transcriptomic/phenotypic data to gain experimental validation of the TRON architecture.
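The conversion layer itself is a masked linear map, sketched below. The direction shown (target-species expression converted into the pretrained source model's input space) and the decision to freeze the source network are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpeciesConversionLayer(nn.Module):
    """Linear gene-space map masked by the orthology bipartite graph.

    `biadjacency` is a 0/1 matrix of shape (source_genes, target_genes);
    weights between non-orthologous gene pairs are forced to zero.
    """
    def __init__(self, biadjacency: torch.Tensor):
        super().__init__()
        self.register_buffer("mask", biadjacency.float())
        self.weight = nn.Parameter(0.01 * torch.randn_like(self.mask))

    def forward(self, target_expression):
        # (batch, target_genes) -> (batch, source_genes), so the pretrained
        # source-species model can consume converted expression.
        return target_expression @ (self.weight * self.mask).T

def build_tron(biadjacency, pretrained_source_model):
    """Prepend the conversion layer; freezing the source model is an assumption."""
    for p in pretrained_source_model.parameters():
        p.requires_grad = False
    return nn.Sequential(SpeciesConversionLayer(biadjacency), pretrained_source_model)
```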
[374] Learning Correlated Reward Models: Statistical Barriers and Opportunities
Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Gabriele Farina, Sobhan Mohammadpour
Main category: cs.LG
TL;DR: The paper shows that pairwise preference data is insufficient for learning correlated probit models, but best-of-three preference data enables efficient estimation of correlated utilities with improved personalization.
Details
Motivation: Random Utility Models (RUMs) used in RLHF suffer from the Independence of Irrelevant Alternatives (IIA) assumption, which oversimplifies human preferences by assuming a universal utility function. Existing methods lack statistical and computational guarantees for models that avoid IIA.
Method: The paper investigates learning correlated probit models, which avoid IIA. It first proves pairwise preference data is fundamentally insufficient for learning correlational information, then demonstrates that best-of-three preference data enables efficient estimation with theoretical guarantees.
Result: The proposed estimator using best-of-three preference data achieves statistically and computationally efficient performance with near-optimal guarantees. Experimental validation on real-world datasets shows improved personalization of human preferences.
Conclusion: Higher-order preference data (specifically best-of-three) is crucial for learning correlated utilities, enabling more fine-grained modeling of human preferences and overcoming limitations of traditional pairwise approaches.
Abstract: Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of these techniques is the Independence of Irrelevant Alternatives (IIA) assumption, which collapses \emph{all} human preferences to a universal underlying utility function, yielding a coarse approximation of the range of human preferences. On the other hand, statistical and computational guarantees for models avoiding this assumption are scarce. In this paper, we investigate the statistical and computational challenges of learning a \emph{correlated} probit model, a fundamental RUM that avoids the IIA assumption. First, we establish that the classical data collection paradigm of pairwise preference data is \emph{fundamentally insufficient} to learn correlational information, explaining the lack of statistical and computational guarantees in this setting. Next, we demonstrate that \emph{best-of-three} preference data provably overcomes these shortcomings, and devise a statistically and computationally efficient estimator with near-optimal performance. These results highlight the benefits of higher-order preference data in learning correlated utilities, allowing for more fine-grained modeling of human preferences. Finally, we validate these theoretical guarantees on several real-world datasets, demonstrating improved personalization of human preferences.
[375] METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang
Main category: cs.LG
TL;DR: METIS is a RAG system that jointly schedules queries and adapts RAG configurations to balance quality and response delay, reducing latency by 1.64-2.54× without quality loss.
Details
Motivation: Existing RAG systems either reduce response delay or maximize quality, but fail to optimize the tradeoff between delay and quality in RAG responses.
Method: METIS jointly schedules queries and adapts key RAG configurations per query (number of retrieved text chunks and synthesis methods) to balance quality optimization and delay reduction.
Result: On 4 popular RAG-QA datasets, METIS reduces generation latency by 1.64-2.54× compared to state-of-the-art RAG optimization schemes without sacrificing generation quality.
Conclusion: METIS successfully optimizes the quality-delay tradeoff in RAG systems through joint query scheduling and configuration adaptation.
Abstract: RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, METIS reduces the generation latency by $1.64-2.54\times$ without sacrificing generation quality.
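One way to picture per-query configuration adaptation is a predictor-driven policy like the sketch below. This is an illustrative stand-in, not METIS's actual scheduler; `quality_model` and `delay_model` are hypothetical predictors.

```python
def choose_config(query, configs, quality_model, delay_model, quality_floor=0.9):
    """Sketch of per-query RAG configuration adaptation: among candidate
    configurations (number of retrieved chunks, synthesis method), pick the
    lowest-predicted-delay one whose predicted quality clears a floor."""
    feasible = [c for c in configs if quality_model(query, c) >= quality_floor]
    if not feasible:                       # nothing certifies: maximize quality
        return max(configs, key=lambda c: quality_model(query, c))
    return min(feasible, key=lambda c: delay_model(query, c))

# Example candidate grid: retrieved-chunk count x synthesis method.
configs = [{"chunks": k, "synthesis": s}
           for k in (2, 5, 10) for s in ("stuff", "map-reduce")]
```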
[376] Scalable Multi-phase Word Embedding Using Conjunctive Propositional Clauses
Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo, Bimal Bhattarai
Main category: cs.LG
TL;DR: A novel two-phase training approach for Tsetlin Machine word embeddings that addresses scalability issues while maintaining interpretability, achieving competitive performance on sentiment analysis tasks.
Details
Motivation: Previous Tsetlin Machine word embedding approaches faced scalability challenges with increasing input sizes, limiting their practical application in larger datasets.
Method: Two-phase training that first extracts knowledge for individual words in the vocabulary, then constructs embeddings for input sequences using this extracted knowledge.
Result: The method achieved competitive performance compared to previous approaches and human benchmarks, with successful application to IMDB sentiment analysis providing transparent end-to-end solutions.
Conclusion: The proposed approach enables scalable Tsetlin Machine embeddings while preserving interpretability, offering a viable transparent alternative to conventional ML methods.
Abstract: The Tsetlin Machine (TM) architecture has recently demonstrated effectiveness in Machine Learning (ML), particularly within Natural Language Processing (NLP). It has been utilized to construct word embedding using conjunctive propositional clauses, thereby significantly enhancing our understanding and interpretation of machine-derived decisions. The previous approach performed the word embedding over a sequence of input words to consolidate the information into a cohesive and unified representation. However, that approach encounters scalability challenges as the input size increases. In this study, we introduce a novel approach incorporating two-phase training to discover contextual embeddings of input sequences. Specifically, this method encapsulates the knowledge for each input word within the dataset’s vocabulary, subsequently constructing embeddings for a sequence of input words utilizing the extracted knowledge. This technique not only facilitates the design of a scalable model but also preserves interpretability. Our experimental findings revealed that the proposed method yields competitive performance compared to the previous approaches, demonstrating promising results in contrast to human-generated benchmarks. Furthermore, we applied the proposed approach to sentiment analysis on the IMDB dataset, where the TM embedding and the TM classifier, along with other interpretable classifiers, offered a transparent end-to-end solution with competitive performance.
[377] Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models
Dhruva Karkada, James B. Simon, Yasaman Bahri, Michael R. DeWeese
Main category: cs.LG
TL;DR: The paper analyzes word2vec through a quartic Taylor approximation, showing similar training dynamics and performance. It provides analytical solutions for gradient flow and final embeddings, revealing that models learn orthogonal linear subspaces incrementally.
Details
Motivation: To understand the representation learning dynamics in word2vec through mathematical analysis, specifically examining how semantic concepts emerge during training.
Method: Using quartic Taylor approximation of word2vec loss around origin, analytical solutions for gradient flow training dynamics and final embeddings in terms of corpus statistics and hyperparameters.
Result: Models learn orthogonal linear subspaces incrementally until capacity saturation; each top subspace represents interpretable topic-level concepts; linear representations of abstract semantic concepts emerge during training.
Conclusion: The analytical framework successfully explains word2vec’s training dynamics and semantic representation learning, providing insights into how abstract concepts emerge and can be used for analogies via vector addition.
Abstract: Self-supervised word embedding algorithms such as word2vec provide a minimal setting for studying representation learning in language modeling. We examine the quartic Taylor approximation of the word2vec loss around the origin, and we show that both the resulting training dynamics and the final performance on downstream tasks are empirically very similar to those of word2vec. Our main contribution is to analytically solve for both the gradient flow training dynamics and the final word embeddings in terms of only the corpus statistics and training hyperparameters. The solutions reveal that these models learn orthogonal linear subspaces one at a time, each one incrementing the effective rank of the embeddings until model capacity is saturated. Training on Wikipedia, we find that each of the top linear subspaces represents an interpretable topic-level concept. Finally, we apply our theory to describe how linear representations of more abstract semantic concepts emerge during training; these can be used to complete analogies via vector addition.
[378] Variational Autoencoders for Efficient Simulation-Based Inference
Mayank Nautiyal, Andrey Shternshis, Andreas Hellander, Prashant Singh
Main category: cs.LG
TL;DR: A variational inference approach using latent variables in VAEs for likelihood-free simulation-based inference, with two variations: adaptive prior network and standard Gaussian prior.
Details
Motivation: To efficiently estimate complex posterior distributions from stochastic simulations in likelihood-free inference scenarios where traditional methods are computationally expensive.
Method: Uses variational autoencoders with latent variables, exploring two variations: one with adaptive multivariate prior network that adapts to observed data, and another with standard Gaussian prior for simplicity.
Result: Demonstrates ability to approximate complex posteriors while maintaining computational efficiency on established benchmark problems.
Conclusion: The proposed generative modeling approach effectively handles complex posterior distributions in simulation-based inference, with both adaptive and simple prior variations showing promising results.
Abstract: We present a generative modeling approach based on the variational inference framework for likelihood-free simulation-based inference. The method leverages latent variables within variational autoencoders to efficiently estimate complex posterior distributions arising from stochastic simulations. We explore two variations of this approach distinguished by their treatment of the prior distribution. The first model adapts the prior based on observed data using a multivariate prior network, enhancing generalization across various posterior queries. In contrast, the second model utilizes a standard Gaussian prior, offering simplicity while still effectively capturing complex posterior distributions. We demonstrate the ability of the proposed approach to approximate complex posteriors while maintaining computational efficiency on well-established benchmark problems.
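A minimal sketch of the first variant, a conditional VAE whose prior network adapts to the observation, is given below. Layer sizes and the Gaussian decoder are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn

class SBIVAE(nn.Module):
    """Minimal conditional-VAE sketch for simulation-based inference.

    q(z | theta, x) encodes parameters and observations; p(z | x) is the
    adaptive prior network; the decoder models p(theta | z, x).
    """
    def __init__(self, theta_dim, x_dim, z_dim=8, h=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(theta_dim + x_dim, h), nn.ReLU(),
                                     nn.Linear(h, 2 * z_dim))
        self.prior_net = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(),
                                       nn.Linear(h, 2 * z_dim))  # adaptive prior
        self.decoder = nn.Sequential(nn.Linear(z_dim + x_dim, h), nn.ReLU(),
                                     nn.Linear(h, 2 * theta_dim))

    def elbo(self, theta, x):
        mu_q, logvar_q = self.encoder(torch.cat([theta, x], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior_net(x).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        mu_t, logvar_t = self.decoder(torch.cat([z, x], -1)).chunk(2, -1)
        # Gaussian log-likelihood of the simulator parameters (up to a constant).
        log_lik = -0.5 * ((theta - mu_t) ** 2 / logvar_t.exp() + logvar_t).sum(-1)
        # KL( q(z|theta,x) || p(z|x) ) between diagonal Gaussians.
        kl = 0.5 * ((logvar_p - logvar_q)
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1)
        return (log_lik - kl).mean()
```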
[379] Retro3D: A 3D-aware Template-free Method for Enhancing Retrosynthesis via Molecular Conformer Information
Jiaxi Zhuang, Yu Zhang, Yan Zhang, Ying Qian, Aimin Zhou
Main category: cs.LG
TL;DR: A novel transformer-based retrosynthesis method that incorporates 3D molecular conformations and spatial information through Atom-align Fusion and Distance-weighted Attention mechanisms, outperforming previous template-free approaches.
Details
Motivation: Existing retrosynthesis methods overlook 3D conformational details and spatial organization, making it challenging to predict reactants that follow genuine chemical principles, especially for complex structures like polycyclic and heteroaromatic compounds.
Method: Transformer-based template-free approach with Atom-align Fusion module for integrating 3D positional data and Distance-weighted Attention mechanism that refines self-attention by focusing on relevant atom pairs in 3D space.
Result: Extensive experiments on USPTO-50K dataset show the model outperforms previous template-free methods and sets a new benchmark. Case study demonstrates ability to predict reasonable and accurate reactants.
Conclusion: The proposed 3D-aware transformer approach successfully addresses limitations of existing methods by incorporating spatial information, providing more chemically accurate retrosynthesis predictions for complex molecular structures.
Abstract: Retrosynthesis plays a crucial role in the fields of organic synthesis and drug development, where the goal is to identify suitable reactants that can yield a target product molecule. Although existing methods have achieved notable success, they typically overlook the 3D conformational details and internal spatial organization of molecules. This oversight makes it challenging to predict reactants that conform to genuine chemical principles, particularly when dealing with complex molecular structures, such as polycyclic and heteroaromatic compounds. In response to this challenge, we introduce a novel transformer-based, template-free approach that incorporates 3D conformer data and spatial information. Our approach includes an Atom-align Fusion module that integrates 3D positional data at the input stage, ensuring correct alignment between atom tokens and their respective 3D coordinates. Additionally, we propose a Distance-weighted Attention mechanism that refines the self-attention process, constricting the model’s focus to relevant atom pairs in 3D space. Extensive experiments on the USPTO-50K dataset demonstrate that our model outperforms previous template-free methods, setting a new benchmark for the field. A case study further highlights our method’s ability to predict reasonable and accurate reactants.
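The Distance-weighted Attention idea can be sketched as a distance bias on the attention logits; the exact weighting form used in the paper may differ, so the learned per-head scale below is an assumption.

```python
import torch
import torch.nn as nn

class DistanceWeightedAttention(nn.Module):
    """Self-attention biased by 3D inter-atomic distances (illustrative form).

    Logits are penalized in proportion to pairwise distance, so attention
    concentrates on atom pairs that are close in 3D space.
    """
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.attn_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.dist_scale = nn.Parameter(torch.ones(heads))  # learned per-head scale

    def forward(self, atoms, coords):
        B, N, _ = atoms.shape
        q, k, v = self.qkv(atoms).chunk(3, -1)
        q, k, v = (t.view(B, N, self.heads, self.attn_dim).transpose(1, 2)
                   for t in (q, k, v))
        dist = torch.cdist(coords, coords)                 # (B, N, N) distances
        logits = q @ k.transpose(-2, -1) / self.attn_dim ** 0.5
        logits = logits - self.dist_scale.view(1, -1, 1, 1) * dist.unsqueeze(1)
        attn = logits.softmax(-1)
        return self.out((attn @ v).transpose(1, 2).reshape(B, N, -1))
```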
[380] WebInject: Prompt Injection Attack to Web Agents
Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, Neil Zhenqiang Gong
Main category: cs.LG
TL;DR: WebInject is a prompt injection attack that manipulates webpage screenshots to make MLLM-based web agents perform attacker-specified actions by adding pixel perturbations.
Details
Motivation: To demonstrate vulnerabilities in MLLM-based web agents that interact with webpages through screenshots, showing they can be manipulated via pixel-level perturbations.
Method: Formulate perturbation finding as optimization problem, train neural network to approximate the non-differentiable screenshot mapping, and use projected gradient descent to solve the optimization.
Result: WebInject is highly effective across multiple datasets and significantly outperforms baseline methods.
Conclusion: MLLM-based web agents are vulnerable to pixel-level prompt injection attacks, highlighting security concerns in these systems.
Abstract: Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. In this work, we propose WebInject, a prompt injection attack that manipulates the webpage environment to induce a web agent to perform an attacker-specified action. Our attack adds a perturbation to the raw pixel values of the rendered webpage. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the attacker-specified action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple datasets shows that WebInject is highly effective and significantly outperforms baselines.
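The optimization loop is standard projected gradient descent through the trained surrogate. In the minimal sketch below, `render_net` and `agent_loss` are hypothetical stand-ins for the learned mapping approximation and the attacker's objective, and the $L_\infty$ ball is an assumed perturbation set.

```python
import torch

def webinject_pgd(raw_pixels, render_net, agent_loss, steps=200,
                  step_size=1e-2, eps=8 / 255):
    """PGD through a differentiable surrogate of the rendering map (sketch).

    `render_net` approximates the non-differentiable raw-pixel -> screenshot
    mapping; `agent_loss` scores how strongly the screenshot induces the
    attacker-specified action (lower = closer to the target action).
    """
    delta = torch.zeros_like(raw_pixels, requires_grad=True)
    for _ in range(steps):
        screenshot = render_net(raw_pixels + delta)   # differentiable surrogate
        loss = agent_loss(screenshot)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()    # signed gradient step
            delta.clamp_(-eps, eps)                   # project into the L-inf ball
            delta.grad.zero_()
    return (raw_pixels + delta).clamp(0, 1).detach()
```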
[381] FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model
Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li
Main category: cs.LG
TL;DR: FEMBA is a novel self-supervised framework for EEG analysis that uses bidirectional state-space modeling to achieve linear scaling with sequence length, making it more efficient than Transformer models while maintaining competitive performance.
Details
Motivation: Traditional deep learning models like Transformers have quadratic time and memory complexity, making them unsuitable for resource-constrained environments in EEG analysis for seizure and artifact detection.
Method: FEMBA uses bidirectional state-space modeling to scale linearly with sequence length, trained on over 21,000 hours of unlabeled EEG data and fine-tuned on downstream tasks.
Result: FEMBA achieves 81.82% balanced accuracy (0.8921 AUROC) on TUAB and 0.949 AUROC on TUAR, with a tiny 7.8M-parameter variant showing viability for resource-constrained devices.
Conclusion: FEMBA enables scalable, general-purpose EEG analytics with significantly lower computational cost than transformers, making it promising for both clinical and wearable applications.
Abstract: Accurate and efficient electroencephalography (EEG) analysis is essential for detecting seizures and artifacts in long-term monitoring, with applications spanning hospital diagnostics to wearable health devices. Robust EEG analytics have the potential to greatly improve patient care. However, traditional deep learning models, especially Transformer-based architectures, are hindered by their quadratic time and memory complexity, making them less suitable for resource-constrained environments. To address these challenges, we present FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel self-supervised framework that establishes new efficiency benchmarks for EEG analysis through bidirectional state-space modeling. Unlike Transformer-based models, which incur quadratic time and memory complexity, FEMBA scales linearly with sequence length, enabling more scalable and efficient processing of extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and fine-tuned on three downstream tasks, FEMBA achieves competitive performance in comparison with transformer models, with significantly lower computational cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates viability for resource-constrained devices. These results pave the way for scalable, general-purpose EEG analytics in both clinical and wearable settings, and highlight FEMBA as a promising candidate for wearable applications.
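The efficiency argument rests on linear-time state-space scans. The sketch below shows a plain bidirectional linear SSM recurrence; Mamba's input-dependent gating and discretization are omitted, so this illustrates only the linear-complexity structure, not FEMBA's full block.

```python
import torch

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:                       # one pass: O(T), linear in length
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return torch.stack(ys)

def bidirectional_ssm(x, fwd, bwd):
    """Bidirectional state-space sketch: a forward scan plus a backward scan
    over the EEG sequence, outputs summed. `fwd` and `bwd` are (A, B, C)
    parameter triples (illustrative parameterization)."""
    return ssm_scan(x, *fwd) + ssm_scan(x.flip(0), *bwd).flip(0)
```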
[382] A Weakly Supervised Transformer for Rare Disease Diagnosis and Subphenotyping from EHRs with Pulmonary Case Studies
Kimberly F. Greco, Zongxin Yang, Mengyan Li, Han Tong, Sara Morini Sweet, Alon Geva, Kenneth D. Mandl, Benjamin A. Raby, Tianxi Cai
Main category: cs.LG
TL;DR: WEST is a weakly supervised transformer framework that combines EHR data with limited expert-validated cases to enable large-scale rare disease phenotyping, outperforming existing methods in classification, subphenotyping, and progression prediction.
Details
Motivation: Rare diseases affect millions but remain underdiagnosed due to low prevalence and limited clinician familiarity. Computational phenotyping is hindered by scarce high-quality labeled data, with expert-labeled datasets being limited and EHR-derived labels being noisy.
Method: WEST employs a weakly supervised transformer model trained on probabilistic silver-standard labels derived from structured and unstructured EHR features, which are iteratively refined during training to improve model calibration. It combines routinely collected EHR data with a limited set of expert-validated cases and controls.
Result: WEST outperforms existing methods in phenotype classification, identification of clinically meaningful subphenotypes, and prediction of disease progression when evaluated on two rare pulmonary diseases using EHR data from Boston Children’s Hospital.
Conclusion: By reducing reliance on manual annotation, WEST enables data-efficient rare disease phenotyping that improves cohort definition, supports earlier and more accurate diagnosis, and accelerates data-driven discovery for the rare disease community.
Abstract: Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions remain underdiagnosed and poorly characterized due to their low prevalence and limited clinician familiarity. Computational phenotyping offers a scalable approach to improving rare disease detection, but algorithm development is hindered by the scarcity of high-quality labeled data for training. Expert-labeled datasets from chart reviews and registries are clinically accurate but limited in scope and availability, whereas labels derived from electronic health records (EHRs) provide broader coverage but are often noisy or incomplete. To address these challenges, we propose WEST (WEakly Supervised Transformer for rare disease phenotyping and subphenotyping from EHRs), a framework that combines routinely collected EHR data with a limited set of expert-validated cases and controls to enable large-scale phenotyping. At its core, WEST employs a weakly supervised transformer model trained on extensive probabilistic silver-standard labels - derived from both structured and unstructured EHR features - that are iteratively refined during training to improve model calibration. We evaluate WEST on two rare pulmonary diseases using EHR data from Boston Children’s Hospital and show that it outperforms existing methods in phenotype classification, identification of clinically meaningful subphenotypes, and prediction of disease progression. By reducing reliance on manual annotation, WEST enables data-efficient rare disease phenotyping that improves cohort definition, supports earlier and more accurate diagnosis, and accelerates data-driven discovery for the rare disease community.
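The iterative silver-label refinement can be sketched with any probabilistic learner standing in for the transformer; below, a gradient-boosted regressor plays that role, and the blending weight and gold-anchoring scheme are illustrative assumptions, not the paper's calibration procedure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def west_label_refinement(X, silver_probs, gold_idx, gold_labels, rounds=5):
    """Weak supervision with iterative silver-label refinement (sketch).

    Each round fits the current probabilistic labels, anchors the
    expert-validated cases/controls at their true labels, and blends
    predictions back into the silver labels.
    """
    labels = silver_probs.astype(float).copy()
    model = None
    for _ in range(rounds):
        model = GradientBoostingRegressor().fit(X, labels)
        preds = np.clip(model.predict(X), 0.0, 1.0)
        preds[gold_idx] = gold_labels             # keep gold labels fixed
        labels = 0.5 * labels + 0.5 * preds       # refine silver labels
    return model, labels
```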
[383] Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs
Nir Ailon, Akhiad Bercovich, Omri Weinstein
Main category: cs.LG
TL;DR: The paper proposes Strassen-Tile (STL), a GPU-native bilinear operator that replaces matrix multiplications in neural networks, offering a tradeoff between speed, accuracy, and parameter count with significantly fewer FLOPs.
Details
Motivation: Modern AI faces scalability problems due to huge matrix multiplications during inference and training, creating computational bottlenecks.
Method: STL applies local learnable change-of-basis transformations on tiles of weight and activation matrices, followed by element-wise products implemented via MatMul, with theory-backed initializations inspired by fast matrix and polynomial multiplication.
Result: STL can approximate 4x4 MatMul of tiles while reducing FLOPs by 2.66x, improves Imagenet-1K accuracy of T2T-ViT-7 while lowering FLOPs, and achieves wall-clock speedups even with non-optimized PyTorch code.
Conclusion: STL is a promising building block for scalable and cost-efficient AI due to its theoretical grounds and demonstrated performance improvements.
Abstract: Modern AI relies on huge matrix multiplications (MatMuls), whose computation poses a scalability problem for inference and training. We propose an alternative, GPU native bilinear operator to MatMuls in neural networks, which offers a three-way tradeoff between: speed, accuracy and parameter count. In particular, this operator requires substantially fewer FLOPs to evaluate ($\ll n^3$), yet increases the parameter count compared to MatMul ($\gg n^2$). We call this operator Strassen-Tile (STL). The key idea behind STL is a local learnable change-of-basis, applied on tiles of the weight and activation matrices, followed by an element-wise product between the tiles, implemented simultaneously via MatMul. The key technical question we study is how to optimize the change-of-basis of a given layer, which is a highly non-convex problem. We show that theory-backed initializations (inspired by fast matrix and polynomial multiplication) lead to substantially better accuracy than random SGD initialization. This phenomenon motivates further algorithmic study of STL optimization in DNNs. Our experiments demonstrate that STL can approximate 4x4 MatMul of tiles while reducing FLOPs by a factor of 2.66, and can improve Imagenet-1K accuracy of SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-CUDA optimized PyTorch code, STL achieves wall-clock speedups in the compute-bound regime. These results, together with its theoretical grounds, suggest STL as a promising building block for scalable and cost-efficient AI.
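The operator is easy to express with tensor reshapes: a learned change of basis per tile, an element-wise product in the new basis, and a decode back to tile space. In the sketch below the bases are plain learned matrices (the paper's theory-backed initializations are omitted), the tile and rank sizes are illustrative, and matrix dimensions are assumed divisible by the tile size.

```python
import torch
import torch.nn as nn

class StrassenTile(nn.Module):
    """Sketch of a Strassen-Tile (STL) style bilinear operator.

    Each t x t tile is mapped through a learned change of basis into an
    r-dimensional space, tiles are multiplied element-wise there, and the
    result is decoded back to tile space.
    """
    def __init__(self, t=4, r=16):
        super().__init__()
        self.t, self.r = t, r
        self.enc_x = nn.Parameter(torch.randn(r, t * t) / t)
        self.enc_w = nn.Parameter(torch.randn(r, t * t) / t)
        self.dec = nn.Parameter(torch.randn(t * t, r) / r ** 0.5)

    def forward(self, X, W):
        t = self.t
        n, m, p = X.shape[0] // t, X.shape[1] // t, W.shape[1] // t
        # Flatten tiles: (grid_i, grid_j, t*t).
        Xt = X.view(n, t, m, t).permute(0, 2, 1, 3).reshape(n, m, t * t)
        Wt = W.view(m, t, p, t).permute(0, 2, 1, 3).reshape(m, p, t * t)
        Xc = Xt @ self.enc_x.T                     # change of basis: (n, m, r)
        Wc = Wt @ self.enc_w.T                     # change of basis: (m, p, r)
        # Element-wise product in the new basis, contracted over inner tiles
        # (this contraction is itself a batched MatMul on GPUs).
        Yc = torch.einsum('ikr,kjr->ijr', Xc, Wc)
        Yt = Yc @ self.dec.T                       # decode back to tile space
        return Yt.view(n, p, t, t).permute(0, 2, 1, 3).reshape(n * t, p * t)
```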
[384] Learning to Interpret Weight Differences in Language Models
Avichal Goel, Yoon Kim, Nir Shavit, Tony T. Wang
Main category: cs.LG
TL;DR: Diff Interpretation Tuning (DIT) trains models to describe their own finetuning-induced weight changes using natural language, enabling interpretable understanding of model modifications.
Details
Motivation: Weight changes from finetuning are not interpretable, and finetuning datasets are often unavailable or too large, making it difficult to understand how models have been modified.
Method: Uses synthetic, labeled weight diffs to train a DIT-adapter that can be applied to finetuned models to make them describe their changes in natural language.
Result: In proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge), models accurately describe their finetuning-induced modifications using natural language.
Conclusion: DIT enables comprehensive understanding of weight diffs through natural language descriptions, making model finetuning modifications interpretable.
Abstract: Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes (“weight diffs”) are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of comprehensively understanding weight diffs in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train a DIT-adapter, which can be applied to a compatible finetuned model to make it describe how it has changed. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using accurate natural language descriptions.
[385] Interpretable Hybrid-Rule Temporal Point Processes
Yunyang Cao, Juekai Lin, Hongye Wang, Wenhao Li, Bo Jin
Main category: cs.LG
TL;DR: HRTPP is a novel framework that integrates temporal logic rules with numerical features to improve both interpretability and predictive accuracy in Temporal Point Processes for medical event modeling.
Details
Motivation: Traditional Temporal Point Processes lack interpretability, and recent interpretable TPPs fail to incorporate numerical features, limiting their predictive precision in medical applications.
Method: HRTPP combines three intensity components: basic intensity for intrinsic event likelihood, rule-based intensity for temporal dependencies, and numerical feature intensity for dynamic probability modulation. It uses a two-phase rule mining strategy with Bayesian optimization to discover valid rules.
Result: Experimental results on real-world medical datasets show HRTPP outperforms state-of-the-art interpretable TPPs in both predictive performance and clinical interpretability. Extracted rules effectively explain disease progression.
Conclusion: HRTPP successfully addresses the limitations of existing interpretable TPPs by integrating numerical features with temporal logic rules, providing both accurate predictions and clinically meaningful interpretations for medical diagnosis.
Abstract: Temporal Point Processes (TPPs) are widely used for modeling event sequences in various medical domains, such as disease onset prediction, progression analysis, and clinical decision support. Although TPPs effectively capture temporal dynamics, their lack of interpretability remains a critical challenge. Recent advancements have introduced interpretable TPPs. However, these methods fail to incorporate numerical features, thereby limiting their ability to generate precise predictions. To address this issue, we propose Hybrid-Rule Temporal Point Processes (HRTPP), a novel framework that integrates temporal logic rules with numerical features, improving both interpretability and predictive accuracy in event modeling. HRTPP comprises three key components: basic intensity for intrinsic event likelihood, rule-based intensity for structured temporal dependencies, and numerical feature intensity for dynamic probability modulation. To effectively discover valid rules, we introduce a two-phase rule mining strategy with Bayesian optimization. To evaluate our method, we establish a multi-criteria assessment framework, incorporating rule validity, model fitting, and temporal predictive accuracy. Experimental results on real-world medical datasets demonstrate that HRTPP outperforms state-of-the-art interpretable TPPs in terms of predictive performance and clinical interpretability. In case studies, the rules extracted by HRTPP explain the disease progression, offering valuable contributions to medical diagnosis.
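The additive three-component intensity can be sketched directly. The softplus link and the linear forms below are illustrative choices, and the mined temporal-logic rules are assumed to be precomputed as 0/1 indicators.

```python
import torch
import torch.nn as nn

class HybridRuleIntensity(nn.Module):
    """Sketch of an HRTPP-style additive conditional intensity.

    lambda(t) = softplus( base + rule-based term + numerical-feature term ),
    where each mined rule contributes a learned weight whenever its body
    holds in the event history.
    """
    def __init__(self, n_rules, n_features):
        super().__init__()
        self.base = nn.Parameter(torch.zeros(1))       # intrinsic likelihood
        self.rule_weights = nn.Parameter(torch.zeros(n_rules))
        self.feat_proj = nn.Linear(n_features, 1)      # dynamic modulation

    def forward(self, rule_indicators, features):
        # rule_indicators: (batch, n_rules) 0/1 rule evaluations at time t.
        # features: (batch, n_features) numerical covariates (labs, vitals, ...).
        score = (self.base
                 + rule_indicators @ self.rule_weights
                 + self.feat_proj(features).squeeze(-1))
        return nn.functional.softplus(score)           # intensity must be >= 0
```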
[386] Hyperbolic Dataset Distillation
Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Main category: cs.LG
TL;DR: Proposes HDD, a hyperbolic dataset distillation method that embeds data into Lorentz hyperbolic space to preserve hierarchical relationships, achieving efficient distillation with only 20% of core set needed for comparable performance.
Details
Motivation: Existing distribution matching methods for dataset distillation operate in Euclidean space and treat data as independent points, overlooking complex geometric and hierarchical relationships in data.
Method: Embeds features into Lorentz hyperbolic space using a shallow network, measures discrepancy between synthetic and original data using hyperbolic geodesic distance between centroids, and optimizes this distance to integrate hierarchical structure.
Result: HDD successfully preserves hierarchical structure in distilled data, requires only 20% of distilled core set to retain model performance through hyperbolic pruning, and significantly improves training stability.
Conclusion: HDD is the first dataset distillation method to incorporate hyperbolic space, effectively capturing hierarchical relationships while maintaining computational efficiency and enabling aggressive pruning without performance loss.
Abstract: To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. To the best of our knowledge, this is the first work to incorporate the hyperbolic space into the dataset distillation process. The code is available at https://github.com/Guang000/HDD.
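The hyperbolic machinery reduces to a few formulas in the Lorentz model: the Lorentzian inner product, a centroid rescaled back onto the hyperboloid, and the geodesic distance. The sketch below assumes features have already been lifted onto the hyperboloid (curvature -1) and uses the standard closed-form centroid, which may differ in detail from the paper's.

```python
import torch

def lorentz_inner(u, v):
    """Lorentzian inner product <u, v>_L = -u0*v0 + sum_i ui*vi."""
    prod = u * v
    return -prod[..., 0] + prod[..., 1:].sum(-1)

def lorentz_centroid(points):
    """Centroid of hyperboloid points: rescale the mean so <mu, mu>_L = -1."""
    mean = points.mean(0)
    return mean / torch.sqrt(torch.clamp(-lorentz_inner(mean, mean), min=1e-8))

def hdd_matching_loss(synth_feats, real_feats):
    """Matching objective sketch: geodesic distance between the centroids of
    synthetic and original features on the hyperboloid."""
    d = -lorentz_inner(lorentz_centroid(synth_feats), lorentz_centroid(real_feats))
    return torch.acosh(torch.clamp(d, min=1.0 + 1e-7))
```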
[387] When Does Closeness in Distribution Imply Representational Similarity? An Identifiability Perspective
Beatrix M. G. Nielsen, Emanuele Marconato, Andrea Dittadi, Luigi Gresele
Main category: cs.LG
TL;DR: This paper analyzes when deep neural networks learn similar representations, finding that models with similar output distributions can still have dissimilar representations, and proposes a new distributional distance that better correlates with representational similarity.
Details
Motivation: To understand when and why different neural networks learn similar representations, addressing the gap between distributional similarity and representational similarity.
Method: Uses identifiability theory to define representational similarity measures, analyzes models including autoregressive language models, proves theoretical results about KL divergence limitations, and conducts experiments on CIFAR-10 and synthetic data.
Result: Shows that small KL divergence between model distributions doesn’t guarantee similar representations, and that models with near-maximum likelihood can learn dissimilar representations. Defines a new distributional distance that better predicts representational similarity, finding wider networks learn closer distributions and more similar representations.
Conclusion: Clarifies the relationship between distributional closeness and representational similarity, providing theoretical and empirical evidence that standard distributional measures like KL divergence are insufficient for predicting representational similarity.
Abstract: When and why representations learned by different deep neural networks are similar is an active research topic. We choose to address these questions from the perspective of identifiability theory, which suggests that a measure of representational similarity should be invariant to transformations that leave the model distribution unchanged. Focusing on a model family which includes several popular pre-training approaches, e.g., autoregressive language models, we explore when models which generate distributions that are close have similar representations. We prove that a small Kullback–Leibler divergence between the model distributions does not guarantee that the corresponding representations are similar. This has the important corollary that models with near-maximum data likelihood can still learn dissimilar representations – a phenomenon mirrored in our experiments with models trained on CIFAR-10. We then define a distributional distance for which closeness implies representational similarity, and in synthetic experiments, we find that wider networks learn distributions which are closer with respect to our distance and have more similar representations. Our results thus clarify the link between closeness in distribution and representational similarity.
[388] MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning
Yihong Guo, Yu Yang, Pan Xu, Anqi Liu
Main category: cs.LG
TL;DR: MOBODY is a model-based offline RL algorithm for off-dynamics settings that uses separate action encoders and target Q-weighted behavior cloning to effectively handle dynamics mismatch between source and target domains.
Details
Motivation: Existing off-dynamics RL methods fail when dynamics shift is significant or optimal trajectories lie outside low-shift regions, as they only use data from low-shift areas, limiting exploration of high-reward states in the target domain.
Method: MOBODY uses learned target dynamics transitions to explore the target domain, employs separate action encoders for each domain with shared state representations, and introduces target Q-weighted behavior cloning to avoid out-of-distribution actions.
Result: MOBODY outperforms state-of-the-art off-dynamics RL baselines on MuJoCo and Adroit benchmarks, with especially pronounced improvements in challenging scenarios where existing methods struggle.
Conclusion: The proposed MOBODY algorithm effectively addresses limitations of existing off-dynamics RL methods by enabling exploration of target domain using learned dynamics and target Q-weighted policy optimization.
Abstract: We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from offline source and limited target datasets with mismatched dynamics. Existing methods either penalize the reward or discard source transitions occurring in parts of the transition space with high dynamics shift. As a result, they optimize the policy using data from low-shift regions, limiting exploration of high-reward states in the target domain that do not fall within these regions. Consequently, such methods often fail when the dynamics shift is significant or the optimal trajectories lie outside the low-shift regions. To overcome this limitation, we propose MOBODY, a Model-Based Off-Dynamics Offline RL algorithm that optimizes a policy using learned target dynamics transitions to explore the target domain, rather than only being trained with the low dynamics-shift transitions. For the dynamics learning, built on the observation that achieving the same next state requires taking different actions in different domains, MOBODY employs separate action encoders for each domain to encode different actions to the shared latent space while sharing a unified representation of states and a common transition function. We further introduce a target Q-weighted behavior cloning loss in policy optimization to avoid out-of-distribution actions, which pushes the policy toward actions with high target-domain Q-values, rather than high source domain Q-values or uniformly imitating all actions in the offline dataset. We evaluate MOBODY on a wide range of MuJoCo and Adroit benchmarks, demonstrating that it outperforms state-of-the-art off-dynamics RL baselines as well as policy learning methods based on different dynamics learning baselines, with especially pronounced improvements in challenging scenarios where existing methods struggle.
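The target Q-weighted behavior cloning term can be sketched as a critic-driven reweighting of the imitation loss. The softmax weighting and temperature below are illustrative, and the `policy`/`target_q` interfaces are hypothetical.

```python
import torch

def target_q_weighted_bc_loss(policy, target_q, states, actions, beta=1.0):
    """Target-Q-weighted behavior cloning (sketch; interfaces hypothetical).

    Dataset actions are re-weighted by exponentiated target-domain Q-values,
    so the policy clones actions the target critic rates highly instead of
    uniformly imitating the offline data.
    """
    with torch.no_grad():
        q = target_q(states, actions)             # target-domain critic values
        w = torch.softmax(beta * q, dim=0)        # batch-normalized weights
    return -(w * policy.log_prob(states, actions)).sum()
```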
[389] Personalized Semi-Supervised Federated Learning for Human Activity Recognition
Riccardo Presotto, Gabriele Civitarese, Claudio Bettini
Main category: cs.LG
TL;DR: FedAR is a hybrid method combining semi-supervised and federated learning for Human Activity Recognition that addresses data scarcity and privacy issues by using active learning and label propagation with minimal manual annotation.
Details
Motivation: Address the scarcity of labeled data in sensor-based HAR and overcome privacy/scalability issues of centralized semi-supervised learning by leveraging federated learning.
Method: Combines active learning and label propagation to semi-automatically annotate unlabeled sensor data locally, uses federated learning to build global models, and includes transfer learning for personalization.
Result: Achieves recognition rates and personalization capabilities similar to state-of-the-art FL supervised approaches while requiring only minimal annotated data and decreasing active learning queries over time.
Conclusion: FedAR provides an effective and scalable solution for HAR data scarcity by combining semi-supervised and federated learning with minimal manual annotation requirements.
Abstract: One of the major open problems in sensor-based Human Activity Recognition (HAR) is the scarcity of labeled data. Among the many solutions to address this challenge, semi-supervised learning approaches represent a promising direction. However, their centralised architecture incurs the scalability and privacy problems that arise when the process involves a large number of users. Federated Learning (FL) is a promising paradigm to address these problems. However, the FL methods that have been proposed for HAR assume that the participating users can always obtain labels to train their local models (i.e., they assume a fully supervised setting). In this work, we propose FedAR: a novel hybrid method for HAR that combines semi-supervised and federated learning to take advantage of the strengths of both approaches. FedAR combines active learning and label propagation to semi-automatically annotate the local streams of unlabeled sensor data, and it relies on FL to build a global activity model in a scalable and privacy-aware fashion. FedAR also includes a transfer learning strategy to fine-tune the global model on each user. We evaluated our method on two public datasets, showing that FedAR reaches recognition rates and personalization capabilities similar to state-of-the-art FL supervised approaches. As a major advantage, FedAR only requires a very limited amount of annotated data to populate a pre-trained model and a small number of active learning questions, which quickly decreases while using the system, leading to an effective and scalable solution for the data scarcity problem of HAR.
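At the server, a FedAR-style round looks like weighted federated averaging over locally semi-annotated data; the client interface in this sketch is entirely hypothetical.

```python
import numpy as np

def fedar_round(global_weights, clients):
    """One federated round (sketch). Each client semi-automatically labels its
    local stream (active learning + label propagation), fine-tunes locally,
    and the server averages updates weighted by local label counts (FedAvg)."""
    updates, counts = [], []
    for client in clients:
        local_labels = client.annotate()           # active learning + propagation
        local_w = client.train(global_weights, local_labels)
        updates.append(local_w)
        counts.append(len(local_labels))
    total = float(sum(counts))
    return [np.sum([c / total * u[i] for c, u in zip(counts, updates)], axis=0)
            for i in range(len(global_weights))]
```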
[390] FinHEAR: Human Expertise and Adaptive Risk-Aware Temporal Reasoning for Financial Decision-Making
Jiaxiang Chen, Mingxi Zou, Zhuo Wang, Qifan Wang, Dongning Sun, Chi Zhang, Zenglin Xu
Main category: cs.LG
TL;DR: FinHEAR is a multi-agent LLM framework that addresses financial decision-making challenges by incorporating behavioral economics principles, expert-guided retrieval, and adaptive risk assessment to improve performance in trend prediction and trading tasks.
Details
Motivation: LLMs often fail to capture human behavioral patterns in financial decisions like expert reliance under information asymmetry, loss aversion, and feedback-driven temporal adjustment, despite having strong general reasoning capabilities.
Method: FinHEAR uses a multi-agent framework with specialized LLM-based agents in an event-centric pipeline. It incorporates expert-guided retrieval, confidence-adjusted position sizing, and outcome-based refinement grounded in behavioral economics principles.
Result: Empirical results on curated financial datasets show FinHEAR consistently outperforms strong baselines in trend prediction and trading tasks, achieving higher accuracy and better risk-adjusted returns.
Conclusion: The FinHEAR framework successfully enhances financial decision-making by integrating behavioral economics insights and specialized agent coordination, demonstrating improved performance over existing approaches.
Abstract: Financial decision-making presents unique challenges for language models, demanding temporal reasoning, adaptive risk assessment, and responsiveness to dynamic events. While large language models (LLMs) show strong general reasoning capabilities, they often fail to capture behavioral patterns central to human financial decisions, such as expert reliance under information asymmetry, loss-averse sensitivity, and feedback-driven temporal adjustment. We propose FinHEAR, a multi-agent framework for Human Expertise and Adaptive Risk-aware reasoning. FinHEAR orchestrates specialized LLM-based agents to analyze historical trends, interpret current events, and retrieve expert-informed precedents within an event-centric pipeline. Grounded in behavioral economics, it incorporates expert-guided retrieval, confidence-adjusted position sizing, and outcome-based refinement to enhance interpretability and robustness. Empirical results on curated financial datasets show that FinHEAR consistently outperforms strong baselines across trend prediction and trading tasks, achieving higher accuracy and better risk-adjusted returns.
[391] Photovoltaic power forecasting using quantum machine learning
Asel Sagingalieva, Stefan Komornyik, Arsenii Senokosov, Ayush Joshi, Christopher Mansell, Olga Tsurkan, Karan Pinto, Markus Pflitsch, Alexey Melnikov
Main category: cs.LG
TL;DR: Hybrid quantum neural networks significantly improve photovoltaic power forecasting accuracy and data efficiency compared to classical methods, with 40%+ error reduction in some cases.
Details
Motivation: Accurate PV power forecasting is crucial for grid integration but remains challenging due to variable irradiance, complex meteorological factors, and device-specific behavior. Current ML approaches may not be optimal, and quantum models could offer better performance.
Method: Two hybrid quantum neural network architectures: 1) Hybrid Quantum Long Short-Term Memory model for time-series forecasting, and 2) Hybrid Quantum Sequence-to-Sequence model that predicts power for arbitrary horizons without prior meteorological inputs.
Result: Hybrid Quantum LSTM reduced MAE and MSE by over 40% vs best baselines. Hybrid Quantum Seq2Seq achieved 16% lower MAE than best baseline and can predict arbitrary horizons without meteorological inputs. Both models maintain superior accuracy with limited training data.
Conclusion: Hybrid quantum models effectively address key challenges in PV power forecasting and provide a practical path to more reliable, data-efficient energy predictions.
Abstract: Accurate forecasting of photovoltaic power is essential for reliable grid integration, yet remains difficult due to highly variable irradiance, complex meteorological drivers, site geography, and device-specific behavior. Although contemporary machine learning has achieved successes, it is not clear that these approaches are optimal: new model classes may further enhance performance and data efficiency. We investigate hybrid quantum neural networks for time-series forecasting of photovoltaic power and introduce two architectures. The first, a Hybrid Quantum Long Short-Term Memory model, reduces mean absolute error and mean squared error by more than 40% relative to the strongest baselines evaluated. The second, a Hybrid Quantum Sequence-to-Sequence model, once trained, predicts power for arbitrary forecast horizons without requiring prior meteorological inputs and achieves a 16% lower mean absolute error than the best baseline on this task. Both hybrid models maintain superior accuracy when training data are limited, indicating improved data efficiency. These results show that hybrid quantum models address key challenges in photovoltaic power forecasting and offer a practical route to more reliable, data-efficient energy predictions.
[392] Optimistic Query Routing in Clustering-based Approximate Maximum Inner Product Search
Sebastian Bruch, Aditya Krishnan, Franco Maria Nardini
Main category: cs.LG
TL;DR: This paper introduces an optimistic routing framework for clustering-based nearest neighbor search that uses distribution moments to estimate maximum inner products, reducing points probed by up to 50% while maintaining accuracy.
Details
Motivation: Existing clustering-based nearest neighbor search methods lack effective routing algorithms for identifying which shards to probe, despite this being crucial for search efficacy.
Method: Proposed a framework incorporating moments of inner product distributions within shards to estimate maximum inner products, using optimism principles from sequential decision making. Designed a space-efficient sketch of the second moment with size independent of point count.
Result: Achieved same accuracy as state-of-the-art routers like ScaNN while probing up to 50% fewer points on benchmark datasets. The algorithm requires only O(1) vectors per shard.
Conclusion: The optimistic routing approach using distribution moments significantly improves efficiency in clustering-based nearest neighbor search while maintaining accuracy and space efficiency.
Abstract: Clustering-based nearest neighbor search is an effective method in which points are partitioned into geometric shards to form an index, with only a few shards searched during query processing to find a set of top-$k$ vectors. Even though the search efficacy is heavily influenced by the algorithm that identifies the shards to probe, it has received little attention in the literature. This work bridges that gap by studying routing in clustering-based maximum inner product search. We unpack existing routers and notice the surprising contribution of optimism. We then take a page from the sequential decision making literature and formalize that insight following the principle of “optimism in the face of uncertainty.” In particular, we present a framework that incorporates the moments of the distribution of inner products within each shard to estimate the maximum inner product. We then present an instance of our algorithm that uses only the first two moments to reach the same accuracy as state-of-the-art routers such as ScaNN by probing up to $50\%$ fewer points on benchmark datasets. Our algorithm is also space-efficient: we design a sketch of the second moment whose size is independent of the number of points and requires $\mathcal{O}(1)$ vectors per shard.
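The two-moment router admits a very short sketch: estimate the mean and variance of the query-shard inner products and rank shards optimistically. Storing the full second-moment matrix, as below, is for clarity only; the paper replaces it with a space-efficient sketch. The scoring form `mean + c * std` is an illustrative instance of the optimism principle.

```python
import numpy as np

def optimistic_route(query, shard_means, shard_second_moments, top=4, c=1.0):
    """Optimistic shard routing from the first two moments (sketch).

    shard_means: (S, d) per-shard mean vectors.
    shard_second_moments: (S, d, d) per-shard second-moment matrices.
    Scores each shard with an optimistic estimate of the max inner product.
    """
    mu = shard_means @ query                                  # E[<x, q>] per shard
    second = np.einsum('sij,i,j->s', shard_second_moments, query, query)
    sigma = np.sqrt(np.maximum(second - mu ** 2, 0.0))        # Std[<x, q>]
    scores = mu + c * sigma                                   # optimistic estimate
    return np.argsort(-scores)[:top]                          # shards to probe
```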
[393] Hopfield-Fenchel-Young Networks: A Unified Framework for Associative Memory Retrieval
Saul Santos, Vlad Niculae, Daniel McNamee, André F. T. Martins
Main category: cs.LG
TL;DR: A unified framework called Hopfield-Fenchel-Young networks that generalizes associative memory models using Fenchel-Young losses and different entropy functions, enabling sparse transformations and structured pattern retrieval.
Details
Motivation: To create a unified framework that generalizes traditional and modern Hopfield networks, connecting them with self-attention mechanisms in transformers and enabling sparse transformations and structured pattern associations.
Method: Proposes Hopfield-Fenchel-Young networks with energies formulated as differences between two Fenchel-Young losses, using Tsallis and norm entropies to derive differentiable update rules, and extending to structured networks with SparseMAP transformation.
Result: The framework unifies various Hopfield network variants, provides energy minimization perspectives for common transformations like normalization, and enables exact retrieval of single patterns and structured associations.
Conclusion: The Hopfield-Fenchel-Young networks framework successfully generalizes associative memory models, connects them with transformer self-attention, and demonstrates effectiveness on diverse memory recall tasks including image retrieval and text rationalization.
Abstract: Associative memory models, such as Hopfield networks and their modern variants, have garnered renewed interest due to advancements in memory capacity and connections with self-attention in transformers. In this work, we introduce a unified framework, Hopfield-Fenchel-Young networks, which generalizes these models to a broader family of energy functions. Our energies are formulated as the difference between two Fenchel-Young losses: one, parameterized by a generalized entropy, defines the Hopfield scoring mechanism, while the other applies a post-transformation to the Hopfield output. By utilizing Tsallis and norm entropies, we derive end-to-end differentiable update rules that enable sparse transformations, uncovering new connections between loss margins, sparsity, and exact retrieval of single memory patterns. We further extend this framework to structured Hopfield networks using the SparseMAP transformation, allowing the retrieval of pattern associations rather than a single pattern. Our framework unifies and extends traditional and modern Hopfield networks and provides an energy minimization perspective for widely used post-transformations like $\ell_2$-normalization and layer normalization, all through suitable choices of Fenchel-Young losses and by using convex analysis as a building block. Finally, we validate our Hopfield-Fenchel-Young networks on diverse memory recall tasks, including free and sequential recall. Experiments on simulated data, image retrieval, multiple instance learning, and text rationalization demonstrate the effectiveness of our approach.
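The retrieval rule this framework generalizes can be sketched in a few lines: score the stored patterns against a query, map the scores through a probability transform, and return the resulting convex combination of memories. In the minimal NumPy sketch below, sparsemax stands in for a Tsallis-entropy-induced transform (a softmax here would recover the standard modern Hopfield update); the energy-difference formulation itself is not reproduced.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (Martins & Astudillo, 2016)."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = z_sorted + (1 - cumsum) / k > 0
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max
    return np.maximum(z - tau, 0.0)

def hopfield_update(query, memories, beta=4.0, transform=sparsemax):
    """One retrieval step: sparse transforms permit exact retrieval of a single pattern."""
    p = transform(beta * memories @ query)  # (possibly sparse) attention over patterns
    return memories.T @ p                   # convex combination of stored patterns

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 16))               # five stored patterns
q = M[2] + 0.1 * rng.normal(size=16)       # noisy probe of pattern 2
print(np.allclose(hopfield_update(q, M), M[2], atol=0.2))  # True: exact retrieval
```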
[394] Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound
Tal Fiskus, Uri Shaham
Main category: cs.LG
TL;DR: A novel causal bound on factual loss in DRL using Neyman-Rubin framework, improving sample efficiency and reducing buffer size requirements.
Details
Motivation: DRL agents require extensive training steps and large replay buffers, leading to high computational costs and resource demands.
Method: Leverage Neyman-Rubin potential outcomes framework to establish causal bound on factual loss, storing past value network outputs in replay buffer to utilize typically discarded data.
Result: Achieved up to 383% higher reward ratio, reduced experience replay buffer size by up to 96%, and significantly improved sample efficiency with negligible cost across Atari 2600 and MuJoCo domains on DQN and SAC agents.
Conclusion: The proposed causal bound method effectively addresses computational and resource challenges in DRL by improving sample efficiency and reducing buffer requirements while maintaining performance.
Abstract: Deep reinforcement learning (DRL) agents excel in solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that brings the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. Extensive experiments across the Atari 2600 and MuJoCo domains on various agents, such as DQN and SAC, show up to a 383% higher reward ratio, outperforming the same agents without our proposed term, and a reduction in experience replay buffer size of up to 96%, significantly improving sample efficiency at negligible cost.
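The mechanism of storing past value network outputs in the replay buffer can be sketched directly. Below is an illustrative PyTorch fragment, not the paper's exact bound: each transition additionally records the value estimate at insertion time, and an auxiliary penalty compares it with the current estimate at replay time. The class and function names are hypothetical.

```python
import collections
import random
import torch

Transition = collections.namedtuple(
    "Transition", "state action reward next_state q_at_store")

class RecyclingBuffer:
    """Replay buffer that also keeps the value network's output at insertion
    time, data that is usually discarded."""
    def __init__(self, capacity=100_000):
        self.buf = collections.deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, q_net):
        with torch.no_grad():
            q_then = q_net(state)[action].item()  # detached snapshot
        self.buf.append(Transition(state, action, reward, next_state, q_then))

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

def factual_penalty(q_net, batch, weight=0.1):
    """Illustrative regularizer: keep current predictions close to the stored
    (factual) ones, in the spirit of bounding the factual loss."""
    q_now = torch.stack([q_net(t.state)[t.action] for t in batch])
    q_then = torch.tensor([t.q_at_store for t in batch])
    return weight * torch.mean((q_now - q_then) ** 2)
```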
[395] Recursive Gaussian Process State Space Model
Tengjie Zheng, Haipeng Chen, Lin Cheng, Shengping Gong, Xu Huang
Main category: cs.LG
TL;DR: Proposes a recursive Gaussian Process State-Space Model with adaptive capabilities for online learning, featuring domain-independent learning, lightweight inducing point selection, and online hyperparameter optimization.
Details
Motivation: Address the lack of efficient online learning methods for GPSSMs when prior information about data distribution and model function is limited, enabling applications in principle discovery, time-series prediction, and controller design.
Method: Uses first-order linearization for Bayesian update of joint state-GP distribution, develops online inducing point selection based on informative criteria, and recovers historical measurement information from current filtering distribution for hyperparameter optimization.
Result: Demonstrates superior accuracy, computational efficiency, and adaptability compared to state-of-the-art online GPSSM techniques on both synthetic and real-world datasets.
Conclusion: The proposed recursive GPSSM method effectively addresses online learning challenges in scenarios with limited prior information, providing an efficient and adaptive solution for dynamical system modeling.
Abstract: Learning dynamical models from data is not only fundamental but also holds great promise for advancing principle discovery, time-series prediction, and controller design. Among various approaches, Gaussian Process State-Space Models (GPSSMs) have recently gained significant attention due to their combination of flexibility and interpretability. However, for online learning, the field lacks an efficient method suitable for scenarios where prior information regarding data distribution and model function is limited. To address this issue, this paper proposes a recursive GPSSM method with adaptive capabilities for both operating domains and Gaussian process (GP) hyperparameters. Specifically, we first utilize first-order linearization to derive a Bayesian update equation for the joint distribution between the system state and the GP model, enabling closed-form and domain-independent learning. Second, an online selection algorithm for inducing points is developed based on informative criteria to achieve lightweight learning. Third, to support online hyperparameter optimization, we recover historical measurement information from the current filtering distribution. Comprehensive evaluations on both synthetic and real-world datasets demonstrate the superior accuracy, computational efficiency, and adaptability of our method compared to state-of-the-art online GPSSM techniques.
[396] A Multimodal Lightweight Approach to Fault Diagnosis of Induction Motors in High-Dimensional Dataset
Usman Ali
Main category: cs.LG
TL;DR: This paper presents a transfer-learning-based lightweight deep learning model (ShuffleNetV2) for diagnosing broken rotor bar faults in induction motors using current and vibration signals, achieving 98.856% accuracy on a large dataset of 57,500 spectral images.
Details
Motivation: To address the limitation of small datasets in existing fault diagnosis approaches that risk overfitting in industrial environments, and to enhance proactive maintenance by accurately detecting broken rotor bar faults in induction motors.
Method: Used transfer learning with ShuffleNetV2 model on large-scale dataset (57,500 images) generated from current and vibration signals using Short-Time Fourier Transform (STFT). Applied FFT for better visualization of harmonic sidebands.
Result: ShuffleNetV2 achieved superior performance with 98.856% classification accuracy for detecting 1-4 broken rotor bars, with less computational cost compared to other models. The model was trained on 47,500 images and tested on 10,000 images.
Conclusion: The research provides valuable insights into model performance and efficiency, offering a foundation for developing robust fault diagnosis systems for induction motors in industrial settings with reduced computational requirements.
Abstract: An accurate AI-based diagnostic system for induction motors (IMs) holds the potential to enhance proactive maintenance, mitigating unplanned downtime and curbing overall maintenance costs within an industrial environment. Notably, among the prevalent faults in IMs, a Broken Rotor Bar (BRB) fault is frequently encountered. Researchers have proposed various fault diagnosis approaches using signal processing (SP), machine learning (ML), deep learning (DL), and hybrid architectures for BRB faults. One limitation in the existing literature is the training of these architectures on relatively small datasets, risking overfitting when implementing such systems in industrial environments. This paper addresses this limitation by applying a transfer-learning-based lightweight DL model named ShuffleNetV2 to large-scale BRB fault data, diagnosing one, two, three, and four BRB faults using current and vibration signal data. Spectral images for training and testing are generated using a Short-Time Fourier Transform (STFT). The dataset comprises 57,500 images, with 47,500 used for training and 10,000 for testing. Remarkably, the ShuffleNetV2 model exhibited superior performance, accurately classifying 98.856% of spectral images at a lower computational cost. To further enhance the visualization of harmonic sidebands resulting from broken bars, a Fast Fourier Transform (FFT) is applied to the current and vibration data. The paper also provides insights into the training and testing times for each model, contributing to a comprehensive understanding of the proposed fault diagnosis methodology. The findings of our research provide valuable insights into the performance and efficiency of different ML and DL models, offering a foundation for the development of robust fault diagnosis systems for induction motors in industrial settings.
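The pipeline is straightforward to reproduce in outline: convert each signal window into an STFT image, then fine-tune a pretrained ShuffleNetV2 head. The sketch below assumes a 10 kHz sampling rate, a five-class head (healthy plus one to four broken bars), and frozen backbone features; none of these details are taken from the paper.

```python
import numpy as np
import torch.nn as nn
from scipy.signal import stft
from torchvision.models import shufflenet_v2_x1_0

def signal_to_spectrogram(x, fs=10_000, nperseg=256):
    """Turn a 1-D current/vibration signal into a 3-channel log-magnitude STFT image."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    img = np.log1p(np.abs(Z))
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # normalize to [0, 1]
    return np.repeat(img[None], 3, axis=0).astype(np.float32)

# Transfer learning: reuse ImageNet features, retrain only the classifier head.
model = shufflenet_v2_x1_0(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 5)  # healthy + 1..4 broken rotor bars
```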
[397] On the Interaction of Compressibility and Adversarial Robustness
Melih Barsbey, Antônio H. Ribeiro, Umut Şimşekli, Tolga Birdal
Main category: cs.LG
TL;DR: The paper shows that neural network compression (sparsity, spectral compression) creates vulnerabilities to adversarial attacks by introducing sensitive directions in representation space, revealing a fundamental tension between compressibility and robustness.
Details
Motivation: To understand the interaction between neural network compressibility and adversarial robustness, as these properties are both desirable but their relationship remains unclear despite extensive individual study.
Method: Developed a principled framework to analyze how neuron-level sparsity and spectral compressibility affect robustness, showing compression induces sensitive directions that adversaries can exploit. Derived robustness bounds and conducted empirical evaluations across synthetic and realistic tasks.
Result: Compression creates a small number of highly sensitive directions that enable effective adversarial perturbations. These vulnerabilities persist under adversarial training and transfer learning, and contribute to universal adversarial perturbations.
Conclusion: There is a fundamental tension between structured compressibility and robustness, suggesting new pathways are needed to design models that are both efficient and secure.
Abstract: Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a simple yet instructive robustness bound, revealing how neuron and spectral compressibility impact $L_\infty$ and $L_2$ robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compression is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness, and suggest new pathways for designing models that are both efficient and secure.
[398] REX: Causal discovery based on machine learning and explainability techniques
Jesus Renero, Idoia Ochoa, Roberto Maestre
Main category: cs.LG
TL;DR: ReX is a novel causal discovery method that combines machine learning models with explainability techniques (Shapley values) to identify causal relationships, outperforming state-of-the-art methods on synthetic and real-world datasets.
Details
Motivation: Current causal discovery methods lack explainability, which is crucial for understanding complex systems in healthcare, economics, and AI. There's a need to integrate explainability into causal discovery to enhance interpretability.
Method: ReX leverages machine learning models coupled with Shapley values explainability techniques to identify and interpret significant causal relationships among variables.
Result: ReX outperforms state-of-the-art causal discovery methods on synthetic datasets with non-linear and additive noise models. On the Sachs single-cell protein-signaling dataset, it achieved 0.952 precision with no incorrect edges.
Conclusion: ReX effectively bridges predictive modeling and causal inference, offering a robust tool for understanding complex causal structures while minimizing false positives across diverse datasets.
Abstract: Explainable Artificial Intelligence (XAI) techniques hold significant potential for enhancing the causal discovery process, which is crucial for understanding complex systems in areas like healthcare, economics, and artificial intelligence. However, no causal discovery methods currently incorporate explainability into their models to derive the causal graphs. Thus, in this paper we explore this innovative approach, as it offers substantial potential and represents a promising new direction worth investigating. Specifically, we introduce ReX, a causal discovery method that leverages machine learning (ML) models coupled with explainability techniques, specifically Shapley values, to identify and interpret significant causal relationships among variables. Comparative evaluations on synthetic datasets comprising continuous tabular data reveal that ReX outperforms state-of-the-art causal discovery methods across diverse data generation processes, including non-linear and additive noise models. Moreover, ReX was tested on the Sachs single-cell protein-signaling dataset, achieving a precision of 0.952 and recovering key causal relationships with no incorrect edges. Taken together, these results showcase ReX’s effectiveness in accurately recovering true causal structures while minimizing false positive predictions, its robustness across diverse datasets, and its applicability to real-world problems. By combining ML and explainability techniques with causal discovery, ReX bridges the gap between predictive modeling and causal inference, offering an effective tool for understanding complex causal structures.
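A simplified version of the ML-plus-Shapley pass is easy to write down: for each variable, fit a model predicting it from the others and treat mean absolute SHAP values as candidate edge strengths. This is only a sketch of the idea; ReX's actual orientation and thresholding steps are not reproduced, and the helper name is hypothetical.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

def shapley_parent_scores(X, names):
    """Score each candidate parent of every variable by mean |SHAP| value."""
    scores = {}
    for j, target in enumerate(names):
        rest = [k for k in range(X.shape[1]) if k != j]
        model = GradientBoostingRegressor().fit(X[:, rest], X[:, j])
        sv = shap.TreeExplainer(model).shap_values(X[:, rest])
        scores[target] = dict(zip([names[k] for k in rest],
                                  np.abs(sv).mean(axis=0).round(3)))
    return scores

# Toy chain x -> y -> z: y should score high for z, and x high for y.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x ** 2 + 0.1 * rng.normal(size=500)
z = np.sin(y) + 0.1 * rng.normal(size=500)
print(shapley_parent_scores(np.column_stack([x, y, z]), ["x", "y", "z"]))
```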
[399] Bayesian Optimization with Preference Exploration using a Monotonic Neural Network Ensemble
Hanyang Wang, Juergen Branke, Matthias Poloczek
Main category: cs.LG
TL;DR: Proposes a neural network ensemble for Bayesian Optimization with Preference Exploration that incorporates monotonicity constraints and handles pairwise comparisons, outperforming state-of-the-art methods.
Details
Motivation: Many real-world black-box optimization problems have multiple conflicting objectives, and interactive preference learning can focus search on relevant subsets. Previous approaches haven't sufficiently exploited the monotonic nature of utility functions.
Method: Uses a neural network ensemble as a utility surrogate model that naturally integrates monotonicity constraints and supports pairwise comparison data.
Result: The proposed method outperforms state-of-the-art approaches and shows robustness to noise in utility evaluations. An ablation study confirms monotonicity’s critical role in performance enhancement.
Conclusion: Incorporating monotonicity constraints in utility surrogate models significantly improves Bayesian Optimization with Preference Exploration, with neural network ensembles effectively handling pairwise comparisons and demonstrating superior performance.
Abstract: Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning allows the search to be focused on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic. In this paper, we address the Bayesian Optimization with Preference Exploration (BOPE) problem and propose using a neural network ensemble as a utility surrogate model. This approach naturally integrates monotonicity and supports pairwise comparison data. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches and exhibits robustness to noise in utility evaluations. An ablation study highlights the critical role of monotonicity in enhancing performance.
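One standard way to build such a monotonic surrogate is to constrain all effective weights to be positive, which makes the network non-decreasing in every objective; pairwise comparisons then train it with a Bradley-Terry style loss. The sketch below shows one ensemble member under these assumptions; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMLP(nn.Module):
    """Utility surrogate that is non-decreasing in each input: raw weights pass
    through softplus, so every effective weight stays positive."""
    def __init__(self, n_objectives, hidden=32):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(hidden, n_objectives) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(1, hidden) * 0.1)

    def forward(self, y):
        h = torch.tanh(y @ F.softplus(self.w1).T + self.b1)  # increasing activation
        return h @ F.softplus(self.w2).T

def pairwise_logistic_loss(model, better, worse):
    """Bradley-Terry loss for 'better is preferred to worse' comparison pairs."""
    margin = model(better) - model(worse)
    return F.softplus(-margin).mean()
```

An ensemble is then obtained by training several such networks from different initializations and averaging their outputs, which also provides an uncertainty signal for the acquisition step.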
[400] Privacy-Preserving Dataset Combination
Keren Fuentes, Mimee Xu, Irene Chen
Main category: cs.LG
TL;DR: SecureKL is a privacy-preserving protocol that enables organizations to evaluate external datasets’ utility without data leakage, addressing privacy concerns in data sharing.
Details
Motivation: Privacy concerns and competitive interests limit data sharing, especially disadvantaging smaller organizations that cannot privately assess external data's utility before sharing.
Method: SecureKL uses secure computation to perform dataset divergence metrics internally with zero privacy leakage, without assuming downstream models.
Result: On real-world data, SecureKL achieves >90% correlation with non-private counterparts and successfully identifies beneficial data collaborations in heterogeneous domains like healthcare and income prediction.
Conclusion: Secure computation maximizes data utilization and outperforms privacy-agnostic utility assessments that leak information.
Abstract: Access to diverse, high-quality datasets is crucial for machine learning model performance, yet data sharing remains limited by privacy concerns and competitive interests, particularly in regulated domains like healthcare. This dynamic especially disadvantages smaller organizations that lack resources to purchase data or negotiate favorable sharing agreements, due to the inability to privately assess external data’s utility. To resolve privacy and uncertainty tensions simultaneously, we introduce SecureKL, the first secure protocol for dataset-to-dataset evaluations with zero privacy leakage, designed to be applied preceding data sharing. SecureKL evaluates a source dataset against candidates, performing dataset divergence metrics internally with private computations, all without assuming downstream models. On real-world data, SecureKL achieves high consistency (>90% correlation with non-private counterparts) and successfully identifies beneficial data collaborations in highly heterogeneous domains (ICU mortality prediction across hospitals and income prediction across states). Our results highlight that secure computation maximizes data utilization, outperforming privacy-agnostic utility assessments that leak information.
[401] Predicting gene essentiality and drug response from perturbation screens in preclinical cancer models with LEAP: Layered Ensemble of Autoencoders and Predictors
Barbara Bodinier, Gaetan Dissez, Lucile Ter-Minassian, Linus Bleistein, Roberta Codato, John Klein, Eric Durand, Antonin Dauvin
Main category: cs.LG
TL;DR: LEAP framework improves drug discovery prediction by combining multiple autoencoders and predictors, achieving better performance and interpretability in perturbation response modeling.
Details
Motivation: Existing predictive models for high-throughput preclinical screens suffer from limited reproducibility, generalizability, and interpretability, hindering their utility in drug discovery.
Method: Layered Ensemble of Autoencoders and Predictors (LEAP) aggregates predictions from multiple regressors trained using diverse gene expression representation models, with perturbation-specific LASSO regressors providing optimal balance.
Result: LEAP consistently improves prediction performances in unscreened cell lines across modeling strategies, achieving near state-of-the-art performance with low computation time.
Conclusion: LEAP has potential to accelerate drug discovery by guiding preclinical experiment prioritization and providing biological mechanism insights, with publicly available code and datasets.
Abstract: High-throughput preclinical perturbation screens, where the effects of genetic, chemical, or environmental perturbations are systematically tested on disease models, hold significant promise for machine learning-enhanced drug discovery due to their scale and causal nature. Predictive models trained on such datasets can be used to (i) infer perturbation response for previously untested disease models, and (ii) characterise the biological context that affects perturbation response. Existing predictive models suffer from limited reproducibility, generalisability and interpretability. To address these issues, we introduce a framework of Layered Ensemble of Autoencoders and Predictors (LEAP), a general and flexible ensemble strategy to aggregate predictions from multiple regressors trained using diverse gene expression representation models. LEAP consistently improves prediction performances in unscreened cell lines across modelling strategies. In particular, LEAP applied to perturbation-specific LASSO regressors (PS-LASSO) provides a favorable balance between near state-of-the-art performance and low computation time. We also propose an interpretability approach combining model distillation and stability selection to identify important biological pathways for perturbation response prediction in LEAP. Our models have the potential to accelerate the drug discovery pipeline by guiding the prioritisation of preclinical experiments and providing insights into the biological mechanisms involved in perturbation response. The code and datasets used in this work are publicly available.
[402] Hydra: A Modular Architecture for Efficient Long-Context Reasoning
Siddharth Chaudhary, Dev Patel, Maheep Chaudhary, Bennett Browning
Main category: cs.LG
TL;DR: Hydra is a modular architecture using state-space backbone that combines sparse global attention, mixture-of-experts, and dual memories to achieve significant efficiency gains and accuracy improvements over transformers.
Details
Motivation: The quadratic complexity of transformers limits deployment in resource-constrained and long-context settings, requiring more efficient reasoning systems.
Method: Hydra uses a state-space backbone with adaptive routing between sparse global attention, mixture-of-experts, and dual memories (reasoning workspace and product key memory).
Result: Achieves 3.01× throughput gains on synthetic data and 3.0× on WikiText at 8K tokens, plus 10× accuracy improvements on multi-step logical composition compared to equal-sized transformers.
Conclusion: Each component contributes effectively: sparse attention captures long-range dependencies, experts specialize to domains, and product key memory enables selective retrieval.
Abstract: The quadratic complexity of transformers fundamentally limits reasoning system deployment in resource-constrained and long-context settings. We introduce Hydra, a modular architecture based upon a state-space backbone which adaptively routes between complementary efficiency mechanisms: sparse global attention, mixture-of-experts, and dual memories comprising a reasoning workspace and product key memory. We evaluate a 29M parameter model measuring logical chaining accuracy and throughput on synthetic sequences, plus throughput on WikiText. Ablation studies use component-specific synthetic datasets to isolate individual mechanisms. Hydra achieves $3.01\times$ and $3.0\times$ throughput gains at 8K tokens for synthetic and WikiText datasets, respectively, and $10\times$ accuracy improvements on multi-step logical composition compared to equal-sized transformers. Ablations confirm each component’s contribution: sparse attention captures long-range dependencies, experts specialize to input domains, and product key memory enables selective retrieval.
[403] All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell
Main category: cs.LG
TL;DR: Two-stage RL fine-tuning (reward model + RL) outperforms direct offline optimization due to generation-verification gap - simple reward models are easy to learn, and RL only searches over policies optimal for these simple verifiers, requiring less data.
Details
Motivation: Explain why complex two-stage RL fine-tuning outperforms simpler direct offline optimization despite information loss through reward modeling.
Method: Theoretical and empirical analysis of hypotheses about RL’s value in fine-tuning, focusing on generation-verification gap.
Result: Most support for hypothesis that simple reward models are easy to learn, and RL only searches over policies optimal for these simple verifiers, reducing search space.
Conclusion: Two-stage online fine-tuning requires less data than offline FT because it searches over a reduced policy subset via simple verifiers.
Abstract: From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on said dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, (1) it is relatively easy to learn the relatively simple RM (verifier) from the preference data. Then, (2) the downstream RL procedure only returns policies (generators) that are optimal for such relatively simple verifiers. Thus, end-to-end, two-stage online FT only has to search over a reduced subset of the full space of policies, requiring less data than offline FT.
[404] GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping
Qifu Wen, Xi Zeng, Zihan Zhou, Shuaijun Liu, Mehdi Hosseinzadeh, Ningxin Su, Reza Rawassizadeh
Main category: cs.LG
TL;DR: GradES is a gradient-based early stopping method that individually stops parameter updates in transformer components when their gradient changes fall below a threshold, eliminating costly validation passes while improving training efficiency and generalization.
Details
Motivation: Traditional early stopping requires computationally expensive validation inference for large transformers, and halts all parameters simultaneously despite different components converging at varying rates.
Method: Track gradient change magnitudes in transformer components (attention projections and FFN matrices) during backpropagation, and individually exclude matrices from updates when their gradient changes fall below threshold τ.
Result: Speeds up training by 1.57-7.22× while improving accuracy by 1.2% on language tasks and 3.88% on multimodal benchmarks.
Conclusion: GradES enables efficient component-wise early stopping that reduces computational costs while preventing overfitting and enhancing generalization performance.
Abstract: Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose \textit{GradES}, a novel gradient-based early stopping approach that operates within transformer components (attention projections and Feed-Forward layer matrices). We found that different components converge at varying rates during fine-tuning for both language and vision-language models. \textit{GradES} tracks the magnitude of gradient changes in backpropagation for these matrices during training. When a projection matrix’s magnitude of gradient changes fall below a convergence threshold $\tau$, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slow converging matrices to continue learning. \textit{GradES} speeds up training time by 1.57–7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy in language tasks and 3.88% on multimodal benchmarks.
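The core rule is simple to sketch: track a per-matrix gradient statistic during backpropagation and freeze a matrix once its change falls below τ, so no validation pass is needed. The helper below is a schematic PyTorch version using the mean absolute gradient; the paper's exact tracking statistic may differ.

```python
import torch.nn as nn

class GradES:
    """Component-wise early stopping: freeze a weight matrix once the change
    in its mean absolute gradient drops below tau."""
    def __init__(self, model: nn.Module, tau: float = 1e-4):
        self.tau = tau
        self.prev = {name: None for name, _ in model.named_parameters()}

    def maybe_freeze(self, model: nn.Module):
        """Call after loss.backward() and before optimizer.step()."""
        for name, p in model.named_parameters():
            if not p.requires_grad or p.grad is None:
                continue
            g = p.grad.abs().mean().item()
            prev = self.prev[name]
            if prev is not None and abs(g - prev) < self.tau:
                p.requires_grad_(False)  # exclude only this matrix from updates
                p.grad = None            # slower-converging matrices keep learning
            self.prev[name] = g
```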
[405] Rethinking Robustness in Machine Learning: A Posterior Agreement Approach
João Borges S. Carvalho, Victor Jimenez Rodriguez, Alessandro Torcinovich, Antonio E. Cinà, Carlos Cotrini, Lea Schönherr, Joachim M. Buhmann
Main category: cs.LG
TL;DR: Proposes a principled robustness assessment framework based on Posterior Agreement theory for evaluating ML algorithms under covariate shifts, showing higher discriminability than accuracy-based measures without requiring supervision.
Details
Motivation: Current robustness evaluation methods lack theoretical justification and rely on task performance measures like accuracy, highlighting the need for a principled foundation for robustness assessment under distribution shifts.
Method: Extends the Posterior Agreement framework to covariate shift settings and proposes a novel robustness measure, evaluated in controlled environments and empirical analysis of adversarial learning and domain generalization scenarios.
Result: PA provides reliable analysis of algorithm vulnerabilities across different shift conditions, offers higher discriminability than accuracy-based measures, and requires no supervision.
Conclusion: The Posterior Agreement framework offers a sound and principled approach for robustness assessment under covariate shifts, addressing limitations of current evaluation methods.
Abstract: The robustness of algorithms against covariate shifts is a fundamental problem with critical implications for the deployment of machine learning algorithms in the real world. Current evaluation methods predominantly measure robustness through the lens of standard generalization, relying on task performance measures like accuracy. This approach lacks a theoretical justification and underscores the need for a principled foundation of robustness assessment under distribution shifts. In this work, we set the desiderata for a robustness measure, and we propose a novel principled framework for the robustness assessment problem that directly follows the Posterior Agreement (PA) theory of model validation. Specifically, we extend the PA framework to the covariate shift setting and propose a measure for robustness evaluation. We assess the soundness of our measure in controlled environments and through an empirical robustness analysis in two different covariate shift scenarios: adversarial learning and domain generalization. We illustrate the suitability of PA by evaluating several models under different nature and magnitudes of shift, and proportion of affected observations. The results show that PA offers a reliable analysis of the vulnerabilities in learning algorithms across different shift conditions and provides higher discriminability than accuracy-based measures, while requiring no supervision.
[406] Lookup multivariate Kolmogorov-Arnold Networks
Sergey Pozdnyakov, Philippe Schwaller
Main category: cs.LG
TL;DR: lmKANs replace linear layers with trainable low-dimensional multivariate functions using spline lookup tables, achieving up to 6x FLOPs reduction while maintaining MLP-level flexibility.
Details
Motivation: Linear layers dominate parameter count and computational cost in deep learning models, creating a need for more efficient alternatives.
Method: Express high-dimensional mappings through trainable low-dimensional multivariate functions implemented as spline lookup tables, requiring only a few multiplications to compute.
Result: 6.0x FLOPs reduction in inference, 10x higher H100 throughput on tabular data, 1.6-2.1x FLOPs reduction in CNNs on CIFAR-10 and ImageNet-1k while maintaining accuracy.
Conclusion: lmKANs provide a superior trade-off between capacity and inference cost compared to traditional linear layers, making them effective drop-in replacements.
Abstract: High-dimensional linear mappings, or linear layers, dominate both the parameter count and the computational cost of most modern deep-learning models. We introduce a general-purpose drop-in replacement, lookup multivariate Kolmogorov-Arnold Networks (lmKANs), which deliver a substantially better trade-off between capacity and inference cost. Our construction expresses a general high-dimensional mapping through trainable low-dimensional multivariate functions. These functions can carry dozens or hundreds of trainable parameters each, and yet it takes only a few multiplications to compute them because they are implemented as spline lookup tables. Empirically, lmKANs reduce inference FLOPs by up to 6.0x while matching the flexibility of MLPs in general high-dimensional function approximation. In another feedforward fully connected benchmark, on the tabular-like dataset of randomly displaced methane configurations, lmKANs enable more than 10x higher H100 throughput at equal accuracy. Within frameworks of Convolutional Neural Networks, lmKAN-based CNNs cut inference FLOPs at matched accuracy by 1.6-2.1x and by 1.7x on the CIFAR-10 and ImageNet-1k datasets, respectively. Our code, including dedicated CUDA kernels, is available online at https://github.com/schwallergroup/lmkan.
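The central primitive is a trainable function stored as a lookup table, so evaluation costs an index computation plus an interpolation rather than a dense matrix product. For brevity, the NumPy sketch below shows the one-dimensional linearly-interpolated case; lmKANs use low-dimensional multivariate spline tables, and the class name is illustrative.

```python
import numpy as np

class LookupFunction1D:
    """Trainable 1-D function on a uniform grid: evaluation needs only an index
    plus one linear interpolation, regardless of the number of table entries."""
    def __init__(self, n_knots=64, lo=-3.0, hi=3.0):
        self.lo, self.hi = lo, hi
        self.values = np.random.randn(n_knots) * 0.1  # the trainable parameters
        self.step = (hi - lo) / (n_knots - 1)

    def __call__(self, x):
        t = np.clip((x - self.lo) / self.step, 0, len(self.values) - 1 - 1e-9)
        i = t.astype(int)        # left knot index
        frac = t - i             # position within the cell
        return (1 - frac) * self.values[i] + frac * self.values[i + 1]

f = LookupFunction1D()
print(f(np.array([-1.0, 0.0, 2.5])))  # three interpolated values
```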
[407] Neural Mean-Field Games: Extending Mean-Field Game Theory with Neural Stochastic Differential Equations
Anna C. M. Thöni, Yoram Bachrach, Tal Kachman
Main category: cs.LG
TL;DR: Neural mean-field games combine mean-field game theory with neural SDEs to create a data-driven, model-free approach for solving complex games with large populations, overcoming limitations of traditional PDE-based methods.
Details
Motivation: Traditional mean-field game theory relies on analytical solutions through PDE systems, which are model-dependent, may lose solution uniqueness, and suffer from modeling bias. The goal is to reduce model dependency and enable more flexible learning of strategic interactions.
Method: Proposes neural mean-field games that integrate mean-field game theory with deep learning via neural stochastic differential equations. Uses automatic differentiation for robustness and objectivity, making it data-driven and lightweight.
Result: Successfully solves two mean-field games of varying complexity, observability, and noise levels. Demonstrates efficiency in learning strategic interactions hard to capture with pure mean-field theory. Shows robustness in simulating viral dynamics using real-world data.
Conclusion: The neural mean-field game approach is flexible, generalizable, requires few observations to learn underlying distributions, and accurately models complex phenomena like epidemic outbreaks from real data.
Abstract: Mean-field game theory relies on approximating games that are intractable to model due to a very large to infinite population of players. While these kinds of games can be solved analytically via the associated system of partial differential equations, this approach is not model-free, can lead to the loss of the existence or uniqueness of solutions, and may suffer from modelling bias. To reduce the dependency between the model and the game, we introduce neural mean-field games: a combination of mean-field game theory and deep learning in the form of neural stochastic differential equations. The resulting model is data-driven, lightweight, and can learn extensive strategic interactions that are hard to capture using mean-field theory alone. In addition, the model is based on automatic differentiation, making it more robust and objective than approaches based on finite differences. We highlight the efficiency and flexibility of our approach by solving two mean-field games that vary in their complexity, observability, and the presence of noise. Lastly, we illustrate the model’s robustness by simulating viral dynamics based on real-world data. Here, we demonstrate that the model’s ability to learn from real-world data helps to accurately model the evolution of an epidemic outbreak. Using these results, we show that the model is flexible, generalizable, and requires few observations to learn the distribution underlying the data.
[408] The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi
Main category: cs.LG
TL;DR: DPH-RL uses mass-covering f-divergences to preserve diversity in RL fine-tuning, solving the paradox where Pass@1 improves but Pass@k degrades due to catastrophic forgetting.
Details
Motivation: Address the paradox where RL fine-tuning improves single-attempt accuracy (Pass@1) but degrades multi-attempt performance (Pass@k) due to catastrophic forgetting and lack of knowledge retention mechanisms.
Method: Propose DPH-RL framework using mass-covering f-divergences (forward-KL, JS-divergence) as rehearsal mechanism by continuously referencing initial policy to maintain broad solution coverage.
Result: DPH-RL resolves Pass@k degradation and improves both Pass@1 and Pass@k in- and out-of-domain, while being more training-efficient by computing f-divergence using generator functions without online reference model.
Conclusion: Proper selection of divergence measure is a powerful tool for building more general and diverse reasoning models, highlighting an overlooked axis for improving RLVR.
Abstract: A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives – both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely – lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a rehearsal mechanism. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Extensive experiments on math and SQL generation demonstrate that DPH-RL not only resolves the Pass@k degradation but improves both Pass@1 and Pass@k in- and out-of-domain. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
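The "no online reference model" point follows from the direction of the divergence: forward KL is an expectation under the frozen initial policy, so it can be estimated from a one-time sample of that policy. A minimal PyTorch sketch, assuming the current policy exposes a log_prob method:

```python
import torch

def forward_kl_penalty(policy, ref_samples, ref_logprobs):
    """Mass-covering rehearsal term KL(pi_0 || pi_theta), estimated on samples
    drawn once from the frozen initial policy pi_0."""
    logp_theta = policy.log_prob(ref_samples)  # current policy's log-probs
    return (ref_logprobs - logp_theta).mean()  # E_{pi_0}[log pi_0 - log pi_theta]

# Toy usage with distributions standing in for policies.
pi0 = torch.distributions.Normal(0.0, 1.0)
x = pi0.sample((4096,))
pi_theta = torch.distributions.Normal(0.5, 1.0)
print(forward_kl_penalty(pi_theta, x, pi0.log_prob(x)))  # ~0.125 analytically
```

Because the penalty is an average under pi_0, gradients flow only through logp_theta, and the mass-covering behavior follows: pi_theta is penalized wherever pi_0 places probability, which preserves solution coverage.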
[409] RadioDiff-$k^2$: Helmholtz Equation Informed Generative Diffusion Model for Multi-Path Aware Radio Map Construction
Xiucheng Wang, Qiming Zhang, Nan Cheng, Ruijin Sun, Zan Li, Shuguang Cui, Xuemin Shen
Main category: cs.LG
TL;DR: RadioDiff-k² is a physics-informed generative learning approach for multipath-aware radio map construction that uses dual diffusion models guided by the Helmholtz equation to accurately model EM singularities.
Details
Motivation: Future wireless communication requires environment-aware paradigms, making accurate radio map construction crucial but challenging. Conventional EM-based methods have high computational overhead, while existing neural network approaches lack sufficient physics consideration for modeling EM singularities in complex multipath environments.
Method: Proposes a dual diffusion model framework: one DM infers EM singularities (corresponding to negative wave numbers in Helmholtz equation), and another DM reconstructs the complete radio map using these singularities and environmental context. The approach is explicitly guided by the Helmholtz equation governing EM wave propagation.
Result: Achieves state-of-the-art performance in both image-level radio map construction and localization tasks, with inference latency within a few hundred milliseconds.
Conclusion: RadioDiff-k² provides an accurate and efficient solution for multipath-aware radio map construction by integrating physics principles with generative learning, overcoming limitations of both conventional EM methods and pure neural network approaches.
Abstract: In this paper, we propose a novel physics-informed generative learning approach, named RadioDiff-$k^2$, for accurate and efficient multipath-aware radio map (RM) construction. As future wireless communication evolves towards environment-aware paradigms, the accurate construction of RMs becomes crucial yet highly challenging. Conventional electromagnetic (EM)-based methods, such as full-wave solvers and ray-tracing approaches, exhibit substantial computational overhead and limited adaptability to dynamic scenarios. Although existing neural network (NN) approaches have efficient inference speed, they lack sufficient consideration of the underlying physics of EM wave propagation, limiting their effectiveness in accurately modeling critical EM singularities induced by complex multipath environments. To address these fundamental limitations, we propose a novel physics-inspired RM construction method guided explicitly by the Helmholtz equation, which inherently governs EM wave propagation. Specifically, based on the analysis of partial differential equations (PDEs), we theoretically establish a direct correspondence between EM singularities, which correspond to the critical spatial features influencing wireless propagation, and regions defined by negative wave numbers in the Helmholtz equation. We then design an innovative dual diffusion model (DM)-based large artificial intelligence framework comprising one DM dedicated to accurately inferring EM singularities and another DM responsible for reconstructing the complete RM using these singularities along with environmental contextual information. Experimental results demonstrate that the proposed RadioDiff-$k^2$ framework achieves state-of-the-art (SOTA) performance in both image-level RM construction and localization tasks, while maintaining inference latency within a few hundred milliseconds.
[410] CoUn: Empowering Machine Unlearning via Contrastive Learning
Yasser H. Khalil, Mehdi Setayesh, Hongliang Li
Main category: cs.LG
TL;DR: CoUn is a machine unlearning framework that uses contrastive learning and supervised learning on retain data to effectively remove forget data influence while preserving model performance on retain data.
Details
Motivation: Existing machine unlearning methods based on label manipulation or model weight perturbations achieve limited unlearning effectiveness, so a more effective approach is needed.
Method: CoUn adjusts learned data representations through contrastive learning and supervised learning applied only to retain data, leveraging semantic similarity between samples to indirectly adjust forget representations and maintaining retain representations within their clusters.
Result: Extensive experiments show CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness, and integrating its contrastive learning module into existing baselines improves their performance.
Conclusion: CoUn provides an effective machine unlearning framework that successfully removes forget data influence while preserving retain data knowledge, with its contrastive learning approach being transferable to enhance other methods.
Abstract: Machine unlearning (MU) aims to remove the influence of specific “forget” data from a trained model while preserving its knowledge of the remaining “retain” data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines empowers their unlearning effectiveness.
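As a rough sketch of the contrastive ingredient, a supervised contrastive loss applied only to retain embeddings pulls same-class retain samples together; forget representations, receiving no direct loss, drift toward whichever retain clusters they resemble. This is a generic SupCon-style loss, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over retain embeddings z with class labels."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau
    sim.fill_diagonal_(-1e9)                  # exclude self-similarity
    pos = labels[:, None] == labels[None, :]  # same-class pairs are positives
    pos.fill_diagonal_(False)
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    return -log_prob[pos].mean()              # simplified: mean over all positive pairs
```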
[411] Euclidean Distance Matrix Completion via Asymmetric Projected Gradient Descent
Yicheng Li, Xinghua Sun
Main category: cs.LG
TL;DR: APGD algorithm for Euclidean Distance Matrix Completion achieves global convergence with exact recovery using O(μ²r³κ²n log n) random observations, but performs poorly in limited sample scenarios compared to standard non-convex approaches.
Details
Motivation: To develop a gradient-type algorithm based on Burer-Monteiro factorization for reconstructing point set configurations from partial Euclidean distance measurements, addressing the EDMC problem with improved theoretical guarantees.
Method: Asymmetric Projected Gradient Descent (APGD) algorithm based on Burer-Monteiro factorization, analyzed using incoherence matrix completion framework without requiring sample splitting.
Result: Global convergence guarantee with exact recovery established for O(μ²r³κ²n log n) Bernoulli random observations. Shows exact linear convergence in rich-sample regions but deteriorates rapidly in limited sample scenarios.
Conclusion: APGD matches theoretical predictions but reveals weakened implicit regularization and requires substantially more samples than information-theoretic limits suggest, indicating limitations in practical applications with limited data.
Abstract: This paper proposes and analyzes a gradient-type algorithm based on Burer-Monteiro factorization, called the Asymmetric Projected Gradient Descent (APGD), for reconstructing the point set configuration from partial Euclidean distance measurements, known as the Euclidean Distance Matrix Completion (EDMC) problem. By paralleling the incoherence matrix completion framework, we show for the first time that a global convergence guarantee with exact recovery of this routine can be established given $\mathcal{O}(\mu^2 r^3 \kappa^2 n \log n)$ Bernoulli random observations without any sample splitting. Unlike leveraging the tangent space Restricted Isometry Property (RIP) and local curvature of the low-rank embedding manifold in some very recent works, our proof provides extra upper bounds that act as analogies of the random graph lemma under the EDMC setting. APGD works surprisingly well: numerical experiments demonstrate exact linear convergence behavior in rich-sample regions, yet its performance deteriorates rapidly when the sample size is limited, compared with the performance obtained by optimizing the s-stress function, i.e., the standard but unexplained non-convex approach for EDMC. While virtually matching our theoretical prediction, this unusual phenomenon might indicate that: (i) the power of implicit regularization is weakened when specified in the APGD case; (ii) the stabilization of such new gradient direction requires substantially more samples than the information-theoretic limit would suggest.
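A plain (unprojected, symmetric) Burer-Monteiro baseline for EDMC makes the setting concrete: parameterize the points X directly and run gradient descent on the squared residual over observed distances. The paper's APGD adds asymmetric factors and projections not reproduced in this hedged sketch.

```python
import numpy as np

def edmc_gradient_descent(D_obs, mask, n, r=2, lr=0.01, iters=2000, seed=0):
    """Fit X in R^{n x r} so squared pairwise distances match observed entries."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, r))
    for _ in range(iters):
        G = X @ X.T
        d = np.diag(G)
        D = d[:, None] + d[None, :] - 2 * G          # current squared distances
        R = mask * (D - D_obs)                       # residual on observed pairs only
        grad = 4 * (np.diag(R.sum(axis=1)) - R) @ X  # d/dX of 0.5 * ||R||_F^2
        X -= lr * grad
    return X

# Toy check: 20 planar points, roughly 60% of pairwise distances observed.
rng = np.random.default_rng(1)
P = rng.normal(size=(20, 2))
D_true = ((P[:, None] - P[None]) ** 2).sum(-1)
mask = np.triu(rng.random((20, 20)) < 0.6, 1)
mask = (mask | mask.T).astype(float)
X = edmc_gradient_descent(D_true, mask, n=20)
```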
[412] MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu
Main category: cs.LG
TL;DR: MegaScale-MoE is a production system that improves training efficiency for large-scale mixture-of-experts (MoE) models through optimized communication strategies, achieving 1.88× better performance than Megatron-LM.
Details
Motivation: Existing MoE training systems suffer from efficiency degradation as model sizes increase and hardware evolves, highlighting the need for better communication optimization in MoE training.
Method: Customizes communication-efficient parallelism for attention and FFNs in MoE layers, overlaps communication with computation at inter- and intra-operator levels, and applies communication compression with adjusted patterns to lower precision.
Result: Achieves training throughput of 1.41M tokens/s when training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, improving efficiency by 1.88× compared to Megatron-LM.
Conclusion: The system demonstrates effective acceleration of MoE training and provides valuable insights for future MoE system research.
Abstract: We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.
[413] Tequila: Trapping-free Ternary Quantization for Large Language Models
Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, Dapeng Wu
Main category: cs.LG
TL;DR: Tequila is a novel ternary weight quantization method that addresses deadzone trapping by repurposing trapped weights as dynamic biases, achieving near full-precision performance with 3x inference speedup and minimal overhead.
Details
Motivation: Current ternary quantization methods suffer from significant accuracy degradation due to deadzone trapping, where weights get stuck at boundaries and receive noisy gradients, limiting model capacity and optimization.
Method: Proposes Tequila which reactivates deadzone-trapped weights by converting them into dynamic biases, enabling continuous forward signals and meaningful gradient signals during backpropagation with nearly zero inference overhead.
Result: Outperforms SOTA ternary quantization methods across five benchmarks, achieving >4% accuracy gain on ARC benchmark and nearly matching full-precision performance (<1% gap) with 3.0x inference speedup.
Conclusion: Tequila provides a highly practical and efficient solution for deploying advanced LLMs in resource-constrained environments by overcoming deadzone trapping in ternary quantization.
Abstract: Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them impractical. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as deadzone trapping: a large number of weights are trapped at the deadzone boundary. This occurs because these weights receive only noisy, uninformative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose Tequila, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly zero inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves >4% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within <1% gap) with a 3.0x inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for the deployment of advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim.
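For context, the deadzone that traps weights appears already in standard ternary quantization: everything inside [-Δ, Δ] is zeroed and stops contributing to the forward pass. The sketch below shows that baseline and marks the trapped set; Tequila's repurposing of those weights as dynamic biases is indicated only in the comments, not implemented.

```python
import torch

def ternarize(W, delta_scale=0.7):
    """Standard ternary quantization with a deadzone threshold."""
    delta = delta_scale * W.abs().mean()  # common threshold heuristic
    mask = W.abs() > delta                # weights outside the deadzone
    alpha = W[mask].abs().mean()          # per-tensor scale factor
    W_t = alpha * torch.sign(W) * mask    # values in {-alpha, 0, +alpha}
    # Tequila (schematically): instead of zeroing the deadzone weights, fold
    # their contribution into a trainable bias so they keep providing forward
    # signal and receiving informative gradients.
    return W_t, ~mask                     # quantized weights, trapped set

W = torch.randn(256, 256)
W_t, trapped = ternarize(W)
print(W_t.unique().numel(), trapped.float().mean().item())  # 3 levels, trapped fraction
```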
[414] msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML
Zhaolan Huang, Emmanuel Baccelli
Main category: cs.LG
TL;DR: msf-CNN is a novel patch-based fusion technique that optimizes CNN inference on microcontrollers by exploring fusion solution spaces as directed acyclic graphs, achieving 50% RAM reduction compared to prior methods.
Details
Motivation: To enable AI models to run on memory-constrained microcontrollers (MCUs) with tiny memory budgets (e.g., 128kB RAM) while maintaining real-time inference latency requirements.Method: Introduces msf-CNN, which efficiently finds optimal fusion settings for CNNs by walking through the fusion solution space represented as a directed acyclic graph, identifying a wider set of solutions than previous work.
Result: Achieves 50% less RAM usage compared to prior art (MCUNetV2 and StreamNet) while maintaining inference performance, with implementation running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32).
Conclusion: msf-CNN offers additional flexibility for system designers by providing more efficient memory usage for CNN inference on resource-constrained MCUs.
Abstract: AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are decisive for fitting within an MCU’s tiny memory budget, e.g., 128 kB of RAM. However, inference latency must remain small to fit real-time constraints. An approach to tackle this is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We published an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32). We show that msf-CNN can achieve inference using 50% less RAM compared to the prior art (MCUNetV2 and StreamNet). We thus demonstrate how msf-CNN offers additional flexibility for system designers.
[415] Ascent Fails to Forget
Ioannis Mavrothalassitis, Pol Puigdemont, Noam Itzhak Levi, Volkan Cevher
Main category: cs.LG
TL;DR: Gradient ascent-based unlearning methods often fail due to statistical dependence between forget and retain datasets, causing performance degradation and divergence from retrained models.
Details
Motivation: To challenge the misconception that forget and retain datasets can be independently manipulated during unlearning, and to investigate why gradient ascent methods frequently fail in practice.Method: Empirical and theoretical analysis of gradient ascent-based unlearning, including logistic regression examples and experiments on neural networks to study the effects of statistical dependence.
Result: Statistical dependence between datasets causes gradient ascent methods to diverge from retrained models, potentially producing worse solutions than the original model, with models getting trapped in inferior local minima.
Conclusion: Statistical dependencies between forget and retain datasets, even simple correlations, are sufficient to cause gradient ascent-based unlearning methods to fail, making them potentially detrimental rather than beneficial.
Abstract: Contrary to common belief, we show that gradient ascent-based unconstrained optimization methods frequently fail to perform machine unlearning, a phenomenon we attribute to the inherent statistical dependence between the forget and retain data sets. This dependence, which can manifest itself even as simple correlations, undermines the misconception that these sets can be independently manipulated during unlearning. We provide empirical and theoretical evidence showing these methods often fail precisely due to this overlooked relationship. For random forget sets, this dependence means that degrading forget set metrics (which, for a retrained model, should mirror test set metrics) inevitably harms overall test performance. Going beyond random sets, we consider logistic regression as an instructive example where a critical failure mode emerges: inter-set dependence causes gradient descent-ascent iterations to progressively diverge from the ideal retrained model. Strikingly, these methods can converge to solutions that are not only far from the retrained ideal but are potentially even further from it than the original model itself, rendering the unlearning process actively detrimental. A toy example further illustrates how this dependence can trap models in inferior local minima, inescapable via finetuning. Our findings highlight that the presence of such statistical dependencies, even when manifest only as correlations, can be sufficient for ascent-based unlearning to fail. Our theoretical insights are corroborated by experiments on complex neural networks, demonstrating that these methods do not perform as expected in practice due to this unaddressed statistical interplay.
[416] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan
Main category: cs.LG
TL;DR: Muddit is a unified discrete diffusion transformer that enables fast parallel generation across text and image modalities by integrating pretrained visual priors with a lightweight text decoder.
Details
Motivation: Address limitations of existing unified generation models: autoregressive models suffer from slow sequential decoding, while non-autoregressive models have weak generalization due to limited pretrained backbones.Method: Uses unified discrete diffusion transformer architecture that integrates strong visual priors from pretrained text-to-image backbone with lightweight text decoder for flexible multimodal generation.
Result: Achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency across text and image generation tasks.
Conclusion: Demonstrates that purely discrete diffusion, when equipped with strong visual priors, serves as a scalable and effective backbone for unified generation across modalities.
Abstract: Unified generation models aim to handle diverse tasks across modalities – such as text generation, image generation, and vision-language reasoning – within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
[417] Neural Diffusion Processes for Physically Interpretable Survival Prediction
Alessio Cristofoletto, Cesare Rollo, Giovanni Birolo, Piero Fariselli
Main category: cs.LG
TL;DR: DeepFHT combines deep neural networks with first hitting time distributions from stochastic processes to model survival data, achieving competitive accuracy while maintaining interpretable physics-based parameters.
Details
Motivation: To develop a survival analysis framework that captures time-varying risk without assuming proportional hazards, while providing interpretable parameters that elucidate feature-risk relationships.Method: Couples deep neural networks with first hitting time distributions, representing time-to-event as the first passage of a latent diffusion process to an absorbing boundary. Neural networks map inputs to physically meaningful parameters like initial condition, drift, and diffusion.
Result: Achieves predictive accuracy comparable to state-of-the-art approaches while maintaining physics-based interpretable parameterization that reveals relationships between input features and risk.
Conclusion: The combination of stochastic process theory and deep learning provides a principled approach for modeling survival phenomena in complex systems, offering both accuracy and interpretability.
Abstract: We introduce DeepFHT, a survival-analysis framework that couples deep neural networks with first hitting time (FHT) distributions from stochastic process theory. Time to event is represented as the first passage of a latent diffusion process to an absorbing boundary. A neural network maps input variables to physically meaningful parameters including initial condition, drift, and diffusion, within a chosen FHT process such as Brownian motion, with or without drift. This yields closed-form survival and hazard functions and captures time-varying risk without assuming proportional hazards. We compare DeepFHT with Cox regression using synthetic and real-world datasets. The method achieves predictive accuracy on par with state-of-the-art approaches, while maintaining a physics-based interpretable parameterization that elucidates the relation between input features and risk. This combination of stochastic process theory and deep learning provides a principled avenue for modeling survival phenomena in complex systems.
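For concreteness, the closed form alluded to above: for Brownian motion with drift $\mu$ and diffusion $\sigma$, started a distance $a > 0$ from an absorbing boundary, the FHT density is the inverse Gaussian (the paper's exact parameterization is not given in the abstract):

$$
f(t) \;=\; \frac{a}{\sigma\sqrt{2\pi t^{3}}}\,\exp\!\left(-\frac{(a-\mu t)^{2}}{2\sigma^{2} t}\right), \qquad t > 0,
$$

with survival $S(t) = 1 - \int_0^t f(s)\,ds$ likewise expressible in closed form via Gaussian CDFs; the driftless case $\mu = 0$ recovers the Lévy density.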
[418] Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
Hen Davidov, Shai Feldman, Gilad Freidkin, Yaniv Romano
Main category: cs.LG
TL;DR: The paper introduces time-to-unsafe-sampling as a new safety metric for LLMs and proposes a conformal prediction-based method to estimate it with statistical guarantees, addressing the challenge of rare unsafe outputs.
Details
Motivation: Existing safety evaluation methods for generative models lack prompt-adaptive metrics that quantify how quickly models can produce unsafe content, which is challenging due to the rarity of unsafe outputs in well-aligned models.Method: The authors frame the problem as survival analysis and develop a conformal prediction-based calibration technique to construct lower predictive bounds on time-to-unsafe-sampling, with an optimized sampling-budget allocation scheme for improved efficiency.
Result: Experiments on synthetic and real data validate the theoretical results and demonstrate the method’s practical utility for safety risk assessment in generative AI models.
Conclusion: The proposed approach provides a rigorous, distribution-free method for quantifying safety risks in LLMs through time-to-unsafe-sampling estimation with coverage guarantees.
Abstract: We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
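As a reference point, the uncensored one-sided split-conformal bound underlying such constructions takes only a few lines; the paper's contribution lies in handling censoring (unsafe outputs never observed within the budget) and in allocating the sampling budget, neither of which this sketch attempts:

```python
import numpy as np

def conformal_lpb(calib_times, alpha=0.1):
    # Distribution-free lower predictive bound from fully observed
    # calibration times T_1..T_n: by exchangeability, the
    # floor(alpha*(n+1))-th smallest value satisfies
    # P(T_new >= LPB) >= 1 - alpha.
    t = np.sort(np.asarray(calib_times, dtype=float))
    k = int(np.floor(alpha * (len(t) + 1)))
    return 0.0 if k < 1 else t[k - 1]   # trivial bound if n is too small
```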
[419] Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama
Main category: cs.LG
TL;DR: RLVR trains policies using automated verifiers instead of human labeling, but binary rewards introduce false negatives and false positives. The paper proposes backward and forward correction algorithms to address verifier errors, showing improved performance in math reasoning tasks.
Details
Motivation: To reduce vulnerability to verifier hacking, RLVR systems use binary rewards, but this introduces false negatives (rejecting correct answers) and false positives (accepting incorrect ones), which degrade training quality.Method: Models verifier as stochastic reward channel with asymmetric noise rates. Derives two correction algorithms: backward correction (de-biases observed reward) and forward correction (reweights score-function terms using only FN rate). Implements them in GRPO-based RLVR pipeline.
Result: Both corrections improve over uncorrected training across models and datasets. Forward correction converges faster and remains stable under heavier noise. Lightweight LLM verifier estimating FN rate online outperforms other state-of-the-art methods.
Conclusion: The proposed correction algorithms effectively mitigate verifier unreliability in RLVR systems, with forward correction showing particular advantages in convergence speed and noise robustness.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary ${0,1}$ during training. This choice carries a cost: it introduces \textit{false negatives} (rejecting correct answers, FNs) and \textit{false positives} (accepting incorrect ones, FPs). For instance, a rule-based checker may mark the correct fraction $\frac{12}{36}$ as wrong when compared against the canonical $\frac{1}{3}$ due to brittle parsing/equivalence rules (FN), while large language model (LLM) judges can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a \textit{backward} correction that de-biases the observed binary reward to recover an \textit{unbiased} estimator of the clean policy gradient. The second is a \textit{forward} correction that reweights score-function terms so that the expected update direction aligns with the \textit{clean gradient}; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, outperforming other state-of-the-art contenders.
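A minimal sketch of the backward correction in its textbook form for an asymmetric binary noise channel; the paper's estimator is derived for the policy-gradient setting and may differ in detail:

```python
def backward_correct(r_obs: float, fp: float, fn: float) -> float:
    # De-bias a binary reward observed through a noisy verifier with
    # false-positive rate fp and false-negative rate fn. The corrected
    # value satisfies E[corrected | clean reward r] = r whenever
    # fp + fn < 1 (e.g. r_obs = 1 maps to (1 - fp) / (1 - fp - fn)).
    assert fp + fn < 1.0, "noise rates must leave the channel informative"
    return (r_obs - fp) / (1.0 - fp - fn)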
[420] Conditional Generative Modeling for Enhanced Credit Risk Management in Supply Chain Finance
Qingkai Zhang, L. Jeff Hong, Houmin Yan
Main category: cs.LG
TL;DR: Proposes a credit risk management framework for 3PL-led supply chain finance in cross-border e-commerce using generative modeling to assess credit risk and determine optimal loan sizes for SMEs.
Details
Motivation: Small- and medium-sized sellers in cross-border e-commerce face financing challenges due to limited credit histories, despite the growth opportunities in this sector. Third-party logistics-led supply chain finance offers a solution but requires advanced credit risk assessment methods.Method: Uses Quantile-Regression-based Generative Metamodeling (QRGMM) for sales distribution modeling, integrated with Deep Factorization Machines (DeepFM) to capture complex covariate interactions. Proposes a unified framework for flexible risk measure estimation and functional risk measures tied to loan levels.
Result: Extensive experiments on synthetic and real-world data validate the model’s effectiveness for both credit risk assessment and loan size determination in cross-border e-commerce supply chain finance.
Conclusion: Generative models show strong potential for enhancing credit risk management in cross-border e-commerce supply chain finance, particularly for supporting financing access for small- and medium-sized sellers.
Abstract: The rapid expansion of cross-border e-commerce (CBEC) has created significant opportunities for small- and medium-sized sellers, yet financing remains a critical challenge due to their limited credit histories. Third-party logistics (3PL)-led supply chain finance (SCF) has emerged as a promising solution, leveraging in-transit inventory as collateral. We propose an advanced credit risk management framework tailored for 3PL-led SCF, addressing the dual challenges of credit risk assessment and loan size determination. Specifically, we leverage conditional generative modeling of sales distributions through Quantile-Regression-based Generative Metamodeling (QRGMM) as the foundation for risk measure estimation. We propose a unified framework that enables flexible estimation of multiple risk measures while introducing a functional risk measure formulation that systematically captures the relationship between these risk measures and varying loan levels, supported by theoretical guarantees. To capture complex covariate interactions in e-commerce sales data, we integrate QRGMM with Deep Factorization Machines (DeepFM). Extensive experiments on synthetic and real-world data validate the efficacy of our model for credit risk assessment and loan size determination. This study explores the use of generative models in CBEC SCF risk management, illustrating their potential to strengthen credit assessment and support financing for small- and medium-sized sellers.
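The generative step in QRGMM is inverse-transform sampling from a learned conditional quantile function; a sketch under that reading, where `cond_quantile_fn` is a hypothetical fitted regressor:

```python
import numpy as np

def qrgmm_sample(cond_quantile_fn, x, n_samples=1000, rng=None):
    # Inverse-transform sampling: draw u ~ Uniform(0, 1) and evaluate
    # the fitted conditional quantile function Q(u | x) to generate
    # sales scenarios for covariates x.
    rng = rng or np.random.default_rng()
    u = rng.uniform(0.0, 1.0, size=n_samples)
    return np.array([cond_quantile_fn(x, ui) for ui in u])
```

Risk measures tied to a given loan level, such as value-at-risk, can then be read off as empirical quantiles of the generated sales scenarios.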
[421] Understanding Generalization in Node and Link Prediction
Antonis Vasileiou, Timo Stoll, Christopher Morris
Main category: cs.LG
TL;DR: The paper introduces a unified framework to analyze generalization properties of MPNNs for node and link prediction, addressing limitations of existing works that overlook graph structure influence and make unrealistic assumptions.
Details
Motivation: Current understanding of MPNN generalization is limited, especially for node- and link-level predictions, with existing works making unrealistic i.i.d. assumptions and neglecting graph structure influence.Method: Developed a unified theoretical framework to analyze MPNN generalization in inductive and transductive settings, incorporating diverse architectural parameters, loss functions, and quantifying graph structure influence.
Result: The framework provides theoretical insights into MPNN generalization capabilities and is supported by empirical studies, showing applicability beyond graphs to any classification task in inductive/transductive settings.
Conclusion: The work deepens understanding of MPNN generalization in node and link prediction tasks and provides a versatile framework applicable to broader classification problems.
Abstract: Using message-passing graph neural networks (MPNNs) for node and link prediction is crucial in various scientific and industrial domains, which has led to the development of diverse MPNN architectures. Although MPNNs work well in practical settings, their ability to generalize beyond the training set remains poorly understood. While some studies have explored MPNNs’ generalization in graph-level prediction tasks, much less attention has been given to node- and link-level predictions. Existing works often rely on unrealistic i.i.d. assumptions, overlooking possible correlations between nodes or links, and assuming fixed aggregation and impractical loss functions while neglecting the influence of graph structure. In this work, we introduce a unified framework to analyze the generalization properties of MPNNs in inductive and transductive node and link prediction settings, incorporating diverse architectural parameters and loss functions and quantifying the influence of graph structure. Additionally, our proposed generalization framework can be applied beyond graphs to any classification task under the inductive or transductive setting. Our empirical study supports our theoretical insights, deepening our understanding of MPNNs’ generalization capabilities in these tasks.
[422] Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise
Luca Scimeca, Thomas Jiralerspong, Berton Earnshaw, Jason Hartford, Yoshua Bengio
Main category: cs.LG
TL;DR: This paper introduces spectrally anisotropic Gaussian diffusion (SAGD), which replaces isotropic noise with structured frequency-diagonal covariance to shape inductive biases in diffusion models, enabling better data modeling and selective corruption omission.
Details
Motivation: To build explicit inductive biases into diffusion probabilistic models (DPMs) to better accommodate target data distributions, as current DPMs have largely implicit inductive biases.Method: Introduces an anisotropic noise operator with structured frequency-diagonal covariance that unifies band-pass masks and power-law weightings, allowing emphasis or suppression of specific frequency bands while maintaining Gaussian forward process.
Result: Empirically shows that induced anisotropy outperforms standard diffusion across several vision datasets and enables selective omission of known corruptions confined to specific frequency bands.
Conclusion: Carefully designed anisotropic forward noise provides a simple yet principled way to tailor inductive bias in DPMs, reshaping probability-flow paths from noise to data.
Abstract: Diffusion Probabilistic Models (DPMs) have achieved strong generative performance, yet their inductive biases remain largely implicit. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target data distribution. We introduce an anisotropic noise operator that shapes these biases by replacing the isotropic forward covariance with a structured, frequency-diagonal covariance. This operator unifies band-pass masks and power-law weightings, allowing us to emphasize or suppress designated frequency bands, while keeping the forward process Gaussian. We refer to this as spectrally anisotropic Gaussian diffusion (SAGD). In this work, we derive the score relation for anisotropic covariances and show that, under full support, the learned score converges to the true data score as $t \to 0$, while anisotropy reshapes the probability-flow path from noise to data. Empirically, we show the induced anisotropy outperforms standard diffusion across several vision datasets, and enables selective omission: learning while ignoring known corruptions confined to specific bands. Together, these results demonstrate that carefully designed anisotropic forward noise provides a simple, yet principled, handle to tailor inductive bias in DPMs.
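A NumPy sketch of frequency-diagonal forward noise as described; the specific power-law weighting below is an illustrative assumption:

```python
import numpy as np

def spectrally_anisotropic_noise(h, w, weights):
    # Frequency-diagonal Gaussian noise: shape white noise in Fourier
    # space with per-frequency weights (a band-pass mask or power-law
    # weighting), then return to pixel space. The forward process stays
    # Gaussian because the operator is linear.
    white = np.random.randn(h, w)
    return np.fft.ifft2(np.fft.fft2(white) * weights).real

# Example: power-law weighting that suppresses high frequencies.
h = w = 64
ky, kx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
weights = (1.0 + np.hypot(kx, ky) * h) ** -1.0
noise = spectrally_anisotropic_noise(h, w, weights)
```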
[423] A Cycle-Consistency Constrained Framework for Dynamic Solution Space Reduction in Noninjective Regression
Hanzhang Jia, Yi Gao
Main category: cs.LG
TL;DR: A cycle consistency-based data-driven training framework for non-injective regression tasks that jointly optimizes forward and backward models to eliminate reliance on preset probability distributions and prior knowledge.
Details
Motivation: To address challenges in multi-output models that heavily rely on preset probability distributions and embedded prior knowledge in non-injective regression tasks.Method: Jointly optimizes a forward model Φ: X→Y and backward model Ψ: Y→X using cycle consistency loss L_cycle = L(Y, Φ(Ψ(Y))) (and vice versa), creating a closed-loop mechanism integrating generation and validation phases.
Result: Achieves cycle reconstruction error below 0.003 on normalized synthetic and simulated datasets, with approximately 30% improvement in evaluation metrics compared to baseline models without cycle consistency.
Conclusion: The framework supports unsupervised learning, significantly reduces reliance on manual intervention, and demonstrates potential advantages in non-injective regression tasks.
Abstract: To address the challenges posed by the heavy reliance of multi-output models on preset probability distributions and embedded prior knowledge in non-injective regression tasks, this paper proposes a cycle consistency-based data-driven training framework. The method jointly optimizes a forward model $\Phi: X \to Y$ and a backward model $\Psi: Y \to X$, where the cycle consistency loss is defined as $L_{\text{cycle}} = L(Y, \Phi(\Psi(Y)))$ (and vice versa). By minimizing this loss, the framework establishes a closed-loop mechanism integrating generation and validation phases, eliminating the need for manual rule design or prior distribution assumptions. Experiments on normalized synthetic and simulated datasets demonstrate that the proposed method achieves a cycle reconstruction error below 0.003 and improves evaluation metrics by approximately 30% compared to baseline models without cycle consistency. Furthermore, the framework supports unsupervised learning and significantly reduces reliance on manual intervention, demonstrating potential advantages in non-injective regression tasks.
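A minimal PyTorch sketch of one joint update under the two cycle terms; model, loss, and optimizer construction are assumed to exist outside the snippet:

```python
import torch

def cycle_consistency_step(phi, psi, x, y, loss_fn, optimizer):
    # One joint update of the forward model phi: X -> Y and the
    # backward model psi: Y -> X under both cycle terms from the
    # abstract: L(Y, phi(psi(Y))) + L(X, psi(phi(X))).
    optimizer.zero_grad()
    loss = loss_fn(y, phi(psi(y))) + loss_fn(x, psi(phi(x)))
    loss.backward()
    optimizer.step()
    return loss.item()
```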
[424] Reasoning-Enhanced Large Language Models for Molecular Property Prediction
Jiaxi Zhuang, Yaorui Shi, Jue Hou, Yunong He, Mingwei Ye, Mingjun Xu, Yuming Su, Linfeng Zhang, Ying Qian, Linfeng Zhang, Guolin Ke, Hengxing Cai
Main category: cs.LG
TL;DR: MPPReasoner is a multimodal LLM that integrates molecular images and SMILES strings for molecular property prediction, featuring chemical reasoning capabilities through a two-stage training approach with SFT and reinforcement learning.
Details
Motivation: Address limitations in existing molecular property prediction methods including poor interpretability, limited cross-task generalization, and lack of chemical reasoning capabilities.Method: Built on Qwen2.5-VL-7B-Instruct, integrates molecular images with SMILES strings. Uses two-stage training: supervised fine-tuning with 16,000 reasoning trajectories, followed by Reinforcement Learning from Principle-Guided Rewards (RLPGR) with verifiable rule-based rewards.
Result: Outperforms best baselines by 7.91% on in-distribution tasks and 4.53% on out-of-distribution tasks across 8 datasets. Shows exceptional cross-task generalization and generates chemically sound reasoning paths.
Conclusion: MPPReasoner significantly enhances interpretability and practical utility for chemists by providing chemically sound reasoning for molecular property prediction.
Abstract: Molecular property prediction is crucial for drug discovery and materials science, yet existing approaches suffer from limited interpretability, poor cross-task generalization, and lack of chemical reasoning capabilities. Traditional machine learning models struggle with task transferability, while specialized molecular language models provide little insight into their decision-making processes. To address these limitations, we propose \textbf{MPPReasoner}, a multimodal large language model that incorporates chemical reasoning for molecular property prediction. Our approach, built upon Qwen2.5-VL-7B-Instruct, integrates molecular images with SMILES strings to enable comprehensive molecular understanding. We develop a two-stage training strategy: supervised fine-tuning (SFT) using 16,000 high-quality reasoning trajectories generated through expert knowledge and multiple teacher models, followed by Reinforcement Learning from Principle-Guided Rewards (RLPGR). RLPGR employs verifiable, rule-based rewards that systematically evaluate chemical principle application, molecular structure analysis, and logical consistency through computational verification. Extensive experiments across 8 datasets demonstrate significant performance improvements, with MPPReasoner outperforming the best baselines by 7.91% and 4.53% on in-distribution and out-of-distribution tasks respectively. MPPReasoner exhibits exceptional cross-task generalization and generates chemically sound reasoning paths that provide valuable insights into molecular property analysis, substantially enhancing both interpretability and practical utility for chemists. Code is available at https://anonymous.4open.science/r/MPPReasoner-12687.
[425] Light-Weight Diffusion Multiplier and Uncertainty Quantification for Fourier Neural Operators
Albert Matveev, Sanmitra Ghosh, Aamal Hussain, James-Michael Leahy, Michalis Michaelides
Main category: cs.LG
TL;DR: DINOZAUR is a diffusion-based neural operator that replaces FNO’s dense tensor multiplier with a dimensionality-independent diffusion multiplier using learnable time parameters, enabling efficient uncertainty quantification and reduced parameter count.
Details
Motivation: FNOs face scalability issues from overparameterization and lack native uncertainty quantification, which is crucial for scientific applications. Current UQ methods ignore geometric inductive biases.Method: Replaces FNO’s dense tensor multiplier with a diffusion multiplier inspired by heat kernel structure, using single learnable time parameter per channel. Defines priors over time parameters to create Bayesian neural operator for uncertainty quantification.
Result: Achieves competitive or superior performance across PDE benchmarks while providing efficient uncertainty quantification and drastically reduced parameter count and memory footprint.
Conclusion: DINOZAUR offers a scalable neural operator with built-in uncertainty quantification that maintains performance while being more parameter-efficient than traditional FNOs.
Abstract: Operator learning is a powerful paradigm for solving partial differential equations, with Fourier Neural Operators serving as a widely adopted foundation. However, FNOs face significant scalability challenges due to overparameterization and offer no native uncertainty quantification – a key requirement for reliable scientific and engineering applications. Instead, neural operators rely on post hoc UQ methods that ignore geometric inductive biases. In this work, we introduce DINOZAUR: a diffusion-based neural operator parametrization with uncertainty quantification. Inspired by the structure of the heat kernel, DINOZAUR replaces the dense tensor multiplier in FNOs with a dimensionality-independent diffusion multiplier that has a single learnable time parameter per channel, drastically reducing parameter count and memory footprint without compromising predictive performance. By defining priors over those time parameters, we cast DINOZAUR as a Bayesian neural operator to yield spatially correlated outputs and calibrated uncertainty estimates. Our method achieves competitive or superior performance across several PDE benchmarks while providing efficient uncertainty quantification.
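A sketch of the diffusion multiplier as the abstract describes it (heat-kernel structure, one learnable time per channel); normalization and mode-truncation details are assumptions:

```python
import torch

def diffusion_multiplier(x, log_t):
    # Heat-kernel-style spectral multiplier: in place of an FNO's dense
    # per-mode weight tensor, each channel c scales Fourier mode k by
    # exp(-|k|^2 * t_c), with a single learnable time t_c per channel
    # (parameterized as log_t, shape (C,), so t_c stays positive).
    B, C, H, W = x.shape
    X = torch.fft.rfft2(x)                                    # (B, C, H, W//2+1)
    ky = torch.fft.fftfreq(H, device=x.device).view(H, 1)
    kx = torch.fft.rfftfreq(W, device=x.device).view(1, W // 2 + 1)
    k2 = kx**2 + ky**2                                        # |k|^2 per mode
    t = log_t.exp().view(1, C, 1, 1)
    return torch.fft.irfft2(X * torch.exp(-k2 * t), s=(H, W))
```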
[426] Optimally Deep Networks - Adapting Model Depth to Datasets for Superior Efficiency
Shaharyar Ahmed Khan Tareen, Filza Khan Tareen
Main category: cs.LG
TL;DR: ODNs optimize neural network depth to match dataset complexity, reducing computational costs and memory usage while maintaining accuracy.
Details
Motivation: Deep neural networks are often overparameterized for simple datasets, leading to wasted computation, energy consumption, and memory usage, making deployment on resource-constrained devices impractical.Method: Progressive depth expansion training strategy that starts with shallow networks and incrementally increases depth as earlier blocks converge, removing redundant layers to use only optimal depth for each dataset.
Result: ResNet-18 and ResNet-34 achieved 98.64% and 96.44% reduction in memory footprint for MNIST and SVHN datasets while maintaining competitive accuracies of 99.31% and 96.08% respectively.
Conclusion: ODNs provide an effective approach to balance model depth with task complexity, significantly reducing computational costs and memory requirements while preserving model performance.
Abstract: Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depths but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS-like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN achieve up to 98.64% and 96.44% reduction in memory footprint, while maintaining competitive accuracies of 99.31% and 96.08%, respectively.
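A sketch of the progressive depth expansion loop under the description above; every helper callable here is a hypothetical placeholder:

```python
def progressive_depth_expansion(make_model, train_blocks, accuracy,
                                max_depth, target_acc):
    # Start shallow, add one block at a time once the earlier blocks
    # have converged, and stop at the shallowest depth that reaches
    # the target accuracy -- the "optimal depth" for this dataset.
    for depth in range(1, max_depth + 1):
        model = make_model(depth)      # earlier blocks can be carried over
        train_blocks(model, depth)     # train (or fine-tune) at this depth
        if accuracy(model) >= target_acc:
            return model, depth
    return model, max_depth
```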
[427] Evaluating Sparse Autoencoders for Monosemantic Representation
Moghis Fereidouni, Muhammad Umair Haider, Peizhong Ju, A. B. Siddique
Main category: cs.LG
TL;DR: SAEs reduce polysemanticity in LLMs by creating sparser, more interpretable features, enabling better concept separability and more precise interventions than base models.
Details
Motivation: Polysemanticity in LLMs makes neurons activate for multiple unrelated concepts, hindering interpretability. SAEs have been proposed to address this but lacked systematic quantitative comparison with base models.Method: Introduced concept separability score using Jensen-Shannon distance, evaluated SAEs vs base models on Gemma-2-2B and DeepSeek-R1 across 5 datasets, and developed APP intervention method using concept-conditioned activation distributions.
Result: SAEs significantly reduce polysemanticity and achieve higher concept separability. APP intervention method provides the smallest perplexity increase while effectively removing concepts.
Conclusion: SAEs successfully mitigate polysemanticity and enable more precise concept-level control, with APP offering an effective targeted suppression approach for interpretable interventions.
Abstract: A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, no quantitative comparison has examined how concept activation distributions differ between SAEs and their base models. This paper provides the first systematic evaluation of SAEs against base models through the lens of activation distributions. We introduce a fine-grained concept separability score based on the Jensen-Shannon distance, which captures how distinctly a neuron’s activation distributions vary across concepts. Using two large language models (Gemma-2-2B and DeepSeek-R1) and multiple SAE variants across five datasets (including word-level and sentence-level), we show that SAEs reduce polysemanticity and achieve higher concept separability. To assess practical utility, we evaluate concept-level interventions using two strategies: full neuron masking and partial suppression. We find that, compared to base models, SAEs enable more precise concept-level control when using partial suppression. Building on this, we propose Attenuation via Posterior Probabilities (APP), a new intervention method that uses concept-conditioned activation distributions for targeted suppression. APP achieves the smallest perplexity increase while remaining highly effective at concept removal.
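The basic form of such a separability score is easy to state; a sketch for a single neuron and two concepts, with histogram binning as an assumed implementation detail (the paper's score is finer-grained):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def concept_separability(acts_a, acts_b, bins=50):
    # Jensen-Shannon distance between one neuron's activation
    # histograms under two concepts: near 0 means the distributions
    # overlap (polysemantic), near 1 means they are well separated.
    lo = min(acts_a.min(), acts_b.min())
    hi = max(acts_a.max(), acts_b.max())
    pa, _ = np.histogram(acts_a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(acts_b, bins=bins, range=(lo, hi))
    return jensenshannon(pa + 1e-12, pb + 1e-12)  # scipy normalizes inputs
```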
[428] LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling
Ali Ramlaoui, Martin Siron, Inel Djafar, Joseph Musielewicz, Amandine Rossello, Victor Schmidt, Alexandre Duval
Main category: cs.LG
TL;DR: LeMat-Traj is a curated dataset of over 120 million atomic configurations that standardizes quantum mechanical trajectory data from multiple sources to enable training of accurate machine learning interatomic potentials.
Details
Motivation: To address the fragmented availability and inconsistent formatting of quantum mechanical trajectory datasets from DFT calculations, which are expensive to generate but difficult to combine due to format variations, metadata differences, and accessibility issues.Method: Aggregated data from large-scale repositories (Materials Project, Alexandria, OQMD), standardized data representation, harmonized results across DFT functionals (PBE, PBESol, SCAN, r2SCAN), and filtered for high-quality configurations. Also developed LeMaterial-Fetcher, an open-source library for reproducible data integration.
Result: Significantly lowers barrier for training transferrable and accurate MLIPs. Fine-tuning models pre-trained on high-force data with LeMat-Traj achieved significant reduction in force prediction errors on relaxation tasks. Dataset spans both relaxed low-energy states and high-energy, high-force structures.
Conclusion: LeMat-Traj provides a standardized, high-quality dataset that enables better training of machine learning interatomic potentials, and the accompanying LeMaterial-Fetcher library offers a reproducible framework for community to incorporate new data sources and evolve large-scale materials datasets.
Abstract: The development of accurate machine learning interatomic potentials (MLIPs) is limited by the fragmented availability and inconsistent formatting of quantum mechanical trajectory datasets derived from Density Functional Theory (DFT). These datasets are expensive to generate yet difficult to combine due to variations in format, metadata, and accessibility. To address this, we introduce LeMat-Traj, a curated dataset comprising over 120 million atomic configurations aggregated from large-scale repositories, including the Materials Project, Alexandria, and OQMD. LeMat-Traj standardizes data representation, harmonizes results and filters for high-quality configurations across widely used DFT functionals (PBE, PBESol, SCAN, r2SCAN). It significantly lowers the barrier for training transferrable and accurate MLIPs. LeMat-Traj spans both relaxed low-energy states and high-energy, high-force structures, complementing molecular dynamics and active learning datasets. By fine-tuning models pre-trained on high-force data with LeMat-Traj, we achieve a significant reduction in force prediction errors on relaxation tasks. We also present LeMaterial-Fetcher, a modular and extensible open-source library developed for this work, designed to provide a reproducible framework for the community to easily incorporate new data sources and ensure the continued evolution of large-scale materials datasets. LeMat-Traj and LeMaterial-Fetcher are publicly available at https://huggingface.co/datasets/LeMaterial/LeMat-Traj and https://github.com/LeMaterial/lematerial-fetcher.
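A quick way to inspect the dataset without downloading all 120 million configurations, assuming the standard Hugging Face `datasets` streaming API; the split name and available columns should be checked on the dataset page:

```python
from datasets import load_dataset

# Stream a handful of records from the dataset linked above.
ds = load_dataset("LeMaterial/LeMat-Traj", split="train", streaming=True)
for record in ds.take(3):
    print(sorted(record.keys()))
```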
[429] Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning
Dongkwan Lee, Junhoo Lee, Nojun Kwak
Main category: cs.LG
TL;DR: Deep Edge Filter applies high-pass filtering to neural network features to improve generalization by isolating high-frequency semantic components while removing low-frequency domain biases.
Details
Motivation: The hypothesis that neural networks encode task-relevant semantic information in high-frequency components and domain-specific biases in low-frequency components of deep features.Method: Subtracting low-pass filtered outputs from original features to isolate generalizable representations while preserving architectural integrity.
Result: Consistent performance improvements across diverse domains (Vision, Text, 3D, Audio) regardless of model architecture and data modality. Analysis shows feature sparsification and effective isolation of high-frequency components.
Conclusion: The method empirically validates the core hypothesis and provides a practical approach to improve model generalizability through frequency-domain feature processing.
Abstract: We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at https://github.com/dongkwani/DeepEdgeFilter.
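The core operation is a one-liner; a PyTorch sketch with average pooling standing in as the low-pass filter (the paper's exact filter choice is not specified in the abstract):

```python
import torch.nn.functional as F

def deep_edge_filter(feat, k=3):
    # High-pass a feature map by subtracting a low-pass (blurred) copy,
    # keeping the high-frequency components the paper argues carry
    # task-relevant semantics while discarding low-frequency biases.
    low = F.avg_pool2d(feat, kernel_size=k, stride=1, padding=k // 2)
    return feat - low
```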
[430] Learning Unified Representations from Heterogeneous Data for Robust Heart Rate Modeling
Peng Yang, Zhengdong Huang, Zicheng Xie, Wentao Tian, Jingyu Liu, Lunhong Dong
Main category: cs.LG
TL;DR: A framework for heart rate prediction that addresses data heterogeneity through source-agnostic and user-agnostic representations, achieving significant performance improvements on benchmark datasets.
Details
Motivation: Real-world heart rate prediction faces data heterogeneity challenges from fragmented device markets (source heterogeneity) and individual physiological differences (user heterogeneity), limiting existing methods' performance.Method: Proposes a framework with random feature dropout for source heterogeneity and time-aware attention with contrastive learning for user heterogeneity. Creates ParroTao benchmark dataset for evaluation.
Result: Outperforms existing baselines by 17% on ParroTao and 15% on FitRec datasets. Learned representations show strong discriminative power and practical value in downstream applications.
Conclusion: The proposed framework effectively handles data heterogeneity in heart rate prediction, demonstrating superior performance and practical utility for real-world health monitoring applications.
Abstract: Heart rate prediction is vital for personalized health monitoring and fitness, yet it frequently faces a critical challenge when deployed in the real world: data heterogeneity. We classify it along two key dimensions: source heterogeneity from fragmented device markets with varying feature sets, and user heterogeneity reflecting distinct physiological patterns across individuals and activities. Existing methods either discard device-specific information, or fail to model user-specific differences, limiting their real-world performance. To address this, we propose a framework that learns latent representations agnostic to both heterogeneity, enabling downstream predictors to work consistently under heterogeneous data patterns. Specifically, we introduce a random feature dropout strategy to handle source heterogeneity, making the model robust to various feature sets. To manage user heterogeneity, we employ a time-aware attention module to capture long-term physiological traits and use a contrastive learning objective to build a discriminative representation space. To reflect the heterogeneous nature of real-world data, we created and publicly released a new benchmark dataset, ParroTao. Evaluations on both ParroTao and the public FitRec dataset show that our model significantly outperforms existing baselines by 17% and 15%, respectively. Furthermore, analysis of the learned representations demonstrates their strong discriminative power, and a downstream application task confirms the practical value of our model.
[431] Rethinking Layer-wise Gaussian Noise Injection: Bridging Implicit Objectives and Privacy Budget Allocation
Qifeng Tan, Shusen Yang, Xuebin Ren, Yikai Zhang
Main category: cs.LG
TL;DR: A unified analytical framework for layer-wise Gaussian mechanisms in differentially private deep learning that connects noise allocation strategies to privacy-utility tradeoffs, revealing flaws in existing methods and proposing a SNR-Consistent strategy for better performance.
Details
Motivation: Existing layer-wise Gaussian mechanisms use heuristic noise allocation strategies without theoretical grounding, lacking understanding of how noise allocation connects to formal privacy-utility tradeoffs.Method: Developed a unified analytical framework that systematically connects layer-wise noise injection strategies with optimization objectives and privacy budget allocations. Proposed SNR-Consistent noise allocation strategy that unifies signal preservation and privacy budget efficiency.
Result: Analysis revealed existing approaches optimize ill-posed objectives, either ignoring inter-layer SNR consistency or leading to inefficient privacy budget use. Proposed method achieves better signal preservation and more efficient privacy budget utilization, outperforming existing strategies in both centralized and federated learning settings.
Conclusion: The framework provides diagnostic insights into prior methods and theoretical guidance for designing adaptive and effective noise injection schemes in deep models, achieving better privacy-utility tradeoffs.
Abstract: Layer-wise Gaussian mechanisms (LGM) enhance flexibility in differentially private deep learning by injecting noise into partitioned gradient vectors. However, existing methods often rely on heuristic noise allocation strategies, lacking a rigorous understanding of their theoretical grounding in connecting noise allocation to formal privacy-utility tradeoffs. In this paper, we present a unified analytical framework that systematically connects layer-wise noise injection strategies with their implicit optimization objectives and associated privacy budget allocations. Our analysis reveals that several existing approaches optimize ill-posed objectives – either ignoring inter-layer signal-to-noise ratio (SNR) consistency or leading to inefficient use of the privacy budget. In response, we propose a SNR-Consistent noise allocation strategy that unifies both aspects, yielding a noise allocation scheme that achieves better signal preservation and more efficient privacy budget utilization. Extensive experiments in both centralized and federated learning settings demonstrate that our method consistently outperforms existing allocation strategies, achieving better privacy-utility tradeoffs. Our framework not only offers diagnostic insights into prior methods but also provides theoretical guidance for designing adaptive and effective noise injection schemes in deep models.
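To make the SNR-consistency idea concrete: choosing each layer's noise standard deviation proportional to its gradient norm equalizes the per-layer SNR. The sketch below illustrates only that invariant; mapping the scale to a formal $(\epsilon, \delta)$ budget is the accounting the paper handles and this sketch does not:

```python
import numpy as np

def snr_consistent_sigmas(grad_norms, noise_scale):
    # Setting sigma_i = noise_scale * ||g_i|| makes the per-layer SNR
    # ||g_i|| / sigma_i = 1 / noise_scale identical across layers,
    # avoiding the inter-layer SNR inconsistency the paper criticizes.
    g = np.asarray(grad_norms, dtype=float)
    return noise_scale * g
```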
[432] PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design
Andy Xu, Rohan Desai, Larry Wang, Gabriel Hope, Ethan Ritz
Main category: cs.LG
TL;DR: PLaID++ is an LLM post-trained for stable and property-guided crystal generation that uses a compact Wyckoff text representation and temperature scaling to generate diverse, thermodynamically stable materials structures more effectively than prior methods.
Details
Motivation: To address the challenge of generating diverse candidate materials that satisfy constraints rather than just correct answers, particularly in materials generation where the objective is to produce varied structures meeting specific properties.Method: Uses a compact, symmetry-informed Wyckoff text representation for computational efficiency and physical priors, and employs temperature scaling as an entropy regularizer to prevent mode collapse and encourage exploration during generation.
Result: PLaID++ generates structures that are thermodynamically stable, unique, and novel at ~50% greater rate than prior methods, and can conditionally generate structures with desired space group properties.
Conclusion: The work demonstrates the potential of adapting NLP post-training techniques to materials design, enabling targeted and efficient discovery of novel materials through constraint-satisfying generation.
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising approach to improve correctness in LLMs; however, in many scientific problems, the objective is not necessarily to produce the correct answer, but instead to produce a diverse array of candidates that satisfy a set of constraints. We study this challenge in the context of materials generation. To this end, we introduce PLaID++, an LLM post-trained for stable and property-guided crystal generation. We find that performance hinges on our crystallographic representation and reward formulation. First, we introduce a compact, symmetry-informed Wyckoff text representation which improves computational efficiency and encourages generalization from physical priors. Second, we demonstrate that temperature scaling acts as an entropy regularizer which counteracts mode collapse and encourages exploration. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a $\sim$50% greater rate than prior methods and conditionally generates structures with desired space group properties. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
[433] Soft Graph Transformer for MIMO Detection
Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang
Main category: cs.LG
TL;DR: Soft Graph Transformer (SGT) is a neural architecture for MIMO detection that combines self-attention and graph-aware cross-attention to achieve near-ML performance with computational efficiency.
Details
Motivation: ML detection has exponential complexity in large systems, while conventional message-passing algorithms fail in finite dimensions. Existing Transformer-based detectors ignore MIMO factor graph structure and cannot use prior soft information.Method: SGT combines self-attention for contextual dependencies within symbol/constraint subgraphs with graph-aware cross-attention for structured message passing across subgraphs. It has soft-input interface for auxiliary priors and produces soft outputs.
Result: SGT achieves near-ML performance in experiments and maintains computational efficiency while producing effective soft outputs.
Conclusion: SGT provides a flexible and interpretable framework for receiver systems that can leverage soft priors, addressing limitations of existing MIMO detection approaches.
Abstract: We propose the Soft Graph Transformer (SGT), a soft-input-soft-output neural architecture designed for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its exponential complexity makes it infeasible in large systems, and conventional message-passing algorithms rely on asymptotic assumptions that often fail in finite dimensions. Recent Transformer-based detectors show strong performance but typically overlook the MIMO factor graph structure and cannot exploit prior soft information. SGT addresses these limitations by combining self-attention, which encodes contextual dependencies within symbol and constraint subgraphs, with graph-aware cross-attention, which performs structured message passing across subgraphs. Its soft-input interface allows the integration of auxiliary priors, producing effective soft outputs while maintaining computational efficiency. Experiments demonstrate that SGT achieves near-ML performance and offers a flexible and interpretable framework for receiver systems that leverage soft priors.
[434] Traces Propagation: Memory-Efficient and Scalable Forward-Only Learning in Spiking Neural Networks
Lorenzo Pes, Bojian Yin, Sander Stuijk, Federico Corradi
Main category: cs.LG
TL;DR: Proposes Traces Propagation (TP), a fully local and memory-efficient learning rule for Spiking Neural Networks that combines eligibility traces with layer-wise contrastive loss, outperforming other local methods on neuromorphic datasets.
Details
Motivation: Current SNN training methods like BPTT violate biological locality principles and have high computational/memory demands, while existing local learning rules fail to address spatial credit assignment without auxiliary matrices that hinder scalability.Method: TP combines eligibility traces for temporal credit assignment with layer-wise contrastive loss for spatial credit assignment, eliminating the need for auxiliary layer-wise matrices and operating in a forward-only manner.
Result: TP outperforms other fully local learning rules on NMNIST and SHD datasets, shows competitive performance on DVS-GESTURE and DVS-CIFAR10, scales effectively to deeper SNN architectures like VGG-9, and provides favorable memory scaling for datasets with many classes.
Conclusion: TP enables efficient, scalable, and fully local learning for SNNs, making it suitable for practical fine-tuning tasks and paving the way for efficient edge learning.
Abstract: Spiking Neural Networks (SNNs) provide an efficient framework for processing dynamic spatio-temporal signals and for investigating the learning principles underlying biological neural systems. A key challenge in training SNNs is to solve both spatial and temporal credit assignment. The dominant approach for training SNNs is Backpropagation Through Time (BPTT) with surrogate gradients. However, BPTT is in stark contrast with the spatial and temporal locality observed in biological neural systems and leads to high computational and memory demands, limiting efficient training strategies and on-device learning. Although existing local learning rules achieve local temporal credit assignment by leveraging eligibility traces, they fail to address the spatial credit assignment without resorting to auxiliary layer-wise matrices, which increase memory overhead and hinder scalability, especially on embedded devices. In this work, we propose Traces Propagation (TP), a forward-only, memory-efficient, scalable, and fully local learning rule that combines eligibility traces with a layer-wise contrastive loss without requiring auxiliary layer-wise matrices. TP outperforms other fully local learning rules on NMNIST and SHD datasets. On more complex datasets such as DVS-GESTURE and DVS-CIFAR10, TP showcases competitive performance and scales effectively to deeper SNN architectures such as VGG-9, while providing favorable memory scaling compared to prior fully local scalable rules, for datasets with a significant number of classes. Finally, we show that TP is well suited for practical fine-tuning tasks, such as keyword spotting on the Google Speech Commands dataset, thus paving the way for efficient learning at the edge.
[435] Theory of periodic convolutional neural network
Yuqing Liu
Main category: cs.LG
TL;DR: Periodic CNNs with boundary conditions can approximate ridge functions with d-1 variables in d-dimensional space, achieving a sharp characterization of expressive power.
Details
Motivation: To incorporate periodic boundary conditions into CNNs and establish their theoretical approximation capabilities for high-dimensional ridge structures.Method: Developed periodic CNN architecture with rigorous mathematical analysis of approximation theorems for ridge functions.
Result: Proved periodic CNNs can approximate ridge functions depending on d-1 linear variables in d-dimensional space, which is impossible with fewer variables.
Conclusion: Periodic CNNs expand CNN approximation theory and are well-suited for problems with high-dimensional ridge structures in wrapped domains, physics, and materials science.
Abstract: We introduce a novel convolutional neural network architecture, termed the \emph{periodic CNN}, which incorporates periodic boundary conditions into the convolutional layers. Our main theoretical contribution is a rigorous approximation theorem: periodic CNNs can approximate ridge functions depending on $d-1$ linear variables in a $d$-dimensional input space, while such approximation is impossible in lower-dimensional ridge settings ($d-2$ or fewer variables). This result establishes a sharp characterization of the expressive power of periodic CNNs. Beyond the theory, our findings suggest that periodic CNNs are particularly well-suited for problems where data naturally admits a ridge-like structure of high intrinsic dimension, such as image analysis on wrapped domains, physics-informed learning, and materials science. The work thus both expands the mathematical foundation of CNN approximation theory and highlights a class of architectures with surprising and practically relevant approximation capabilities.
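The architectural ingredient is easy to state in code: convolution with wrap-around padding, so the filter sees the input as if it lived on a circle. A minimal NumPy sketch of one periodic convolution layer:

```python
import numpy as np

def periodic_conv1d(x, kernel):
    """1-D convolution with periodic (wrap-around) boundary conditions,
    the defining ingredient of a periodic CNN layer."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad, mode="wrap")              # periodic padding
    return np.convolve(xp, kernel, mode="valid")[:len(x)]

x = np.sin(2 * np.pi * np.arange(8) / 8)          # signal on a circle
out = periodic_conv1d(x, np.array([0.25, 0.5, 0.25]))
assert out.shape == x.shape                       # output stays on the circle
```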
[436] High-Probability Analysis of Online and Federated Zero-Order Optimisation
Arya Akhavan, David Janz, El-Mahdi El-Mhamdi
Main category: cs.LG
TL;DR: FedZero is a federated zero-order optimization algorithm with high-probability theoretical guarantees for convex settings, providing novel concentration tools for Lipschitz functions and sub-Gamma variables.
Details
Motivation: To address distributed learning in gradient-free zero-order optimization and establish strong theoretical guarantees that previous methods only achieved in expectation.
Method: Developed FedZero algorithm with three main contributions: federated convex optimization guarantees, single-worker zero-order convergence proofs, and novel concentration inequalities for Lipschitz functions on l1-sphere and squared sub-Gamma variables.
Result: Achieved high-probability guarantees for regret minimization in federated convex setting and established first high-probability convergence guarantees for classical zero-order optimization with two-point feedback.
Conclusion: FedZero provides rigorous theoretical foundations for federated zero-order optimization with novel probabilistic tools that may have broader applications beyond the immediate context.
Abstract: We study distributed learning in the context of gradient-free zero-order optimisation and introduce FedZero, a federated zero-order algorithm with sharp theoretical guarantees. Our contributions are threefold. First, in the federated convex setting, we derive high-probability guarantees for regret minimisation achieved by FedZero. Second, in the single-worker regime, corresponding to the classical zero-order framework with two-point feedback, we establish the first high-probability convergence guarantees for convex zero-order optimisation, strengthening previous results that held only in expectation. Third, to establish these guarantees, we develop novel concentration tools: (i) concentration inequalities with explicit constants for Lipschitz functions under the uniform measure on the $\ell_1$-sphere, and (ii) a time-uniform concentration inequality for squared sub-Gamma random variables. These probabilistic results underpin our high-probability guarantees and may also be of independent interest.
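For context, the classical two-point feedback estimator analyzed in the single-worker setting looks like the sketch below. The direction is drawn uniformly from the l1-sphere, matching the geometry of the paper's concentration inequalities; the exact estimator scaling used in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_l1_sphere(d):
    # Uniform on the l1-sphere: exponential magnitudes, random signs.
    e = rng.exponential(size=d)
    return rng.choice([-1.0, 1.0], size=d) * e / e.sum()

def two_point_grad(f, x, h=1e-3):
    """Two-point zero-order gradient estimate: query f at x +/- h*u
    and move along the random direction u."""
    u = sample_l1_sphere(len(x))
    return (len(x) / (2 * h)) * (f(x + h * u) - f(x - h * u)) * u

f = lambda x: 0.5 * np.dot(x, x)                  # toy convex objective
x = np.ones(5)
g = np.mean([two_point_grad(f, x) for _ in range(2000)], axis=0)
# g roughly aligns with the true gradient x, up to estimator scaling.
```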
[437] F-Adapter: Frequency-Adaptive Parameter-Efficient Fine-Tuning in Scientific Machine Learning
Hangwei Zhang, Chun Kang, Yan Wang, Difan Zou
Main category: cs.LG
TL;DR: This paper introduces Frequency-Adaptive Adapter (F-Adapter), a parameter-efficient fine-tuning method for scientific machine learning that outperforms existing techniques like LoRA by adapting to spectral complexity in PDE solutions.
Details
Motivation: Parameter-efficient fine-tuning (PEFT) has been effective in vision and language but unexplored in scientific machine learning for modeling physical systems. The study aims to adapt PEFT methods for pre-trained Large Operator Models (LOMs) in scientific domains.
Method: The authors systematically study PEFT for LOMs, finding adapters outperform LoRA. They theoretically analyze approximation errors and introduce F-Adapter, which allocates adapter capacity based on spectral complexity - higher dimensions for low-frequency components and lower dimensions for high-frequency components.
Result: F-Adapters achieve state-of-the-art results on multiple 3D Navier-Stokes benchmarks, significantly improving both generalization and spectral fidelity over LoRA and other PEFT techniques commonly used in LLMs.
Conclusion: This is the first exploration of PEFT for scientific machine learning, establishing F-Adapter as an effective paradigm that leverages the spectral sparsity of PDE solutions for superior performance in modeling physical systems.
Abstract: Parameter-efficient fine-tuning (PEFT) of powerful pre-trained models for complex downstream tasks has proven effective in vision and language processing, yet this paradigm remains unexplored in scientific machine learning, where the objective is to model complex physical systems. We conduct the first systematic study of PEFT for pre-trained Large Operator Models (LOMs) obtained by scaling variants of Fourier Neural Operator. First, we observe that the widely used Low-Rank Adaptation (LoRA) yields markedly poorer performance on LOMs than Adapter tuning. Then, we further theoretically establish that stacked LoRA incurs a depth-amplified lower bound on approximation error within Fourier layers, whereas adapters retain universal approximation capacity and, by concentrating parameters on energy-dominant low-frequency modes, attain exponentially decaying error with bottleneck width in the Fourier domain. Motivated by the robust empirical gains of adapters and by our theoretical characterization of PDE solutions as spectrally sparse, we introduce Frequency-Adaptive Adapter (F-Adapter). F-Adapter allocates adapter capacity based on spectral complexity, assigning higher-dimension modules to low-frequency components and lower-dimension modules to high-frequency components. Our F-Adapters establish state-of-the-art (SOTA) results on multiple challenging 3D Navier-Stokes benchmarks, markedly enhancing both generalization and spectral fidelity over LoRA and other PEFT techniques commonly used in LLMs. To the best of our knowledge, this work is the first to explore PEFT for scientific machine-learning and establishes F-Adapter as an effective paradigm for this domain.
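A hypothetical illustration of the frequency-adaptive capacity idea: adapter bottleneck widths shrink as the frequency band index grows, concentrating parameters on energy-dominant low-frequency modes. The geometric schedule below is an assumption for illustration, not the paper's actual allocation rule.

```python
def fadapter_dims(num_bands, d_max=64, d_min=4):
    """Hypothetical capacity schedule in the spirit of F-Adapter:
    wide bottlenecks for low-frequency bands, narrow for high ones."""
    ratio = (d_min / d_max) ** (1 / max(num_bands - 1, 1))
    return [max(int(round(d_max * ratio**i)), d_min) for i in range(num_bands)]

print(fadapter_dims(6))   # e.g. [64, 37, 21, 12, 7, 4]
```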
[438] Learning More with Less: A Generalizable, Self-Supervised Framework for Privacy-Preserving Capacity Estimation with EV Charging Data
Anushiya Arunan, Yan Qin, Xiaoli Li, U-Xuan Tan, H. Vincent Poor, Chau Yuen
Main category: cs.LG
TL;DR: A self-supervised learning framework for EV battery capacity estimation using privacy-friendly charging data snippets, achieving 31.9% lower error than benchmarks under domain shifts.
Details
Motivation: Practical data limitations from privacy regulations and labeled data shortages hinder development of generalizable battery capacity estimation models robust to real-world distribution shifts.
Method: Self-supervised pre-training with snippet similarity-weighted masked input reconstruction, using contrastive learning to capture high-level similarities among fragmented privacy-friendly charging data snippets.
Result: Model consistently outperforms state-of-the-art baselines with 31.9% lower test error, even under challenging domain-shifted settings affected by manufacturer and age-induced distribution shifts.
Conclusion: The proposed framework effectively learns rich representations from privacy-friendly data and demonstrates strong generalization capability for battery capacity estimation in real-world EV operations.
Abstract: Accurate battery capacity estimation is key to alleviating consumer concerns about battery performance and reliability of electric vehicles (EVs). However, practical data limitations imposed by stringent privacy regulations and labeled data shortages hamper the development of generalizable capacity estimation models that remain robust to real-world data distribution shifts. While self-supervised learning can leverage unlabeled data, existing techniques are not particularly designed to learn effectively from challenging field data – let alone from privacy-friendly data, which are often less feature-rich and noisier. In this work, we propose a first-of-its-kind capacity estimation model based on self-supervised pre-training, developed on a large-scale dataset of privacy-friendly charging data snippets from real-world EV operations. Our pre-training framework, snippet similarity-weighted masked input reconstruction, is designed to learn rich, generalizable representations even from less feature-rich and fragmented privacy-friendly data. Our key innovation lies in harnessing contrastive learning to first capture high-level similarities among fragmented snippets that otherwise lack meaningful context. With our snippet-wise contrastive learning and subsequent similarity-weighted masked reconstruction, we are able to learn rich representations of both granular charging patterns within individual snippets and high-level associative relationships across different snippets. Bolstered by this rich representation learning, our model consistently outperforms state-of-the-art baselines, achieving 31.9% lower test error than the best-performing benchmark, even under challenging domain-shifted settings affected by both manufacturer and age-induced distribution shifts. Source code is available at https://github.com/en-research/GenEVBattery.
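A rough sketch of what a snippet-similarity-weighted masked reconstruction objective could look like: per-snippet MSE on masked positions, scaled by a weight derived from the contrastive pre-step. The masking scheme and the way similarities become weights are illustrative assumptions, not the paper's loss.

```python
import numpy as np

def weighted_masked_recon_loss(recon, target, mask, sim_weights):
    """Per-snippet masked MSE, re-weighted by snippet similarity."""
    per_snippet = ((recon - target) ** 2 * mask).sum(axis=1) \
        / np.maximum(mask.sum(axis=1), 1)
    return float((sim_weights * per_snippet).mean())

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 100))                # 4 charging snippets
mask = (rng.random((4, 100)) < 0.3).astype(float) # masked positions
recon = target + 0.1 * rng.normal(size=target.shape)
w = rng.dirichlet(np.ones(4)) * 4                 # similarity weights, mean ~1
print(weighted_masked_recon_loss(recon, target, mask, w))
```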
[439] Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates
Pafue Christy Nganjimi, Andrew Soltan, Danielle Belgrave, Lei Clifton, David A. Clifton, Anshul Thakur
Main category: cs.LG
TL;DR: The paper proposes using smooth quadratic Bézier curves as parametric surrogates for noisy SGD trajectories in dataset condensation, improving stability, convergence speed, and reducing memory overhead while maintaining clinical utility.
Details
Motivation: Current dataset condensation methods use full SGD trajectories as alignment targets, but these trajectories are noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead.
Method: Replace full SGD trajectories with smooth, low-loss parametric surrogates using quadratic Bézier curves that connect initial and final model states from real training trajectories, providing noise-free, low-curvature supervision signals.
Result: The method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development with stabilized gradients, accelerated convergence, and eliminated dense trajectory storage.
Conclusion: Bézier-mode connections serve as effective surrogates for SGD paths in dataset condensation, providing superior performance while addressing key limitations of existing methods in clinical data applications.
Abstract: Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic Bézier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify Bézier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
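The surrogate itself is a one-liner: a quadratic Bézier curve through weight space whose endpoints are a real run's initial and final states. Alignment targets are then sampled along the curve instead of read from stored SGD checkpoints; the control point below is a hypothetical stand-in for what mode-connectivity training would fit.

```python
import numpy as np

def bezier_point(theta0, theta1, control, t):
    """Quadratic Bezier curve in weight space:
    B(t) = (1-t)^2 * theta0 + 2(1-t)t * control + t^2 * theta1."""
    return (1 - t) ** 2 * theta0 + 2 * (1 - t) * t * control + t ** 2 * theta1

theta0, theta1 = np.zeros(10), np.ones(10)        # initial / final model states
control = 0.5 * (theta0 + theta1) + 0.1           # hypothetical learned control point
targets = [bezier_point(theta0, theta1, control, t) for t in np.linspace(0, 1, 5)]
```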
[440] Expanding the Action Space of LLMs to Reason Beyond Language
Zhongqi Yue, Weishi Wang, Yundaichuan Zhan, Juncheng Li, Daniel Dahlmeier, Fredrik D. Johansson
Main category: cs.LG
TL;DR: The paper introduces Expanded Action space (ExpA) to decouple environment interactions from language in LLMs, allowing models to switch between language reasoning and external environments, and proposes ExpA Reinforcement Learning (EARL) for effective exploration.
Details
Motivation: Current LLMs are limited to vocabulary tokens for environment interactions, overloading language with both reasoning and control duties while requiring external parsers. This work aims to internalize environment interactions beyond the vocabulary.
Method: Proposed Expanded Action space (ExpA) that allows models to trigger routing actions to switch between language and external environments, and introduced ExpA Reinforcement Learning (EARL) with counterfactual policy optimization for effective exploration.
Result: EARL outperforms vocabulary-constrained baselines on multi-turn interaction and contingent planning tasks, achieves robust performance in calculator-based multi-task learning, and discovers efficient algorithms achieving perfect Sort-4 accuracy in partially observed sorting problems.
Conclusion: Decoupling environment interactions from language through Expanded Action space enables more effective LLM reasoning and control, with EARL demonstrating superior performance in complex interaction tasks while self-discovering efficient algorithms.
Abstract: Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments – such as symbolic operators or simulators – must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model’s language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
[441] Bidirectional Representations Augmented Autoregressive Biological Sequence Generation:Application in De Novo Peptide Sequencing
Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Zhi Jin, ZhiQiang Gao, Nanqing Dong, Siqi Sun
Main category: cs.LG
TL;DR: A hybrid framework that combines autoregressive and non-autoregressive models for biological sequence generation, using cross-decoder attention to integrate bidirectional features into AR generation.
Details
Motivation: AR models fail to capture bidirectional dependencies in biological tasks like peptide sequencing, while NAR models lack generative coherence. A hybrid approach is needed to combine AR stability with NAR's bidirectional awareness.
Method: Shared encoder with two decoders: NAR decoder learns bidirectional features, AR decoder generates sequences using cross-decoder attention to query these features. Training uses importance annealing and gradient blocking for stability.
Result: Substantially outperforms AR and NAR baselines on nine-species peptide sequencing benchmark, harmonizing AR stability with NAR contextual awareness.
Conclusion: The hybrid framework advances biological sequence modeling by enhancing AR models with bidirectional understanding, providing robust performance for complex sequence generation tasks.
Abstract: Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non-Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non-autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross-decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine-species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at https://github.com/BEAM-Labs/denovo.
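The cross-decoder attention can be pictured as standard single-head attention with queries from the AR decoder and keys/values from the NAR decoder's bidirectional features. The projection matrices and shapes below are illustrative, not the paper's exact module.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_decoder_attention(ar_states, nar_states, Wq, Wk, Wv):
    """AR decoder states form the queries; NAR decoder's bidirectional
    features form the keys and values."""
    Q, K, V = ar_states @ Wq, nar_states @ Wk, nar_states @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V                               # bidirectional context per AR step

rng = np.random.default_rng(0)
d = 16
ar, nar = rng.normal(size=(5, d)), rng.normal(size=(12, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
ctx = cross_decoder_attention(ar, nar, Wq, Wk, Wv)    # shape (5, 16)
```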
[442] Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand
Kiattikun Chobtham, Kanoksri Sarinnapakorn, Kritanai Torsri, Prattana Deeprasertkul, Jirawan Kamma
Main category: cs.LG
TL;DR: Physics-informed Graph Neural Networks combined with extreme-value analysis techniques improve rainfall forecasting in Thailand, outperforming baselines and enhancing extreme-event prediction.
Details
Motivation: Accurate rainfall forecasting, especially for extreme events, is challenging in climatology and Earth system science.
Method: Graph Attention Network with LSTM using physics-derived edge features, combined with Spatial Season-aware Generalized Pareto Distribution for extreme-value analysis.
Result: Outperforms established baselines across most regions, improves extreme-event prediction compared to SEAS5 operational system, and provides high-resolution maps for decision-making.
Conclusion: The proposed method offers practical enhancement for long-term water management through improved rainfall forecasting and extreme-event prediction.
Abstract: Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and Earth system science. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from a simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce high-resolution maps that support decision-making in long-term water management.
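The extreme-value component follows the standard Peak-Over-Threshold recipe: keep exceedances over a high threshold and fit a Generalized Pareto Distribution to them. A minimal SciPy sketch on synthetic data (the paper's spatial, season-aware GPD adds structure on top of this backbone):

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
rainfall = rng.gamma(shape=2.0, scale=10.0, size=5000)  # synthetic daily rain

# Peak-Over-Threshold: keep exceedances above a high-quantile threshold.
u = np.quantile(rainfall, 0.95)
excess = rainfall[rainfall > u] - u

# Fit a Generalized Pareto Distribution to the exceedances.
shape, loc, scale = genpareto.fit(excess, floc=0)

# Large-exceedance level: threshold plus the 99th-percentile excess.
level = u + genpareto.ppf(0.99, shape, loc=0, scale=scale)
```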
[443] Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers
Ruben Belo, Marta Guimaraes, Claudia Soares
Main category: cs.LG
TL;DR: CALM is an inference-time method that suppresses harmful concepts in LLMs by modifying latent representations using concept whitening and orthogonal projection, without retraining.
Details
Motivation: Large Language Models are vulnerable to jailbreak attacks that bypass safety guardrails through adversarial prompts, creating a need for effective defense mechanisms.
Method: Uses the concept whitening technique from computer vision combined with orthogonal projection to remove unwanted latent directions associated with harmful content by modifying the last layer’s latent representations.
Result: CALM reduces harmful outputs and outperforms baseline methods in most metrics, with only a small computational overhead at inference.
Conclusion: CALM offers a lightweight approach to AI safety that requires no additional training data or model fine-tuning, providing effective protection against jailbreak attacks.
Abstract: Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Latent Manipulation (CALM), an inference-time method that suppresses harmful concepts by modifying latent representations of the last layer of the model, without retraining. Leveraging the concept whitening technique from Computer Vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods in most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.
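The orthogonal-projection step is simple to state: subtract from each hidden state its component along a direction associated with harmful content. The sketch below shows only this projection; the concept-whitening step that identifies the direction is not reproduced.

```python
import numpy as np

def remove_direction(h, v):
    """Project hidden states h off the unit-normalized direction v:
    h' = h - (h . v) v, so the component along v vanishes."""
    v = v / np.linalg.norm(v)
    return h - np.outer(h @ v, v)

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 8))       # 3 token latents, dim 8
v = rng.normal(size=8)            # direction tied to a harmful concept
h_clean = remove_direction(h, v)
assert np.allclose(h_clean @ (v / np.linalg.norm(v)), 0)  # component removed
```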
[444] Machine Learning-Based Ultrasonic Weld Characterization Using Hierarchical Wave Modeling and Diffusion-Driven Distribution Alignment
Joshua R. Tempelman, Adam J. Wachtor, Eric B. Flynn
Main category: cs.LG
TL;DR: This paper presents an end-to-end machine learning workflow for automated ultrasonic weld inspection that addresses data scarcity and signal corruption through reduced-order modeling, diffusion-based distribution alignment, and U-Net segmentation.
Details
Motivation: Automated ultrasonic weld inspection faces challenges due to limited training data (from complex specimen curation or simulations) and environmental volatility in industrial settings that corrupt measurements, making end-to-end ML workflows elusive.
Method: Proposed workflow includes: 1) reduced-order Helmholtz model based on Lamb wave theory to generate comprehensive dataset, 2) diffusion-based distribution alignment to handle out-of-distribution experimental data, 3) U-Net-based segmentation and inversion models refined through transfer learning from limited 3D simulations.
Result: The integrated framework provides an end-to-end solution for automated weld inspection on real data, handling varying weld heterogeneity, crack defects, and unpredictable noise distributions in experimental LDV scans.
Conclusion: This work successfully addresses key challenges in automated weld inspection by combining reduced-order modeling, diffusion techniques, and deep learning to create a practical end-to-end solution for industrial applications.
Abstract: Automated ultrasonic weld inspection remains a significant challenge in the nondestructive evaluation (NDE) community due to factors such as limited training data (due to the complexity of curating experimental specimens or high-fidelity simulations) and the environmental volatility of many industrial settings (resulting in the corruption of on-the-fly measurements). Thus, an end-to-end machine learning (ML) workflow for acoustic weld inspection in realistic (i.e., industrial) settings has remained an elusive goal. This work addresses the challenges of data curation and signal corruption by proposing a workflow consisting of a reduced-order modeling scheme, diffusion-based distribution alignment, and U-Net-based segmentation and inversion. A reduced-order Helmholtz model based on Lamb wave theory is used to generate a comprehensive dataset over varying weld heterogeneity and crack defects. The relatively inexpensive low-order solutions provide a robust training dataset for inversion models, which are refined through a transfer learning stage using a limited set of full 3D elastodynamic simulations. To handle out-of-distribution (OOD) real-world measurements with varying and unpredictable noise distributions, i.e., Laser Doppler Vibrometry scans, guided diffusion produces in-distribution representations of OOD experimental LDV scans which are subsequently processed by the inversion models. This integrated framework provides an end-to-end solution for automated weld inspection on real data.
[445] When In Doubt, Abstain: The Impact of Abstention on Strategic Classification
Lina Alkarmi, Ziyuan Huang, Mingyan Liu
Main category: cs.LG
TL;DR: This paper studies how classifier abstention (declining decisions when uncertain) affects strategic classification, showing it improves accuracy and deters manipulation without harming principal utility.
Details
Motivation: Algorithmic decision making is vulnerable to strategic manipulation, and prior research showed abstention improves classifier accuracy - this paper explores how abstention impacts strategic agents and optimal use in strategic contexts.
Method: Model the interaction as a Stackelberg game where a principal (classifier) announces decision policy first, then strategic agents manipulate observable features to get desired outcomes, focusing on binary classifiers.
Result: Optimal abstention ensures principal’s utility is no worse than non-abstention settings, even with strategic agents. Abstention also deters manipulation by making it costlier for less qualified agents to achieve positive outcomes.
Conclusion: Abstention is a valuable tool for reducing negative effects of strategic behavior in algorithmic decision making systems, improving accuracy while deterring manipulation.
Abstract: Algorithmic decision making is increasingly prevalent, but often vulnerable to strategic manipulation by agents seeking a favorable outcome. Prior research has shown that classifier abstention (allowing a classifier to decline making a decision due to insufficient confidence) can significantly increase classifier accuracy. This paper studies abstention within a strategic classification context, exploring how its introduction impacts strategic agents’ responses and how principals should optimally leverage it. We model this interaction as a Stackelberg game where a principal, acting as the classifier, first announces its decision policy, and then strategic agents, acting as followers, manipulate their features to receive a desired outcome. Here, we focus on binary classifiers where agents manipulate observable features rather than their true features, and show that optimal abstention ensures that the principal’s utility (or loss) is no worse than in a non-abstention setting, even in the presence of strategic agents. We also show that beyond improving accuracy, abstention can also serve as a deterrent to manipulation, making it costlier for agents, especially those less qualified, to manipulate to achieve a positive outcome when manipulation costs are significant enough to affect agent behavior. These results highlight abstention as a valuable tool for reducing the negative effects of strategic behavior in algorithmic decision making systems.
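Concretely, abstention turns a single decision threshold into a band: scores inside the band receive no decision. A toy version of the rule a principal might announce (thresholds here are arbitrary):

```python
import numpy as np

def classify_with_abstention(scores, t_low, t_high):
    """Binary decision rule with an abstention band: accept above
    t_high, reject below t_low, abstain in between."""
    out = np.full(len(scores), "abstain", dtype=object)
    out[scores >= t_high] = "accept"
    out[scores < t_low] = "reject"
    return out

scores = np.array([0.10, 0.45, 0.55, 0.90])
print(classify_with_abstention(scores, t_low=0.40, t_high=0.70))
# ['reject' 'abstain' 'abstain' 'accept']
```

Raising `t_high` is what makes manipulation costlier: a less qualified agent must move its observable score further to clear the acceptance boundary.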
[446] CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations
Guangyi Chen, Yunlong Deng, Peiyuan Zhu, Yan Li, Yifan Shen, Zijian Li, Kun Zhang
Main category: cs.LG
TL;DR: A new benchmark for Causal Representation Learning (CRL) using high-fidelity simulated visual data that provides both realistic visual complexity and ground-truth causal generating processes.
Details
Motivation: Existing CRL evaluations face a dilemma between realism and evaluative precision, relying on either simplistic synthetic datasets or downstream performance on real-world tasks without ground-truth causal information.
Method: Created a comprehensive dataset with ~200k images and 3M video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. The benchmark provides flexible access to underlying causal structures.
Result: The benchmark enables evaluation of CRL methods across diverse paradigms and provides empirical insights for practitioners to choose appropriate CRL frameworks for specific real-world problems.
Conclusion: This benchmark bridges the gap between rigorous evaluation and real-world applicability in CRL by offering realistic visual data with known ground-truth causal processes across various scenarios from static to dynamic settings.
Abstract: Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Project page: https://causal-verse.github.io/; dataset: https://huggingface.co/CausalVerse.
[447] Incentive-Based Federated Learning: Architectural Elements and Future Directions
Chanuka A. S. Hewa Kaluannakkage, Rajkumar Buyya
Main category: cs.LG
TL;DR: This chapter analyzes incentive mechanisms for federated learning, addressing the participation dilemma where entities may be unwilling to contribute or free-ride on others’ efforts.
Details
Motivation: Federated learning enables collaborative model training while preserving data privacy, but faces practical limitations due to the participation dilemma where entities need incentives to contribute rather than free-ride.
Method: Examines economic and game theory concepts applied to federated learning, along with technology-driven solutions like blockchain and deep reinforcement learning. Presents a comprehensive taxonomy covering both centralized and decentralized architectures.
Result: Developed a framework showing that well-designed incentive mechanisms are essential for practical success of federated learning, with applications in healthcare, smart infrastructure, vehicular networks, and blockchain systems.
Conclusion: Incentive mechanisms are not optional but essential components for sustainable, fair, and robust federated learning ecosystems, though significant challenges remain despite promising solutions.
Abstract: Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical factors, such as the participation dilemma. Participating entities are often unwilling to contribute to a learning system unless they receive some benefits, or they may pretend to participate and free-ride on others. This chapter identifies the fundamental challenges in designing incentive mechanisms for federated learning systems. It examines how foundational concepts from economics and game theory can be applied to federated learning, alongside technology-driven solutions such as blockchain and deep reinforcement learning. This work presents a comprehensive taxonomy that thoroughly covers both centralized and decentralized architectures based on the aforementioned theoretical concepts. Furthermore, the concepts described are presented from an application perspective, covering emerging industrial applications, including healthcare, smart infrastructure, vehicular networks, and blockchain-based decentralized systems. Through this exploration, this chapter demonstrates that well-designed incentive mechanisms are not merely optional features but essential components for the practical success of federated learning. This analysis reveals both the promising solutions that have emerged and the significant challenges that remain in building truly sustainable, fair, and robust federated learning ecosystems.
cs.MA
[448] Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems
Jiaxi Yang, Mengqi Zhang, Yiqiao Jin, Hao Chen, Qingsong Wen, Lu Lin, Yi He, Srijan Kumar, Weijie Xu, James Evans, Jindong Wang
Main category: cs.MA
TL;DR: The paper proposes a paradigm shift toward topology-aware Multi-Agent Systems (MASs) that explicitly model and optimize agent interactions, introducing a three-stage framework for systematic MAS design.
Details
Motivation: Current LLM-based Multi-Agent Systems lack systematic exploration of topology - how agents should be configured, connected, and coordinated - which limits their full potential in complex tasks.
Method: A three-stage framework: 1) agent selection, 2) structure profiling, and 3) topology synthesis to systematically design and optimize MAS topologies.
Result: The framework provides a principled foundation for designing MASs and opens new research frontiers across language modeling, reinforcement learning, graph learning, and generative modeling.
Conclusion: Topology-aware MASs can unleash the full potential of agentic AI in complex real-world applications, with key challenges and opportunities identified for future MAS evaluation.
Abstract: Large Language Model-based Multi-Agent Systems (MASs) have emerged as a powerful paradigm for tackling complex tasks through collaborative intelligence. However, the topology of these systems–how agents in MASs should be configured, connected, and coordinated–remains largely unexplored. In this position paper, we call for a paradigm shift toward \emph{topology-aware MASs} that explicitly model and dynamically optimize the structure of inter-agent interactions. We identify three fundamental components–agents, communication links, and overall topology–that collectively determine the system’s adaptability, efficiency, robustness, and fairness. To operationalize this vision, we introduce a systematic three-stage framework: 1) agent selection, 2) structure profiling, and 3) topology synthesis. This framework not only provides a principled foundation for designing MASs but also opens new research frontiers across language modeling, reinforcement learning, graph learning, and generative modeling to ultimately unleash their full potential in complex real-world applications. We conclude by outlining key challenges and opportunities in MASs evaluation. We hope our framework and perspectives offer critical new insights in the era of agentic AI.
[449] Bayesian Ego-graph inference for Networked Multi-Agent Reinforcement Learning
Wei Duan, Jie Lu, Junyu Xuan
Main category: cs.MA
TL;DR: BayesG is a decentralized MARL framework that learns dynamic communication graphs via Bayesian variational inference, enabling agents to adaptively sample sparse interaction structures for improved scalability and performance in large-scale environments.
Details
Motivation: Existing Networked-MARL methods assume static communication neighborhoods, which limits adaptability to dynamic environments. Centralized approaches that learn dynamic graphs require global state access, making them impractical for real-world decentralized systems.
Method: Proposes a stochastic graph-based policy where agents condition decisions on sampled subgraphs. Uses Bayesian variational inference to learn sparse, context-aware interaction structures via latent communication masks. Trains variational distribution end-to-end with policy using ELBO objective.
Result: BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.
Conclusion: The proposed decentralized actor-framework successfully enables agents to jointly learn both interaction topology and decision-making strategies, providing an effective solution for dynamic communication in networked multi-agent systems.
Abstract: In networked multi-agent reinforcement learning (Networked-MARL), decentralized agents must act under local observability and constrained communication over fixed physical graphs. Existing methods often assume static neighborhoods, limiting adaptability to dynamic or heterogeneous environments. While centralized frameworks can learn dynamic graphs, their reliance on global state access and centralized infrastructure is impractical in real-world decentralized systems. We propose a stochastic graph-based policy for Networked-MARL, where each agent conditions its decision on a sampled subgraph over its local physical neighborhood. Building on this formulation, we introduce BayesG, a decentralized actor-framework that learns sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over an ego-graph and samples a latent communication mask to guide message passing and policy computation. The variational distribution is trained end-to-end alongside the policy using an evidence lower bound (ELBO) objective, enabling agents to jointly learn both interaction topology and decision-making strategies. BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.
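The key mechanical idea, sampling a differentiable communication mask over an agent's ego-graph, can be sketched with a relaxed Bernoulli distribution so the ELBO can be optimized end-to-end with the policy. The variational family and dimensions below are placeholders for the paper's formulation.

```python
import torch
from torch.distributions import RelaxedBernoulli

# One agent with 6 physical neighbors in its ego-graph.
edge_logits = torch.zeros(6, requires_grad=True)     # variational parameters
mask = RelaxedBernoulli(temperature=torch.tensor(0.5),
                        logits=edge_logits).rsample()  # differentiable soft mask
neighbor_msgs = torch.randn(6, 16)                   # messages from neighbors
aggregated = (mask.unsqueeze(-1) * neighbor_msgs).sum(dim=0)  # masked message passing
```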
cs.MM
[450] Game mechanics for cyber-harm awareness in the metaverse
Sophie McKenzie, Jeb Webb, Robin Doss
Main category: cs.MM
TL;DR: CyberNinjas VR experience was developed to educate children aged 8-16 on safe metaverse behaviors using game mechanics to foster cyber-safe behaviors and provide referral steps for harmful interactions.
Details
Motivation: The metaverse amplifies cyber harm risks for young users, making online safety education essential as immersive technologies reshape children's interactions and experiences.
Method: Developed CyberNinjas VR experience that uses game mechanics to educate children about safe metaverse behaviors and provides clear referral steps for harmful interactions.
Result: The project analyzes CyberNinjas to understand how game mechanics can foster cyber-safe behaviors, which will inform the design of future VR environments prioritizing safety and inclusivity.
Conclusion: Understanding user engagement in metaverse gaming through projects like CyberNinjas will aid in designing safer and more inclusive VR environments for young users.
Abstract: Educating children and young people to be safe online is essential, especially as the metaverse, a next-generation internet blending immersive technologies, promises to reshape their interactions and amplify their experiences. While virtual reality offers fully immersive, highly interactive, and multi-sensory engagement, it also heightens cyber harm risks for young or vulnerable users. To address this, the CyberNinjas VR experience was developed to educate children aged 8 to 16 on safe metaverse behaviours, providing clear referral steps for harmful interactions. Understanding user engagement in metaverse gaming will aid the design of future VR environments which prioritize safety and inclusivity. This project analyses CyberNinjas to understand how game mechanics can foster cyber-safe behaviours.
eess.AS
[451] LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models
Xiaohan Zhao, Hongyu Xiang, Shengze Ye, Song Li, Zhengkun Tian, Guanyu Chen, Ke Ding, Guanglu Wan
Main category: eess.AS
TL;DR: LongCat-Audio-Codec is an industrial-grade audio tokenizer/detokenizer for speech LLMs that achieves ultra-low frame rate encoding (16.67 Hz) with bitrates from 0.43-0.87 kbps while maintaining high speech quality.
Details
Motivation: To develop an efficient audio codec solution for end-to-end speech large language models that balances coding efficiency with decoding quality for industrial applications.
Method: Uses a decoupled model architecture with a multistage training strategy, enabling robust semantic modeling, flexible acoustic feature extraction, and low-latency streaming synthesis.
Result: Achieves strong speech intelligibility and high-quality synthesis at ultra-low bitrates (0.43-0.87 kbps) with 16.67 Hz frame rate, effectively balancing coding efficiency and quality.
Conclusion: LongCat-Audio-Codec provides an industrial-grade solution for speech LLMs with excellent efficiency-quality trade-off, making it suitable for practical applications requiring low-bitrate, high-quality audio processing.
Abstract: This paper presents LongCat-Audio-Codec, an audio tokenizer and detokenizer solution designed for industrial-grade end-to-end speech large language models. By leveraging a decoupled model architecture and a multistage training strategy, LongCat-Audio-Codec exhibits robust semantic modeling capabilities, flexible acoustic feature extraction capabilities, and low-latency streaming synthesis capabilities. It encodes speech at an ultra-low frame rate of 16.67 Hz, with a minimum bitrate of 0.43 kbps and a maximum bitrate of 0.87 kbps. Evaluation results demonstrate that LongCat-Audio-Codec achieves strong speech intelligibility and is capable of synthesizing high-quality speech at low bitrate, thus effectively balancing coding efficiency and decoding quality. The inference code and model checkpoints of LongCat-Audio-Codec are available at: https://github.com/meituan-longcat/LongCat-Audio-Codec.
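The headline numbers imply a very small token budget per frame; dividing bitrate by frame rate makes this concrete:

```python
frame_rate_hz = 16.67                     # frames (token groups) per second
for kbps in (0.43, 0.87):
    bits_per_frame = kbps * 1000 / frame_rate_hz
    print(f"{kbps} kbps -> {bits_per_frame:.1f} bits per frame")
# 0.43 kbps -> 25.8 bits per frame
# 0.87 kbps -> 52.2 bits per frame
```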
[452] LDCodec: A high quality neural audio codec with low-complexity decoder
Jiawei Jiang, Linping Xu, Dejun Zhang, Qingbo Huang, Xianjun Xia, Yijian Xiao
Main category: eess.AS
TL;DR: LDCodec is a neural audio codec with low-complexity decoder that achieves better quality than Opus at half the bitrate (6kbps vs 12kbps) through novel residual units, LSRVQ quantization, and specialized discriminators.
Details
Motivation: Neural audio coding outperforms classical methods at low bitrates but faces practical limitations due to high complexity, especially for on-demand streaming on devices like smartphones.
Method: Developed LDCodec using novel residual units with Long-term and Short-term Residual Vector Quantization (LSRVQ), subband-fullband frequency discriminators, and perceptual loss functions to reduce decoder complexity while maintaining quality.
Result: LDCodec at 6kbps outperforms Opus at 12kbps in both subjective and objective tests, demonstrating superior audio quality at half the bitrate.
Conclusion: The proposed LDCodec successfully addresses the complexity challenge of neural audio codecs while achieving state-of-the-art performance, making it suitable for practical applications like mobile streaming.
Abstract: Neural audio coding has been shown to outperform classical audio coding at extremely low bitrates. However, the practical application of neural audio codecs is still limited by their elevated complexity. To address this challenge, we have developed a high-quality neural audio codec with a low-complexity decoder, named LDCodec (Low-complexity Decoder Neural Audio Codec), specifically designed for on-demand streaming media clients, such as smartphones. Specifically, we introduced a novel residual unit combined with Long-term and Short-term Residual Vector Quantization (LSRVQ), subband-fullband frequency discriminators, and perceptual loss functions. This combination results in high-quality audio reconstruction with lower complexity. Both our subjective and objective tests demonstrated that our proposed LDCodec at 6kbps outperforms Opus at 12kbps.
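The residual vector quantization underlying LSRVQ can be sketched in a few lines: each stage quantizes the residual left by the previous stages. The long-term/short-term split that LSRVQ adds on top of this basic loop is not shown.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: stage k quantizes the residual
    left over by stages 1..k-1."""
    residual, indices = x.copy(), []
    for cb in codebooks:                          # cb: (num_codes, dim)
        idx = np.argmin(((residual - cb) ** 2).sum(axis=1))
        indices.append(int(idx))
        residual -= cb[idx]
    return indices, x - residual                  # code indices + quantized vector

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(3)]  # 3 stages, 8-dim
ids, xq = rvq_encode(rng.normal(size=8), codebooks)
```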
[453] DroneAudioset: An Audio Dataset for Drone-based Search and Rescue
Chitralekha Gupta, Soundarya Ramesh, Praveen Sasikumar, Kian Peen Yeo, Suranga Nanayakkara
Main category: eess.AS
TL;DR: DroneAudioset is a comprehensive drone audition dataset with 23.5 hours of annotated recordings to address the limitations of vision-based human detection and enable development of noise suppression and classification methods for drone-based audio perception.
Details
Motivation: Existing drone systems rely on vision-based methods that fail under low-visibility or occlusion, while audio perception suffers from extreme ego-noise masking human sounds. Current datasets lack diversity, real acoustic interactions, and standardized setups.
Method: Created DroneAudioset dataset featuring 23.5 hours of annotated recordings across various SNRs (-57.2 dB to -2.5 dB), drone types, throttle settings, microphone configurations, and environments to enable systematic evaluation of noise suppression and classification methods.
Result: The dataset provides comprehensive real-world acoustic data that enables development of drone noise-aware audio processing and informs practical design considerations such as microphone placement trade-offs for drone audition systems.
Conclusion: DroneAudioset represents an important step towards enabling the design and deployment of effective drone-audition systems by providing standardized, diverse, and realistic acoustic data for human-presence detection under challenging conditions.
Abstract: Unmanned Aerial Vehicles (UAVs) or drones, are increasingly used in search and rescue missions to detect human presence. Existing systems primarily leverage vision-based methods which are prone to fail under low-visibility or occlusion. Drone-based audio perception offers promise but suffers from extreme ego-noise that masks sounds indicating human presence. Existing datasets are either limited in diversity or synthetic, lacking real acoustic interactions, and there are no standardized setups for drone audition. To this end, we present DroneAudioset (The dataset is publicly available at https://huggingface.co/datasets/ahlab-drone-project/DroneAudioSet/ under the MIT license), a comprehensive drone audition dataset featuring 23.5 hours of annotated recordings, covering a wide range of signal-to-noise ratios (SNRs) from -57.2 dB to -2.5 dB, across various drone types, throttles, microphone configurations as well as environments. The dataset enables development and systematic evaluation of noise suppression and classification methods for human-presence detection under challenging conditions, while also informing practical design considerations for drone audition systems, such as microphone placement trade-offs, and development of drone noise-aware audio processing. This dataset is an important step towards enabling design and deployment of drone-audition systems.
[454] Quantization-Based Score Calibration for Few-Shot Keyword Spotting with Dynamic Time Warping in Noisy Environments
Kevin Wilkinghoff, Alessia Cornaggia-Urrigshardt, Zheng-Hua Tan
Main category: eess.AS
TL;DR: Proposes a score calibration method for keyword spotting that uses embedding quantization and score normalization to improve threshold selection in noisy, few-shot scenarios.
Details
Motivation: Traditional threshold selection for keyword spotting often leads to suboptimal performance on unseen data, especially in noisy environments or few-shot settings.
Method: Two-step score calibration: quantizing embeddings and normalizing detection scores using quantization error before thresholding.
Result: Experiments on KWS-DailyTalk with simulated radio channels show simplified threshold choice and significant performance improvement.
Conclusion: The proposed calibration approach effectively mitigates performance degradation from suboptimal thresholds in template-based open-set few-shot keyword spotting.
Abstract: Detecting occurrences of keywords with keyword spotting (KWS) systems requires thresholding continuous detection scores. Selecting appropriate thresholds is a non-trivial task, typically relying on optimizing the performance on a validation dataset. However, such greedy threshold selection often leads to suboptimal performance on unseen data, particularly in varying or noisy acoustic environments or few-shot settings. In this work, we investigate detection threshold estimation for template-based open-set few-shot KWS using dynamic time warping on noisy speech data. To mitigate the performance degradation caused by suboptimal thresholds, we propose a score calibration approach consisting of two different steps: quantizing embeddings and normalizing detection scores using the quantization error prior to thresholding. Experiments on KWS-DailyTalk with simulated high frequency radio channels show that the proposed calibration approach simplifies the choice of detection thresholds and significantly improves the resulting performance.
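A loose sketch of the two calibration steps as described: quantize the embedding against a codebook, then normalize the detection score by the quantization error. The specific normalization below (simple division) and the codebook are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def calibrate_score(embedding, codebook, raw_score):
    """Quantization-based calibration sketch: the distance to the
    nearest codebook entry (quantization error) rescales the raw
    DTW detection score before thresholding."""
    err = np.linalg.norm(embedding - codebook, axis=1).min()
    return raw_score / (err + 1e-8)   # hypothetical normalization

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 32))  # 64 quantized prototypes
emb = rng.normal(size=32)
print(calibrate_score(emb, codebook, raw_score=0.7))
```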
[455] Towards Blind Data Cleaning: A Case Study in Music Source Separation
Azalea Gui, Woosung Choi, Junghyun Koo, Kazuki Shimada, Takashi Shibuya, Joan Serrà, Wei-Hsiang Liao, Yuki Mitsufuji
Main category: eess.AS
TL;DR: Proposes two noise-agnostic data cleaning methods for music source separation: data attribution via unlearning and Fréchet Audio Distance filtering, which outperform contaminated baseline and close 66.7% of performance gap without requiring specific noise knowledge.
Details
Motivation: Deep learning models for music source separation depend on training data quality, but datasets often contain difficult-to-detect artifacts like audio bleeding and label noise. Since contamination types and extent are typically unknown, targeted cleaning methods are impractical.
Method: Two noise-agnostic approaches: (1) data attribution via unlearning to identify and filter training samples contributing least to clean outputs, (2) Fréchet Audio Distance to measure and remove samples perceptually dissimilar to a small trusted clean reference set.
Result: On contaminated dataset with simulated real-world noise, unlearning-based methods produced cleaned dataset and model that outperforms both original contaminated data and small clean reference set, closing approximately 66.7% of performance gap between contaminated baseline and clean dataset model.
Conclusion: The proposed noise-agnostic approaches offer a more generic and broadly applicable solution for curating high-quality training data compared to methods tailored for specific artifacts.
Abstract: The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and extent of contamination are typically unknown, cleaning methods targeting specific corruptions are often impractical. This paper proposes and evaluates two distinct, noise-agnostic data cleaning methods to address this challenge. The first approach uses data attribution via unlearning to identify and filter out training samples that contribute the least to producing clean outputs. The second leverages the Fréchet Audio Distance to measure and remove samples that are perceptually dissimilar to a small and trusted clean reference set. On a dataset contaminated with a simulated distribution of real-world noise, our unlearning-based methods produced a cleaned dataset and a corresponding model that outperforms both the original contaminated data and the small clean reference set used for cleaning. This result closes approximately 66.7% of the performance gap between the contaminated baseline and a model trained on the same dataset without any contamination. Unlike methods tailored for specific artifacts, our noise-agnostic approaches offer a more generic and broadly applicable solution for curating high-quality training data.
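The Fréchet Audio Distance used in the second method reduces to the Fréchet distance between Gaussians fit to two embedding sets. A compact SciPy sketch (embedding model and dimensions are placeholders):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fit to two embedding sets:
    ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))."""
    mu_a, mu_b = emb_a.mean(0), emb_b.mean(0)
    s_a = np.cov(emb_a, rowvar=False)
    s_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(s_a @ s_b).real
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(s_a + s_b - 2 * covmean))

rng = np.random.default_rng(0)
clean_ref = rng.normal(0.0, 1.0, size=(500, 16))     # trusted clean-set embeddings
candidate = rng.normal(0.5, 1.0, size=(500, 16))     # training-sample embeddings
print(frechet_audio_distance(candidate, clean_ref))  # large -> perceptually far
```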
[456] MC-LExt: Multi-Channel Target Speaker Extraction with Onset-Prompted Speaker Conditioning Mechanism
Tongtao Ling, Shulin He, Pengjie Shen, Zhong-Qiu Wang
Main category: eess.AS
TL;DR: MC-LExt is a multi-channel target speaker extraction framework that prepends short enrollment utterances to guide extraction, achieving state-of-the-art performance on noisy-reverberant benchmarks.
Details
Motivation: Existing MC-TSE methods have limitations: DOA-based approaches depend on explicit direction estimation and are sensitive to microphone geometry, while speaker embedding methods model identity implicitly and degrade in noisy-reverberant conditions.
Method: Prepend a short enrollment utterance of the target speaker to each channel of the multi-channel mixture, providing onset-prompted conditioning that allows DNNs to learn spatial and speaker identity cues jointly in an end-to-end manner.
Result: Experiments on noisy-reverberant benchmarks WHAMR! and MC-Libri2Mix demonstrate the effectiveness of the proposed MC-LExt framework.
Conclusion: MC-LExt provides a simple but highly effective framework for multi-channel target speaker extraction that overcomes limitations of existing approaches by using enrollment utterances as conditioning signals.
Abstract: Multi-channel target speaker extraction (MC-TSE) aims to extract a target speaker’s voice from multi-speaker signals captured by multiple microphones. Existing methods often rely on auxiliary clues such as direction-of-arrival (DOA) or speaker embeddings. However, DOA-based approaches depend on explicit direction estimation and are sensitive to microphone array geometry, while methods based on speaker embeddings model speaker identity in an implicit manner and may degrade in noisy-reverberant conditions. To address these limitations, we propose multi-channel listen to extract (MC-LExt), a simple but highly-effective framework for MC-TSE. Our key idea is to prepend a short enrollment utterance of the target speaker to each channel of the multi-channel mixture, providing an onset-prompted conditioning signal that can guide TSE. This design allows the deep neural network (DNN) to learn spatial and speaker identity cues jointly in a fully end-to-end manner. Experiments on noisy-reverberant benchmarks, including WHAMR! and MC-Libri2Mix, demonstrate the effectiveness of MC-LExt.
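The conditioning mechanism is literally concatenation in time: the same enrollment utterance is prepended to every microphone channel before the mixture is fed to the network. Shapes and sample rates below are illustrative.

```python
import numpy as np

def prepend_enrollment(mixture, enrollment):
    """Onset-prompted conditioning: the target speaker's enrollment
    utterance is prepended to every microphone channel of the mixture."""
    C, _ = mixture.shape                        # (channels, samples)
    prompt = np.tile(enrollment, (C, 1))        # same cue on each channel
    return np.concatenate([prompt, mixture], axis=1)

rng = np.random.default_rng(0)
mix = rng.normal(size=(4, 16000))               # 4-channel, 1 s at 16 kHz
enroll = rng.normal(size=3200)                  # 0.2 s enrollment cue
model_input = prepend_enrollment(mix, enroll)   # (4, 19200)
```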
[457] Magnitude and Phase-based Feature Fusion Using Co-attention Mechanism for Speaker recognition
Rongfeng Su, Mengjie Du, Xiaokang Liu, Lan Wang, Nan Yan
Main category: eess.AS
TL;DR: A feature-level fusion framework using co-attention mechanism that dynamically weights magnitude and phase domain features for improved speaker recognition performance.
Details
Motivation: Traditional feature-level fusion methods ignore the unique contributions of speaker semantics in magnitude and phase domains, leading to suboptimal performance.
Method: Two separate sub-networks for the magnitude and phase domains, with intermediate high-level outputs fused using a co-attention mechanism before the pooling layer. A correlation matrix dynamically scales contributions based on different pronunciations.
Result: Achieved 97.20% Top-1 accuracy on VoxCeleb, outperforming state-of-the-art by 0.82% absolute improvement, and 0.45% EER reduction compared to single FBank feature system.
Conclusion: The co-attention based feature-level fusion effectively leverages complementary information from magnitude and phase domains, significantly improving speaker recognition performance.
Abstract: Phase-based features related to vocal source characteristics can be incorporated into magnitude-based speaker recognition systems to improve system performance. However, traditional feature-level fusion methods typically ignore the unique contributions of speaker semantics in the magnitude and phase domains. To address this issue, this paper proposes a feature-level fusion framework using the co-attention mechanism for speaker recognition. The framework consists of two separate sub-networks for the magnitude and phase domains, respectively. The intermediate high-level outputs of both domains are then fused by the co-attention mechanism before a pooling layer. A correlation matrix from the co-attention module re-assigns the weights, dynamically scaling the contributions of the magnitude and phase domains according to different pronunciations. Experiments on VoxCeleb showed that the proposed feature-level fusion strategy using the co-attention mechanism achieved a Top-1 accuracy of 97.20%, outperforming the state-of-the-art system by 0.82% absolute, and obtained an EER reduction of 0.45% compared to a single-feature system using FBank.
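A hedged sketch of the fusion step, under the assumption that co-attention is realized as a scaled correlation matrix between projected magnitude and phase frames (the paper's exact module may differ):

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Correlation-matrix co-attention between magnitude and phase streams,
    fused before the pooling layer (an illustrative design)."""
    def __init__(self, dim):
        super().__init__()
        self.proj_m = nn.Linear(dim, dim)
        self.proj_p = nn.Linear(dim, dim)

    def forward(self, mag, pha):                  # both: (batch, frames, dim)
        corr = torch.bmm(self.proj_m(mag), self.proj_p(pha).transpose(1, 2))
        corr = corr / mag.shape[-1] ** 0.5        # scaled correlation matrix
        mag_att = torch.softmax(corr, dim=2) @ pha                  # phase-informed
        pha_att = torch.softmax(corr.transpose(1, 2), dim=2) @ mag  # magnitude-informed
        return torch.cat([mag + mag_att, pha + pha_att], dim=-1)    # (batch, frames, 2*dim)

fusion = CoAttentionFusion(256)
fused = fusion(torch.randn(2, 100, 256), torch.randn(2, 100, 256))  # (2, 100, 512)
```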
eess.IV
[458] Symmetric Entropy-Constrained Video Coding for Machines
Yuxiao Sun, Yao Zhao, Meiqin Liu, Chao Yao, Jian Jin, Weisi Lin
Main category: eess.IV
TL;DR: SEC-VCM is a symmetric entropy-constrained video coding framework for machines that establishes alignment between video codecs and visual backbones, achieving significant bitrate savings for multiple video understanding tasks.
Details
Motivation: Existing VCM methods bind codecs to specific downstream models, requiring retraining and limiting generalization. Current unified frameworks don't sufficiently explore direct links between video coding and understanding under visual foundation model guidance.
Method: Proposes symmetric alignment between the video codec and visual backbone using a bi-directional entropy-constraint mechanism to suppress conditional entropy, and a semantic-pixel dual-path fusion module to inject pixel-level priors and suppress artifacts.
Result: Achieves state-of-the-art rate-task performance with significant bitrate savings: 37.41% on video instance segmentation, 29.83% on video object segmentation, 46.22% on object detection, and 44.94% on multiple object tracking compared to VTM.
Conclusion: The framework successfully establishes symmetric alignment between coding and understanding, enabling explicit handling of semantic information beneficial for machine vision systems while discarding irrelevant information.
Abstract: As video transmission increasingly serves machine vision systems (MVS) instead of human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, requiring retraining or supervised data and thus limiting generalization in multi-task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly utilize VB/VFM to maintain semantic consistency or suppress non-semantic information, but seldom explore how to directly link video coding with understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy-Constrained Video Coding framework for Machines (SEC-VCM). It establishes a symmetric alignment between the video codec and VB, allowing the codec to leverage VB’s representation capabilities to preserve semantics and discard MVS-irrelevant information. Specifically, a bi-directional entropy-constraint (BiEC) mechanism ensures symmetry between the process of video decoding and VB encoding by suppressing conditional entropy. This helps the codec to explicitly handle semantic information beneficial for MVS while squeezing useless information. Furthermore, a semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction. Through semantic-pixel fusion, it suppresses artifacts harmful to MVS and improves machine-oriented reconstruction quality. Experimental results show our framework achieves state-of-the-art (SOTA) in rate-task performance, with significant bitrate savings over VTM on video instance segmentation (37.41%), video object segmentation (29.83%), object detection (46.22%), and multiple object tracking (44.94%). We will release our code.
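Since the summary does not spell out the BiEC loss, the following is only a rough sketch of one way a bi-directional entropy constraint could be approximated, using symmetric prediction errors between decoded features and visual-backbone features; the predictor networks are assumptions, not the paper's design.

```python
import torch
import torch.nn.functional as F

def biec_loss(z_dec, z_vb, pred_d2v, pred_v2d):
    """Rough proxy for a bi-directional entropy constraint: if decoded-video
    features predict visual-backbone features well (and vice versa), the
    conditional uncertainty between the two representations is low.
    pred_d2v / pred_v2d are small learned networks (assumptions)."""
    loss_d2v = F.mse_loss(pred_d2v(z_dec), z_vb.detach())
    loss_v2d = F.mse_loss(pred_v2d(z_vb), z_dec.detach())
    return loss_d2v + loss_v2d

# Hypothetical usage with 1x1-conv predictors over feature maps.
pred_a = torch.nn.Conv2d(64, 64, kernel_size=1)
pred_b = torch.nn.Conv2d(64, 64, kernel_size=1)
loss = biec_loss(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32), pred_a, pred_b)
```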
[459] Confidence-Weighted Semi-Supervised Learning for Skin Lesion Segmentation Using Hybrid CNN-Transformer Networks
Saqib Qamar
Main category: eess.IV
TL;DR: MIRA-U is a semi-supervised framework for skin lesion segmentation that combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture, achieving state-of-the-art performance with limited labeled data.
Details
Motivation: Automated skin lesion segmentation is crucial for early skin cancer detection but faces challenges due to limited annotated training data, requiring effective semi-supervised approaches.
Method: Uses a teacher network pre-trained via masked image modeling to generate confidence-weighted soft pseudo-labels, guiding a U-shaped CNN-Transformer student network with cross-attention skip connections.
Result: Achieves superior performance with Dice Similarity Coefficient (DSC) of 0.9153 and Intersection over Union (IoU) of 0.8552 using only 50% labeled data on ISIC-2016 and PH2 datasets.
Conclusion: The proposed framework effectively addresses limited annotation challenges in skin lesion segmentation through uncertainty-aware pseudo-labeling and hybrid architecture design, outperforming existing methods.
Abstract: Automated skin lesion segmentation through dermoscopic analysis is essential for early skin cancer detection, yet remains challenging due to limited annotated training data. We present MIRA-U, a semi-supervised framework that combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture. Our approach employs a teacher network pre-trained via masked image modeling to generate confidence-weighted soft pseudo-labels, which guide a U-shaped CNN-Transformer student network featuring cross-attention skip connections. This design enhances pseudo-label quality and boundary delineation, surpassing reconstruction-based and CNN-only baselines, particularly in low-annotation regimes. Extensive evaluation on ISIC-2016 and PH2 datasets demonstrates superior performance, achieving a Dice Similarity Coefficient (DSC) of 0.9153 and Intersection over Union (IoU) of 0.8552 using only 50% labeled data. Code is publicly available on GitHub.
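A minimal sketch of confidence-weighted pseudo-label supervision as described above; the specific weighting (distance of the teacher probability from 0.5) is an illustrative assumption, not MIRA-U's released implementation.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(student_logits, teacher_probs):
    """Weight the per-pixel pseudo-label loss by teacher confidence, so that
    uncertain teacher predictions (probabilities near 0.5) contribute less."""
    confidence = (teacher_probs - 0.5).abs() * 2.0            # in [0, 1]
    bce = F.binary_cross_entropy_with_logits(
        student_logits, teacher_probs, reduction="none")
    return (confidence * bce).mean()

student_logits = torch.randn(4, 1, 128, 128)             # student output
soft_pl = torch.sigmoid(torch.randn(4, 1, 128, 128))     # teacher soft pseudo-labels
loss = confidence_weighted_loss(student_logits, soft_pl)
```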
[460] SANR: Scene-Aware Neural Representation for Light Field Image Compression with Rate-Distortion Optimization
Gai Zhang, Xinfeng Zhang, Lv Tang, Hongyu An, Li Zhang, Qingming Huang
Main category: eess.IV
TL;DR: SANR is a scene-aware neural representation framework for light field image compression that introduces hierarchical scene modeling and end-to-end rate-distortion optimization, achieving 65.62% BD-rate savings against HEVC.
Details
Motivation: Light field images have high-dimensional data that creates compression challenges. Existing neural representation methods neglect explicit scene structure modeling and lack end-to-end rate-distortion optimization, limiting compression efficiency.
Method: Proposes SANR with hierarchical scene modeling using multi-scale latent codes to capture scene structures, and incorporates entropy-constrained quantization-aware training for end-to-end rate-distortion optimization.
Result: Extensive experiments show SANR significantly outperforms state-of-the-art methods in rate-distortion performance with 65.62% BD-rate saving against HEVC.
Conclusion: SANR successfully addresses limitations of previous methods by combining scene-aware modeling with end-to-end rate-distortion optimization, achieving superior light field image compression performance.
Abstract: Light field images capture multi-view scene information and play a crucial role in 3D scene reconstruction. However, their high-dimensional nature results in enormous data volumes, posing a significant challenge for efficient compression in practical storage and transmission scenarios. Although neural representation-based methods have shown promise in light field image compression, most approaches rely on direct coordinate-to-pixel mapping through implicit neural representation (INR), often neglecting the explicit modeling of scene structure. Moreover, they typically lack end-to-end rate-distortion optimization, limiting their compression efficiency. To address these limitations, we propose SANR, a Scene-Aware Neural Representation framework for light field image compression with end-to-end rate-distortion optimization. For scene awareness, SANR introduces a hierarchical scene modeling block that leverages multi-scale latent codes to capture intrinsic scene structures, thereby reducing the information gap between INR input coordinates and the target light field image. From a compression perspective, SANR is the first to incorporate entropy-constrained quantization-aware training (QAT) into neural representation-based light field image compression, enabling end-to-end rate-distortion optimization. Extensive experimental results demonstrate that SANR significantly outperforms state-of-the-art techniques in rate-distortion performance, with a 65.62% BD-rate saving against HEVC.
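To illustrate the entropy-constrained QAT idea, here is a hedged sketch of a rate-distortion training loss with straight-through quantization; the `entropy_model` interface (returning per-element likelihoods) and the trade-off weight `lam` are assumptions for illustration.

```python
import torch

def ste_quantize(latent):
    """Straight-through quantization: round in the forward pass,
    pass gradients through unchanged in the backward pass."""
    return latent + (torch.round(latent) - latent).detach()

def rate_distortion_loss(x, x_hat, latent, entropy_model, lam=0.01):
    """Distortion plus estimated bitrate from a learned entropy model.
    `entropy_model` maps quantized latents to per-element likelihoods
    (an assumed interface, common in learned compression)."""
    distortion = torch.mean((x - x_hat) ** 2)
    likelihoods = entropy_model(ste_quantize(latent))
    rate = -torch.log2(likelihoods).sum() / x.numel()   # bits per pixel
    return distortion + lam * rate
```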
[461] A Cross-Framework Study of Temporal Information Buffering Strategies for Learned Video Compression
Kuan-Wei Ho, Yi-Hsin Chen, Martin Benjak, Jörn Ostermann, Wen-Hsiao Peng
Main category: eess.IV
TL;DR: Systematic evaluation of explicit, implicit, and hybrid buffering methods across four inter-frame coding frameworks (residual coding, conditional coding, conditional residual coding, masked conditional residual coding) for learned video codecs.
Details
Motivation: Recent learned video codecs show remarkable compression efficiency, yet the field lacks a comprehensive study covering all combinations of inter-frame coding frameworks and temporal propagation strategies.
Method: Systematically evaluate the impact of explicit, implicit, and hybrid buffering on coding performance across four inter-frame coding frameworks under a unified experimental setup.
Result: The study provides thorough understanding of the effectiveness of different buffering strategies across various inter-frame coding frameworks.
Conclusion: Comprehensive analysis reveals the performance characteristics of different temporal propagation strategies when combined with various inter-frame coding frameworks in learned video compression.
Abstract: Recent advances in learned video codecs have demonstrated remarkable compression efficiency. Two fundamental design aspects are critical: the choice of inter-frame coding framework and the temporal information propagation strategy. Inter-frame coding frameworks include residual coding, conditional coding, conditional residual coding, and masked conditional residual coding, each with distinct mechanisms for utilizing temporal predictions. Temporal propagation methods can be categorized as explicit, implicit, or hybrid buffering, differing in how past decoded information is stored and used. However, a comprehensive study covering all possible combinations is still lacking. This work systematically evaluates the impact of explicit, implicit, and hybrid buffering on coding performance across four inter-frame coding frameworks under a unified experimental setup, providing a thorough understanding of their effectiveness.
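The three buffering strategies can be caricatured in a few lines. The sketch below only illustrates what is stored and propagated in each case; real codecs buffer richer state (motion fields, multi-scale features), so treat this as a conceptual aid rather than any of the evaluated designs.

```python
import torch

class TemporalBuffer:
    """What each strategy stores between frames:
    - "explicit": previously decoded frames (pixels)
    - "implicit": previously propagated latent features
    - "hybrid":   both
    """
    def __init__(self, mode="hybrid"):
        assert mode in ("explicit", "implicit", "hybrid")
        self.mode = mode
        self.frame, self.feature = None, None

    def update(self, decoded_frame, latent_feature):
        if self.mode in ("explicit", "hybrid"):
            self.frame = decoded_frame.detach()
        if self.mode in ("implicit", "hybrid"):
            self.feature = latent_feature.detach()

    def reference(self):
        """Temporal context handed to the next frame's coder."""
        return {"frame": self.frame, "feature": self.feature}

buf = TemporalBuffer("implicit")
buf.update(torch.zeros(1, 3, 64, 64), torch.zeros(1, 128, 16, 16))
```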
[462] Efficient reconstruction and denoising of cryo-ET data with end-to-end localized deep learning
Vinith Kishore, Valentin Debarnot, AmirEhsan Khorashadizadeh, Ricardo D. Righetto, Benjamin D. Engel, Ivan Dokmanić
Main category: eess.IV
TL;DR: CryoLithe is an end-to-end deep learning network that directly reconstructs 3D volumes from cryo-electron tomography tilt-series, achieving denoising and missing wedge correction comparable to state-of-the-art methods while being two orders of magnitude faster and more memory-efficient.
Details
Motivation: Current self-supervised deep learning approaches for cryo-ET reconstruction are slow (taking dozens of hours) and memory-intensive, despite significantly improving reconstruction quality over traditional iterative algorithms.
Method: Implements a local, memory-efficient reconstruction network that leverages transform-domain locality to directly estimate volumes from aligned tilt-series, enabling effective supervised training and strong results on real data without retraining or fine-tuning.
Result: CryoLithe achieves denoising and missing wedge correction comparable or better than state-of-the-art methods (Cryo-CARE, IsoNet, DeepDeWedge) while being two orders of magnitude faster and more memory-efficient.
Conclusion: CryoLithe facilitates downstream cryo-ET analysis including segmentation and subtomogram averaging, and is robust to distribution shifts, making it effective for real data applications without requiring retraining.
Abstract: Cryo-electron tomography (cryo-ET) enables 3D visualization of cellular structures. Accurate reconstruction of high-resolution volumes is complicated by the very low signal-to-noise ratio and a restricted range of sample tilts. Recent self-supervised deep learning approaches, which post-process initial reconstructions by filtered backprojection (FBP), have significantly improved reconstruction quality with respect to iterative signal-processing algorithms, but they are slow, taking an expert dozens of hours to reconstruct a tomogram, and demand large amounts of memory. We present CryoLithe, an end-to-end network that directly estimates the volume from an aligned tilt-series. CryoLithe achieves denoising and missing wedge correction comparable to or better than state-of-the-art self-supervised deep learning approaches such as Cryo-CARE, IsoNet, and DeepDeWedge, while being two orders of magnitude faster. To achieve this, we implement a local, memory-efficient reconstruction network. We demonstrate that leveraging transform-domain locality makes our network robust to distribution shifts, enabling effective supervised training and giving excellent results on real data without retraining or fine-tuning. CryoLithe reconstructions facilitate downstream cryo-ET analysis, including segmentation and subtomogram averaging, and the code is openly available: https://github.com/swing-research/CryoLithe.
[463] Skull-stripping induces shortcut learning in MRI-based Alzheimer’s disease classification
Christian Tinauer, Maximilian Sackl, Rudolf Stollberger, Reinhold Schmidt, Stefan Ropele, Christian Langkammer
Main category: eess.IV
TL;DR: Deep neural networks for Alzheimer’s disease classification from MRI rely on volumetric features and preprocessing artifacts rather than gray-white matter texture, demonstrating shortcut learning behavior.
Details
Motivation: To understand which specific image features contribute to AD classification decisions in deep neural networks, particularly assessing the roles of texture, volumetric information, and preprocessing artifacts.
Method: Used 990 T1w MRIs from the ADNI database, varied preprocessing through skull-stripping and intensity binarization, trained 3D CNNs on each configuration, and analyzed feature relevance using Layer-wise Relevance Propagation and clustering.
Result: Classification performance remained stable across preprocessing conditions despite substantial image content differences. Models relied on volumetric features and brain contours from skull-stripping rather than gray-white matter texture.
Conclusion: The study reveals shortcut learning where preprocessing artifacts serve as unintended cues (Clever Hans effect), highlighting the importance of interpretability tools to detect biases and ensure trustworthy medical AI.
Abstract: Objectives: High classification accuracy of Alzheimer’s disease (AD) from structural MRI has been achieved using deep neural networks, yet the specific image features contributing to these decisions remain unclear. In this study, the contributions of T1-weighted (T1w) gray-white matter texture, volumetric information, and preprocessing (particularly skull-stripping) were systematically assessed. Methods: A dataset of 990 matched T1w MRIs from AD patients and cognitively normal controls from the ADNI database was used. Preprocessing was varied through skull-stripping and intensity binarization to isolate texture and shape contributions. A 3D convolutional neural network was trained on each configuration, and classification performance was compared using exact McNemar tests with discrete Bonferroni-Holm correction. Feature relevance was analyzed using Layer-wise Relevance Propagation, image similarity metrics, and spectral clustering of relevance maps. Results: Despite substantial differences in image content, classification accuracy, sensitivity, and specificity remained stable across preprocessing conditions. Models trained on binarized images preserved performance, indicating minimal reliance on gray-white matter texture. Instead, volumetric features, particularly brain contours introduced through skull-stripping, were consistently used by the models. Conclusions: This behavior reflects a shortcut learning phenomenon, where preprocessing artifacts act as potentially unintended cues. The resulting Clever Hans effect emphasizes the critical importance of interpretability tools to reveal hidden biases and to ensure robust and trustworthy deep learning in medical imaging.
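The binarization ablation is easy to reproduce in spirit. The sketch below removes gray-white matter texture while preserving shape and volume information; the foreground-mean threshold is an assumption, as the summary does not state the paper's exact procedure.

```python
import numpy as np

def binarize_t1w(volume):
    """Binarize a (skull-stripped) T1w volume: intensity texture is destroyed,
    while shape and volume information survives. The foreground-mean threshold
    is an illustrative choice."""
    threshold = volume[volume > 0].mean()
    return (volume > threshold).astype(np.float32)

vol = np.abs(np.random.randn(160, 192, 160))  # stand-in for a T1w volume
binary = binarize_t1w(vol)                    # input for the ablation classifier
```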
[464] Universal Vessel Segmentation for Multi-Modality Retinal Images
Bo Wen, Anna Heinke, Akshay Agnihotri, Dirk-Uwe Bartsch, William Freeman, Truong Nguyen, Cheolhong An
Main category: eess.IV
TL;DR: This paper introduces a universal retinal vessel segmentation model (URVSM) that works across multiple retinal image modalities without requiring modality-specific fine-tuning, addressing limitations of existing single-modality approaches.
Details
Motivation: Existing retinal vessel segmentation studies are limited to single modalities (mainly Color Fundus) and require separate fine-tuning for new modalities, which needs additional training data that is difficult to acquire.
Method: Proposed a universal vessel segmentation model (URVSM) that can segment vessels across multiple retinal image modalities without requiring modality-specific fine-tuning or extra training data.
Result: The universal model demonstrates comparable performance to state-of-the-art fine-tuned methods while being much more versatile across different modalities.
Conclusion: This is the first work to achieve modality-agnostic retinal vessel segmentation and the first to study vessel segmentation in several novel retinal image modalities.
Abstract: We identify two major limitations in the existing studies on retinal vessel segmentation: (1) Most existing works are restricted to one modality, i.e., the Color Fundus (CF). However, multi-modality retinal images are used every day in the study of the retina and the diagnosis of retinal diseases, and studies of vessel segmentation on other modalities are scarce; (2) Even though a few works extended their experiments to new modalities such as Multi-Color Scanning Laser Ophthalmoscopy (MC), these works still require fine-tuning a separate model for the new modality. This fine-tuning requires extra training data, which is difficult to acquire. In this work, we present a novel universal vessel segmentation model (URVSM) for multi-modality retinal images. In addition to performing the study on a much wider range of image modalities, we also propose a universal model to segment the vessels in all these commonly used modalities. While being much more versatile than existing methods, our universal model also demonstrates performance comparable to the state-of-the-art fine-tuned methods. To the best of our knowledge, this is the first work that achieves modality-agnostic retinal vessel segmentation and the first to study retinal vessel segmentation in several novel modalities.
[465] TransVFC: A Transformable Video Feature Compression Framework for Machines
Yuxiao Sun, Yao Zhao, Meiqin Liu, Chao Yao, Huihui Bai, Chunyu Lin, Weisi Lin
Main category: eess.IV
TL;DR: TransVFC is a video feature compression framework for machine vision tasks that compresses features before transferring them to different downstream tasks using lightweight Feature Space Transform modules.
Details
Motivation: Current video coding standards like H.265/HEVC are optimized for human vision but inefficient for machine vision tasks, and existing Video Coding for Machines approaches lack adaptability for multi-task scenarios.
Method: A compress-then-transfer framework with a video feature codec using scheme-based inter-prediction to reduce temporal redundancy and perception-guided conditional coding to minimize spatial redundancy, plus Feature Space Transform modules to adapt features for different tasks.
Result: TransVFC achieves high rate-task performance for diverse tasks of different granularities and requires only lightweight FST module training for new tasks without retraining the entire codec or task networks.
Conclusion: The framework provides valuable insights for video feature compression in multi-task scenarios and enables efficient adaptation to new downstream tasks with minimal retraining.
Abstract: Nowadays, more and more video transmissions primarily target downstream machine vision tasks rather than human viewers. While widely deployed Human Visual System (HVS) oriented video coding standards like H.265/HEVC and H.264/AVC are efficient, they are not the optimal approaches for Video Coding for Machines (VCM) scenarios, leading to unnecessary bitrate expenditure. The academic and technical exploration within the VCM domain has led to the development of several strategies, yet conspicuous limitations remain in their adaptability to multi-task scenarios. To address this challenge, we propose a Transformable Video Feature Compression (TransVFC) framework. It offers a compress-then-transfer solution and includes a video feature codec and Feature Space Transform (FST) modules. In particular, the temporal redundancy of video features is squeezed by the codec through the scheme-based inter-prediction module. Then, the codec implements perception-guided conditional coding to minimize spatial redundancy and help the reconstructed features align with downstream machine perception. After that, the reconstructed features are transferred to new feature spaces for diverse downstream tasks by FST modules. To accommodate a new downstream task, only one lightweight FST module needs to be trained, avoiding retraining and redeploying the upstream codec and downstream task networks. Experiments show that TransVFC achieves high rate-task performance for diverse tasks of different granularities. We expect our work can provide valuable insights for video feature compression in multi-task scenarios. The code is available at https://github.com/Ws-Syx/TransVFC.
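A hedged sketch of the per-task FST adapter idea: only this small module is trained for a new task, while the codec and task networks remain frozen. The 1x1-conv bottleneck design below is illustrative, not the paper's specification.

```python
import torch
import torch.nn as nn

class FeatureSpaceTransform(nn.Module):
    """Lightweight adapter mapping reconstructed codec features into a
    downstream task's feature space; only this module is trained per task."""
    def __init__(self, in_ch, task_ch, hidden=64):
        super().__init__()
        self.adapt = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, task_ch, kernel_size=3, padding=1),
        )

    def forward(self, recon_features):
        return self.adapt(recon_features)

fst = FeatureSpaceTransform(in_ch=128, task_ch=256)
task_features = fst(torch.randn(1, 128, 32, 32))  # fed to the frozen task head
```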
[466] A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography
Yui Lo, Yuqian Chen, Dongnan Liu, Leo Zekelman, Jarrett Rushmore, Yogesh Rathi, Nikos Makris, Alexandra J. Golby, Fan Zhang, Weidong Cai, Lauren J. O’Donnell
Main category: eess.IV
TL;DR: Tract2Shape is a multimodal deep learning framework that predicts white matter tractography shape measures using geometric and scalar features, achieving superior performance and cross-dataset generalization compared to state-of-the-art methods.
Details
Motivation: Conventional methods for computing white matter shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations.
Method: A multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features with dimensionality reduction to predict five primary shape components, trained and evaluated on the HCP-YA and PPMI datasets.
Result: Outperforms SOTA deep learning models across all ten shape measures, achieving highest average Pearson’s r and lowest nMSE on HCP-YA dataset. Maintains high performance on unseen PPMI dataset, demonstrating strong cross-dataset generalization.
Conclusion: Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures, supporting scalable analysis across datasets and laying foundation for future large-scale white matter shape analysis.
Abstract: Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson’s r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson’s r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets. This framework lays a promising foundation for future large-scale white matter shape analysis.
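The dimensionality-reduction step can be sketched with scikit-learn: compress the ten correlated shape measures into the five components the network predicts, then invert the PCA at inference. The data below is synthetic, and the component count follows the summary above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Compress ten correlated shape measures into five principal components,
# predict the components, then invert the PCA to recover all ten measures.
shape_measures = np.random.rand(500, 10)          # subjects x 10 shape measures
pca = PCA(n_components=5).fit(shape_measures)
targets = pca.transform(shape_measures)           # training targets, (500, 5)
predicted = targets                               # stand-in for network outputs
recovered = pca.inverse_transform(predicted)      # back to 10 shape measures
```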
[467] UNet with Self-Adaptive Mamba-Like Attention and Causal-Resonance Learning for Medical Image Segmentation
Saqib Qamar, Mohd Fazil, Parvez Ahmad, Shakir Khan, Abu Taha Zamani
Main category: eess.IV
TL;DR: SAMA-UNet is a novel U-shaped architecture for medical image segmentation that combines efficiency and accuracy by integrating local and global features through Self-Adaptive Mamba-like Aggregated Attention blocks and causal resonance multi-scale modules.
Details
Motivation: Existing deep learning models face trade-offs between efficiency and accuracy in medical image segmentation. CNNs capture local details but miss global context, while transformers handle global context at high computational cost. State Space Sequence Models show potential but have limitations in medical imaging applications.
Method: Proposes SAMA-UNet with two key innovations: 1) a Self-Adaptive Mamba-like Aggregated Attention (SAMA) block that adaptively integrates local and global features through dynamic attention weighting, and 2) a causal resonance multi-scale module (CR-MSM) that improves encoder-decoder interactions by adjusting feature resolution and causal dependencies across scales, enhancing the semantic alignment between low- and high-level features.
Result: Extensive experiments on MRI, CT, and endoscopy datasets show SAMA-UNet outperforms CNN, Transformer, and Mamba-based methods. Achieves 85.38% DSC and 87.82% NSD on BTCV, 92.16% and 96.54% on ACDC, 67.14% and 68.70% on EndoVis17, and 84.06% and 88.47% on ATLAS23, establishing new benchmarks across modalities.
Conclusion: SAMA-UNet effectively combines efficiency and accuracy, making it a promising solution for real-world clinical segmentation tasks. The method demonstrates superior performance across multiple medical imaging modalities compared to existing approaches.
Abstract: Medical image segmentation plays an important role in various clinical applications; however, existing deep learning models face trade-offs between efficiency and accuracy. Convolutional Neural Networks (CNNs) capture local details well but miss the global context, whereas transformers handle the global context but at a high computational cost. Recently, State Space Sequence Models (SSMs) have shown potential for capturing long-range dependencies with linear complexity; however, their direct use in medical image segmentation remains limited due to incompatibility with image structures and autoregressive assumptions. To overcome these challenges, we propose SAMA-UNet, a novel U-shaped architecture that introduces two key innovations. First, the Self-Adaptive Mamba-like Aggregated Attention (SAMA) block adaptively integrates local and global features through dynamic attention weighting, enabling an efficient representation of complex anatomical patterns. Second, the causal resonance multi-scale module (CR-MSM) improves encoder-decoder interactions by adjusting feature resolution and causal dependencies across scales, enhancing the semantic alignment between low- and high-level features. Extensive experiments on MRI, CT, and endoscopy datasets demonstrate that SAMA-UNet consistently outperforms CNN, Transformer, and Mamba-based methods. It achieves 85.38% DSC and 87.82% NSD on BTCV, 92.16% and 96.54% on ACDC, 67.14% and 68.70% on EndoVis17, and 84.06% and 88.47% on ATLAS23, establishing new benchmarks across modalities. These results confirm the effectiveness of SAMA-UNet in combining efficiency and accuracy, making it a promising solution for real-world clinical segmentation tasks. The source code is available on GitHub.
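As a conceptual sketch of the SAMA block's dynamic attention weighting, the module below gates a local convolutional branch against a cheap global-context branch per channel; the actual SAMA block is considerably more elaborate, so treat this only as an illustration of the gating idea.

```python
import torch
import torch.nn as nn

class LocalGlobalGate(nn.Module):
    """Per-channel gate that dynamically trades a local convolutional branch
    against a cheap global-context branch (illustrative design)."""
    def __init__(self, ch):
        super().__init__()
        self.local = nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch)
        self.global_mix = nn.Conv2d(ch, ch, kernel_size=1)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        ctx = x.mean(dim=(2, 3), keepdim=True)   # global context, (B, C, 1, 1)
        g = self.gate(x)                         # per-channel weights in (0, 1)
        return g * self.local(x) + (1 - g) * (x + self.global_mix(ctx))

out = LocalGlobalGate(32)(torch.randn(2, 32, 64, 64))  # same shape as input
```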
[468] Implicit neural representations for accurate estimation of the standard model of white matter
Tom Hendriks, Gerrit Arends, Edwin Versteeg, Anna Vilanova, Maxime Chamberland, Chantal M. W. Tax
Main category: eess.IV
TL;DR: INR-based framework for estimating Standard Model parameters in diffusion MRI, showing superior accuracy in low SNR conditions and enabling continuous spatial upsampling.
Details
Motivation: The Standard Model of white matter has high-dimensional parameter estimation challenges, and existing methods struggle with accuracy, especially in noisy conditions.
Method: Uses implicit neural representations (INRs) with spatial regularization through sinusoidal encoding of the input coordinates, enabling self-supervised learning without labeled data.
Result: INR method achieves superior accuracy in estimating SM parameters compared to existing methods, particularly in low signal-to-noise conditions, and supports spatial upsampling.
Conclusion: INRs are a promising tool for diffusion MRI analysis due to their accuracy, noise robustness, fast inference, and ability to handle complex parameter estimation without labeled data.
Abstract: Diffusion magnetic resonance imaging (dMRI) enables non-invasive investigation of tissue microstructure. The Standard Model (SM) of white matter aims to disentangle dMRI signal contributions from intra- and extra-axonal water compartments. However, due to the model’s high-dimensional nature, accurately estimating its parameters poses a complex problem and remains an active field of research, in which different (machine learning) strategies have been proposed. This work introduces an estimation framework based on implicit neural representations (INRs), which incorporate spatial regularization through the sinusoidal encoding of the input coordinates. The INR method is evaluated on both synthetic and in vivo datasets and compared to existing methods. Results demonstrate superior accuracy of the INR method in estimating SM parameters, particularly in low signal-to-noise conditions. Additionally, spatial upsampling of the INR can represent the underlying dataset anatomically plausibly in a continuous way. The INR is self-supervised, eliminating the need for labeled training data. It achieves fast inference, is robust to noise, supports joint estimation of SM kernel parameters and the fiber orientation distribution function with spherical harmonics orders up to at least 8, and accommodates gradient non-uniformity corrections. The combination of these properties positions INRs as a potentially important tool for analyzing and interpreting diffusion MRI data.
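A minimal sketch of the sinusoidal coordinate encoding feeding an INR head; the layer widths, frequency count, and the number of output SM parameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

def sinusoidal_encoding(coords, num_freqs=6):
    """Encode coordinates with sines/cosines at doubling frequencies.
    coords: (N, 3) voxel coordinates scaled to [-1, 1]."""
    feats = [coords]
    for k in range(num_freqs):
        feats.append(torch.sin(2 ** k * torch.pi * coords))
        feats.append(torch.cos(2 ** k * torch.pi * coords))
    return torch.cat(feats, dim=-1)              # (N, 3 + 2 * num_freqs * 3)

# INR head mapping encoded coordinates to a handful of SM parameters.
inr = nn.Sequential(
    nn.Linear(3 + 2 * 6 * 3, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 6),                           # e.g., six kernel parameters
)
coords = torch.rand(1024, 3) * 2 - 1
sm_params = inr(sinusoidal_encoding(coords))
```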
[469] Foundation Model-Driven Classification of Atypical Mitotic Figures with Domain-Aware Training Strategies
Piotr Giedziun, Jan Sołtysik, Mateusz Górczany, Norbert Ropiak, Marcin Przymus, Piotr Krajewski, Jarosław Kwiecień, Artur Bartczak, Izabela Wasiak, Mateusz Maniewski
Main category: eess.IV
TL;DR: A solution for MIDOG 2025 Challenge Track 2 that uses the H-optimus-0 foundation model with LoRA fine-tuning and MixUp augmentation for binary classification of normal vs atypical mitotic figures.
Details
Motivation: To address the complex binary classification task of distinguishing normal mitotic figures (NMFs) from atypical mitotic figures (AMFs) in pathology images.
Method: Leverages the pathology-specific foundation model H-optimus-0 with LoRA fine-tuning, MixUp augmentation, soft labels from multi-expert consensus, hard negative mining, adaptive focal loss, metric learning, and domain adaptation.
Result: Achieved reasonable performance in the preliminary evaluation phase, demonstrating both promise and challenges of applying foundation models to this classification task.
Conclusion: The approach shows potential for using foundation models in complex pathology classification tasks, though challenges remain in fully optimizing their application.
Abstract: We present a solution for the MIDOG 2025 Challenge Track 2, addressing binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs). The approach leverages the pathology-specific foundation model H-optimus-0, selected based on recent cross-domain generalization benchmarks and our empirical testing, with Low-Rank Adaptation (LoRA) fine-tuning and MixUp augmentation. Implementation includes soft labels based on multi-expert consensus, hard negative mining, adaptive focal loss, metric learning, and domain adaptation. The method demonstrates both the promise and challenges of applying foundation models to this complex classification task, achieving reasonable performance in the preliminary evaluation phase.
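A small sketch of MixUp applied jointly to image patches and the soft consensus labels described above; the Beta parameter and batch setup are illustrative assumptions, not the challenge entry's settings.

```python
import torch

def mixup(images, soft_labels, alpha=0.2):
    """MixUp: blend pairs of patches and their soft labels with a Beta-sampled
    coefficient, encouraging smoother decision boundaries."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(images.size(0))
    mixed_x = lam * images + (1 - lam) * images[idx]
    mixed_y = lam * soft_labels + (1 - lam) * soft_labels[idx]
    return mixed_x, mixed_y

x = torch.randn(16, 3, 224, 224)   # batch of mitotic-figure patches
y = torch.rand(16, 1)              # multi-expert consensus soft labels
mx, my = mixup(x, y)
```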