Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 73]
cs.CV [Total: 94]
cs.AI [Total: 25]
cs.SD [Total: 7]
cs.LG [Total: 136]
cs.MA [Total: 1]
cs.MM [Total: 2]
eess.AS [Total: 2]
eess.IV [Total: 7]

cs.CL

[1] Evaluating LLMs’ Reasoning Over Ordered Procedural Steps

Adrita Anika, Md Messal Monem Miah

Main category: cs.CL

TL;DR: LLMs struggle with reconstructing correct sequences from shuffled procedural steps, especially with longer sequences and more severe shuffling.

Details

Motivation: To evaluate LLMs' capability in reasoning over procedural sequences where step order directly impacts outcomes, using food recipes as a test domain.

Method: Evaluated multiple LLMs under zero-shot and few-shot settings on a curated dataset of food recipes, using metrics like Kendall’s Tau, NLCS, and NED to measure ordering quality.

Result: Model performance declines with increasing sequence length and greater step displacement (more severe shuffling), highlighting limitations in procedural reasoning.

Conclusion: Current LLMs have significant limitations in procedural reasoning tasks, particularly with longer and more disordered procedural sequences.

Abstract: Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). In this work, we study the task of reconstructing globally ordered sequences from shuffled procedural steps, using a curated dataset of food recipes, a domain where correct sequencing is essential for task success. We evaluate several LLMs under zero-shot and few-shot settings and present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment. These include Kendall’s Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), which capture complementary aspects of ordering quality. Our analysis shows that model performance declines with increasing sequence length, reflecting the added complexity of longer procedures. We also find that greater step displacement in the input, corresponding to more severe shuffling, leads to further degradation. These findings highlight the limitations of current LLMs in procedural reasoning, especially with longer and more disordered inputs.
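
The three ordering metrics named above can be sketched in a few lines. This is a minimal illustration, assuming a prediction is a permutation of the gold step IDs; the normalizations used here (LCS length divided by gold length, edit distance divided by the longer sequence) are common conventions and are assumptions, not necessarily the paper's exact definitions.

```python
from itertools import combinations


def kendall_tau(pred, gold):
    """Kendall's Tau between two orderings of the same items (no ties)."""
    pos = {step: i for i, step in enumerate(pred)}
    n = len(gold)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for a, b in combinations(gold, 2):  # a precedes b in the gold order
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def lcs_length(x, y):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(x, y):
    """Levenshtein distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, 1):
        cur = [i]
        for j, yj in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (xi != yj)))
        prev = cur
    return prev[-1]


def nlcs(pred, gold):
    """Normalized LCS: 1.0 means the full gold order is preserved."""
    return lcs_length(pred, gold) / max(len(gold), 1)


def ned(pred, gold):
    """Normalized edit distance: 0.0 means identical sequences."""
    return edit_distance(pred, gold) / max(len(pred), len(gold), 1)


gold = [1, 2, 3, 4]
pred = [2, 1, 3, 4]  # first two steps swapped
print(kendall_tau(pred, gold), nlcs(pred, gold), ned(pred, gold))
```

A single adjacent swap leaves most pairwise relations intact (Tau stays high) but costs two substitutions under edit distance, which is one way the metrics capture complementary aspects of ordering quality.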

[2] Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, Nitesh V. Chawla

Main category: cs.CL

TL;DR: ATLAS is an adaptive testing framework that uses Item Response Theory to reduce benchmark evaluation costs by 90% while maintaining precision, revealing annotation errors in existing benchmarks and showing that IRT-based rankings differ significantly from traditional accuracy rankings.

Details

Motivation: Traditional LLM evaluation requires thousands of benchmark items, making evaluations expensive and slow, while treating all items equally despite varying quality and informativeness.

Method: Uses Item Response Theory (IRT) with Fisher information-guided item selection to adaptively estimate model ability, reducing the number of items needed for evaluation.

Result: Achieves 90% item reduction while maintaining measurement precision (0.154 MAE on HellaSwag with only 42 items vs 5,608 full benchmark). Reveals 3-6% of items have negative discrimination (annotation errors) and shows 23-31% of models shift by more than 10 rank positions compared to accuracy-based rankings.

Conclusion: ATLAS provides a more efficient and accurate evaluation framework that exposes flaws in static benchmarks and reveals meaningful differences in model capabilities that traditional accuracy rankings miss.

Abstract: Large language model evaluation requires thousands of benchmark items, making evaluations expensive and slow. Existing methods compute average accuracy across fixed item sets, treating all items equally despite varying quality and informativeness. We present ATLAS, an adaptive testing framework using Item Response Theory (IRT) to estimate model ability through Fisher information-guided item selection. Our analysis of five major benchmarks reveals that 3-6% of items exhibit negative discrimination, indicating annotation errors that corrupt static evaluation. ATLAS achieves 90% item reduction while maintaining measurement precision: on HellaSwag (5,608 items), we match full-benchmark estimates using only 42 items with 0.154 MAE. Our framework maintains item exposure rates below 10% and test overlap at 16-27%, compared to static benchmarks where every model sees all items (100% exposure). Among 4,000+ tested models, IRT ranks differ from accuracy ranks: models with the same accuracy get different IRT scores, and 23-31% of all models shift by more than 10 rank positions. Code and calibrated item banks are available at https://github.com/Peiyu-Georgia-Li/ATLAS.git.
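
Fisher information-guided item selection can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) IRT model, where each item has a discrimination a and difficulty b, and a grid-based posterior-mean ability estimate under a standard normal prior. The summary does not state which IRT model or estimator ATLAS uses, so these choices and all function names are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability that a model with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Item information under 2PL: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, item_bank, administered):
    """Pick the unused item that is most informative at the current estimate."""
    candidates = [i for i in range(len(item_bank)) if i not in administered]
    return max(candidates, key=lambda i: fisher_information(theta, *item_bank[i]))


def estimate_ability(responses, item_bank, grid=None):
    """Posterior-mean ability on a grid, with a standard normal prior."""
    grid = grid or [g / 10.0 for g in range(-40, 41)]
    weights = []
    for theta in grid:
        log_w = -0.5 * theta * theta  # log of the unnormalized N(0, 1) prior
        for i, correct in responses.items():
            p = p_correct(theta, *item_bank[i])
            log_w += math.log(p if correct else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total


def run_adaptive_test(answer_fn, item_bank, max_items):
    """Alternate between item selection and ability re-estimation."""
    theta, responses = 0.0, {}
    for _ in range(min(max_items, len(item_bank))):
        item = select_next_item(theta, item_bank, responses)
        responses[item] = answer_fn(item)  # True if the model answered correctly
        theta = estimate_ability(responses, item_bank)
    return theta, responses


# Hypothetical calibrated bank: (discrimination a, difficulty b) per item.
bank = [(1.0, -2.0), (2.0, 0.0), (1.0, 2.0)]
theta, seen = run_adaptive_test(lambda item: True, bank, max_items=2)
print(theta, sorted(seen))
```

Under this model, an item with negative discrimination (a < 0) is one that stronger models are less likely to answer correctly, which is why such items are read as a sign of annotation errors.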

[3] SARC: Sentiment-Augmented Deep Role Clustering for Fake News Detection

Jingqing Wang, Jiaxing Shang, Rong Xu, Fei Hao, Tianjin Huang, Geyong Min

Main category: cs.CL

TL;DR: SARC is a sentiment-augmented role clustering framework that improves fake news detection by identifying user roles through sentiment-enhanced deep clustering and joint optimization with fake news detection.

Details

Motivation: Existing fake news detection methods treat sentiment features as auxiliary signals without considering role differentiation - the same sentiment polarity may come from users with different roles, limiting detection effectiveness.

Method: Proposes SARC framework that: 1) generates user features via joint comment text representation (BiGRU + Attention) and sentiment encoding, 2) uses differentiable deep clustering to categorize user roles automatically, 3) employs joint optimization integrating role clustering and fake news detection.

Result: Experimental results on RumourEval-19 and Weibo-comp datasets show SARC achieves superior performance across all metrics compared to baseline models.

Conclusion: SARC effectively addresses the role differentiation problem in fake news detection by integrating sentiment-augmented role clustering with detection tasks, demonstrating improved performance over existing approaches.

Abstract: Fake news detection has been a long-standing research focus in social networks. Recent studies suggest that incorporating sentiment information from both news content and user comments can enhance detection performance. However, existing approaches typically treat sentiment features as auxiliary signals and overlook role differentiation: the same sentiment polarity may originate from users with distinct roles. This limits their ability to capture nuanced patterns for effective detection. To address this issue, we propose SARC, a Sentiment-Augmented Role Clustering framework which utilizes sentiment-enhanced deep clustering to identify user roles for improved fake news detection. The framework first generates user features through joint comment text representation (with BiGRU and Attention mechanism) and sentiment encoding. It then constructs a differentiable deep clustering module to automatically categorize user roles. Finally, unlike existing approaches which use the fake news label as the sole supervision signal, we propose a joint optimization objective integrating role clustering and fake news detection to further improve the model performance. Experimental results on two benchmark datasets, RumourEval-19 and Weibo-comp, demonstrate that SARC achieves superior performance across all metrics compared to baseline models. The code is available at: https://github.com/jxshang/SARC.

[4] Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar

Main category: cs.CL

TL;DR: This paper proposes training LLMs to reason about instruction hierarchies, where higher-priority system instructions override conflicting user requests, improving reliability and safety.

Details

Motivation: As LLMs take on high-stakes roles, they need to reconcile competing instructions from multiple sources (developers, users, tools) within prompts, requiring reliable instruction hierarchy enforcement.

Method: Constructed VerIH, a dataset of constraint-following tasks with verifiable answers that contains both aligned and conflicting system-user instructions, and used lightweight reinforcement learning on it to transfer general reasoning to instruction prioritization.

Result: Finetuned models show consistent improvements on instruction following benchmarks, enhanced robustness against jailbreak/prompt injection attacks, and generalization to safety-critical settings.

Conclusion: Reasoning over instruction hierarchies provides a practical path to reliable LLMs where system prompt updates yield controllable and robust behavior changes.

Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first “think” about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises both aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks. These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

[5] EncouRAGe: Evaluating RAG Local, Fast, and Reliable

Jan Strich, Adeline Scharfenberg, Chris Biemann, Martin Semmann

Main category: cs.CL

TL;DR: EncouRAGe is a Python framework for developing and evaluating RAG systems with LLMs, featuring modular components for flexible experimentation and emphasizing reproducibility.

Details

Motivation: To address the need for streamlined development and comprehensive evaluation of RAG systems, enabling researchers to efficiently assess datasets within RAG workflows with scientific reproducibility.

Method: Developed a modular framework with five components: Type Manifest, RAG Factory, Inference, Vector Store, and Metrics, supporting local deployment and diverse evaluation metrics.

Result: Evaluation across 25k QA pairs and 51k+ documents showed RAG underperforms compared to Oracle Context, while Hybrid BM25 consistently performed best across all datasets. Reranking provided only marginal improvements with higher latency.

Conclusion: EncouRAGe provides an effective framework for RAG system development and evaluation, revealing current RAG limitations and identifying Hybrid BM25 as the most consistent performer.

Abstract: We introduce EncouRAGe, a comprehensive Python framework designed to streamline the development and evaluation of Retrieval-Augmented Generation (RAG) systems using Large Language Models (LLMs) and Embedding Models. EncouRAGe comprises five modular and extensible components: Type Manifest, RAG Factory, Inference, Vector Store, and Metrics, facilitating flexible experimentation and extensible development. The framework emphasizes scientific reproducibility, diverse evaluation metrics, and local deployment, enabling researchers to efficiently assess datasets within RAG workflows. This paper presents implementation details and an extensive evaluation across multiple benchmark datasets, including 25k QA pairs and over 51k documents. Our results show that RAG still underperforms compared to the Oracle Context, while Hybrid BM25 consistently achieves the best results across all four datasets. We further examine the effects of reranking, observing only marginal performance improvements accompanied by higher response latency.

[6] multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder

K M Sajjadul Islam, John Fields, Praveen Madiraju

Main category: cs.CL

TL;DR: multiMentalRoBERTa is a fine-tuned RoBERTa model for multiclass mental health classification that outperforms existing methods and provides interpretable results for detecting stress, anxiety, depression, PTSD, suicidal ideation, and neutral discourse.

Details

Motivation: Early detection of mental health disorders from social media text is critical for timely support, risk assessment, and resource referral.

Method: Fine-tuned RoBERTa model using multiple curated datasets, with comparative experiments against traditional ML, domain-specific transformers, and LLMs. Applied explainability methods like Layer Integrated Gradients and KeyBERT.

Result: Achieved superior performance with macro F1-scores of 0.839 (six-class) and 0.870 (five-class, excluding stress), outperforming fine-tuned MentalBERT and baseline classifiers. Identified strong correlations between depression and suicidal ideation, and between anxiety and PTSD.

Conclusion: multiMentalRoBERTa is a lightweight, robust, deployable solution that demonstrates the effectiveness of fine-tuned transformers for reliable and interpretable mental health detection, while underscoring the importance of fairness, bias mitigation, and human-in-the-loop safety protocols.

Abstract: The early detection of mental health disorders from social media text is critical for enabling timely support, risk assessment, and referral to appropriate resources. This work introduces multiMentalRoBERTa, a fine-tuned RoBERTa model designed for multiclass classification of common mental health conditions, including stress, anxiety, depression, post-traumatic stress disorder (PTSD), suicidal ideation, and neutral discourse. Drawing on multiple curated datasets, data exploration is conducted to analyze class overlaps, revealing strong correlations between depression and suicidal ideation as well as anxiety and PTSD, while stress emerges as a broad, overlapping category. Comparative experiments with traditional machine learning methods, domain-specific transformers, and prompting-based large language models demonstrate that multiMentalRoBERTa achieves superior performance, with macro F1-scores of 0.839 in the six-class setup and 0.870 in the five-class setup (excluding stress), outperforming both fine-tuned MentalBERT and baseline classifiers. Beyond predictive accuracy, explainability methods, including Layer Integrated Gradients and KeyBERT, are applied to identify lexical cues that drive classification, with a particular focus on distinguishing depression from suicidal ideation. The findings emphasize the effectiveness of fine-tuned transformers for reliable and interpretable detection in sensitive contexts, while also underscoring the importance of fairness, bias mitigation, and human-in-the-loop safety protocols. Overall, multiMentalRoBERTa is presented as a lightweight, robust, and deployable solution for enhancing support in mental health platforms.

[7] Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding

Haneen Al-Homoud, Asma Ibrahim, Murtadha Al-Jubran, Fahad Al-Otaibi, Yazeed Al-Harbi, Daulet Toibazar, Kesen Wang, Pedro J. Moreno

Main category: cs.CL

TL;DR: Cross-Lingual SynthDocs is a large-scale synthetic corpus with over 2.5 million samples addressing Arabic OCR and Document Understanding resource scarcity, showing improved performance across multiple benchmarks when used for finetuning.

Details

Motivation: Address the scarcity of Arabic resources for Optical Character Recognition (OCR) and Document Understanding (DU) by creating a large-scale synthetic corpus.

Method: Created a synthetic corpus using authentic scanned backgrounds, bilingual layouts, diacritic aware fonts, and various rendered styles for charts and tables. The pipeline captures typographic and structural complexity of Arabic documents.

Result: Finetuning Qwen-2.5-VL on SynthDocs yields consistent improvements in Word Error Rate (WER) and Character Error Rate (CER) for OCR across multiple public Arabic benchmarks, with improvements also seen in Tree-Edit Distance Similarity (TEDS) and Chart Extraction Score (CharTeX) for other modalities.

Conclusion: SynthDocs provides a scalable, visually realistic resource for advancing research in multilingual document analysis, effectively addressing the resource gap for Arabic document processing.

Abstract: Cross-Lingual SynthDocs is a large-scale synthetic corpus designed to address the scarcity of Arabic resources for Optical Character Recognition (OCR) and Document Understanding (DU). The dataset comprises over 2.5 million samples, including 1.5 million text samples, 270K fully annotated tables, and hundreds of thousands of charts based on real data. Our pipeline leverages authentic scanned backgrounds, bilingual layouts, and diacritic-aware fonts to capture the typographic and structural complexity of Arabic documents. In addition to text, the corpus includes a variety of rendered styles for charts and tables. Finetuning Qwen-2.5-VL on SynthDocs yields consistent improvements in Word Error Rate (WER) and Character Error Rate (CER) for OCR across multiple public Arabic benchmarks; Tree-Edit Distance Similarity (TEDS) and Chart Extraction Score (CharTeX) also improve for the other modalities. SynthDocs provides a scalable, visually realistic resource for advancing research in multilingual document analysis.
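
For reference, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below uses the standard definitions. It applies no text normalization (such as diacritic or whitespace handling), which real Arabic OCR benchmarks may perform; that omission is an assumption of this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat", "the bat sat"))  # one of three words is wrong
print(cer("abcd", "abxd"))                # one of four characters is wrong
```

Lower is better for both metrics, and either can exceed 1.0 when the hypothesis contains many insertions.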

[8] Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation

Song Wang, Zihan Chen, Peng Wang, Zhepei Wei, Zhen Tan, Yu Meng, Cong Shen, Jundong Li

Main category: cs.CL

TL;DR: WinnowRAG is a novel RAG framework that addresses noise from retrieving multiple documents by using query-aware clustering and iterative winnowing to filter out irrelevant content while preserving valuable information.

Details

Motivation: Traditional RAG systems face a trade-off: retrieving more documents increases chances of finding relevant information but introduces significant noise from irrelevant documents, reducing answer accuracy.

Method: Two-stage approach: Stage I performs query-aware clustering to group similar documents and assigns each cluster to an LLM agent for answer generation. Stage II uses a critic LLM to evaluate agent outputs and iteratively winnow out noisy documents while preserving useful content through strategic merging techniques.

Result: Extensive experiments on realistic datasets show WinnowRAG outperforms state-of-the-art baselines in handling document noise while maintaining relevant information.

Conclusion: WinnowRAG effectively addresses the noise problem in multi-document RAG systems through systematic winnowing, is model-agnostic without requiring fine-tuning, and demonstrates superior performance across various tasks.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources to address their limitations in accessing up-to-date or specialized information. A natural strategy to increase the likelihood of retrieving relevant information is to expand the number of retrieved documents. However, involving more documents could introduce significant noise, as many documents may be irrelevant or misleading, thereby reducing the overall accuracy of the generated responses. To overcome the challenge associated with handling a larger number of documents, we propose WinnowRAG, a novel RAG framework designed to systematically filter out noisy documents while preserving valuable content – a process we refer to as winnowing. WinnowRAG operates in two stages: In Stage I, we perform query-aware clustering to group similar documents and form distinct topic clusters. Each cluster is assigned to an LLM agent for generating a unique answer. In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones. To retain useful documents when discarding agents, we propose two strategic merging techniques to ensure that only relevant knowledge is used for generating the final response. Crucially, WinnowRAG is model-agnostic and does not require any model fine-tuning, making it easily adaptable to various tasks. Extensive experiments on various realistic datasets demonstrate the effectiveness of WinnowRAG over state-of-the-art baselines.

[9] Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May Liu, Lennart Luettgau, Jabez Magomere, Jonathan Rystrøm, Anna Sotnikova, Yushi Yang, Yilun Zhao, Adel Bibi, Antoine Bosselut, Ronald Clark, Arman Cohan, Jakob Foerster, Yarin Gal, Scott A. Hale, Inioluwa Deborah Raji, Christopher Summerfield, Philip H. S. Torr, Cozmin Ududec, Luc Rocher, Adam Mahdi

Main category: cs.CL

TL;DR: Systematic review of 445 LLM benchmarks reveals validity issues in measuring safety and robustness, leading to 8 recommendations for better benchmark development.

Details

Motivation: To assess construct validity of LLM benchmarks for measuring abstract phenomena like safety and robustness, as current evaluation methods may not reliably represent what matters.

Method: Conducted systematic review of 445 LLM benchmarks from leading NLP/ML conferences with 29 expert reviewers, analyzing patterns in measured phenomena, tasks, and scoring metrics.

Result: Found patterns that undermine validity of claims about LLM safety and robustness, identifying specific shortcomings in current benchmark practices.

Conclusion: Provides 8 key recommendations and actionable guidance to improve LLM benchmark development for researchers and practitioners.

Abstract: Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as ‘safety’ and ‘robustness’ requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.

[10] POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

Tingyue Yang, Junchi Yao, Yuhui Guo, Chang Liu

Main category: cs.CL

TL;DR: POLIS-Bench is the first systematic evaluation suite for LLMs in governmental bilingual policy scenarios, featuring up-to-date bilingual corpus, scenario-grounded tasks, and dual-metric evaluation framework.

Details

Motivation: There is a lack of rigorous evaluation benchmarks specifically designed for LLMs operating in governmental bilingual policy scenarios, which requires comprehensive assessment of model understanding and application in real-world governance contexts.

Method: Constructed an extensive up-to-date bilingual policy corpus, designed three scenario-grounded tasks (Clause Retrieval & Interpretation, Solution Generation, Compliance Judgment), and established a dual-metric evaluation framework combining semantic similarity with accuracy rate.

Result: Evaluation of 10+ state-of-the-art LLMs revealed reasoning models maintain superior cross-task stability and accuracy, with compliance tasks being particularly challenging. Fine-tuned lightweight open-source POLIS series models achieved parity or surpassed proprietary baselines on multiple policy subtasks at significantly reduced cost.

Conclusion: POLIS-Bench provides a comprehensive evaluation framework for policy-oriented LLMs, demonstrating that fine-tuned lightweight models can achieve competitive performance with proprietary models, offering a cost-effective path for real-world governmental deployment.

Abstract: We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks (Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment) to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy where reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we successfully fine-tune a lightweight open-source model. The resulting POLIS series models achieve parity with, or surpass, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path for robust real-world governmental deployment.

[11] GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models

Hari Mohan Pandey, Anshul Gupta, Subham Sarkar, Minakshi Tomer, Schneider Johannes, Yan Gong

Main category: cs.CL

TL;DR: GEMMA-SQL is a lightweight text-to-SQL model based on Gemma 2B that achieves state-of-the-art performance through efficient fine-tuning and advanced prompting strategies, making it deployable on low-cost hardware.

Details

Motivation: To create an accessible text-to-SQL system that doesn't require specialized programming knowledge, while being resource-efficient and deployable on low-cost hardware unlike many large language models.

Method: Fine-tuned Gemma 2B architecture in a resource-efficient, iterative manner using the SPIDER benchmark, combined with multiple prompting strategies including few-shot learning and instruction tuning.

Result: GEMMA-SQL Instruct achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming state-of-the-art baselines like IRNet, RYANSQL, and CodeXDavinci.

Conclusion: Effective prompt design and targeted instruction tuning can significantly boost text-to-SQL performance while maintaining scalability and adaptability, making GEMMA-SQL a practical open-source alternative for robust text-to-SQL systems.

Abstract: Text-to-SQL systems enable users to interact with structured databases using natural language, eliminating the need for specialized programming knowledge. In this work, we introduce GEMMA-SQL, a lightweight and efficient text-to-SQL model built upon the open-source Gemma 2B architecture. Unlike many large language models (LLMs), GEMMA-SQL is fine-tuned in a resource-efficient, iterative manner and can be deployed on low-cost hardware. Leveraging the SPIDER benchmark for training and evaluation, GEMMA-SQL combines multiple prompting strategies, including few-shot learning, to enhance SQL query generation accuracy. The instruction-tuned variant, GEMMA-SQL Instruct, achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming several state-of-the-art baselines such as IRNet, RYANSQL, and CodeXDavinci. The proposed approach demonstrates that effective prompt design and targeted instruction tuning can significantly boost performance while maintaining high scalability and adaptability. These results position GEMMA-SQL as a practical, open-source alternative for robust and accessible text-to-SQL systems.

[12] First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

Dmytro Vitel, Anshuman Chhabra

Main category: cs.CL

TL;DR: This paper challenges prior findings that first layers are best for LLM influence estimation, showing middle attention layers are better and proposing improved aggregation methods and evaluation metrics.

Details

Motivation: To improve training sample influence estimation in LLMs by addressing unreliable cancellation effects in prior methods and developing better layer selection and aggregation approaches.

Method: Proposed theoretical and empirical analysis of cancellation effect reliability, identified middle attention layers as better influence estimators, developed alternative aggregation methods (ranking and vote-based), and introduced Noise Detection Rate (NDR) metric for evaluation.

Result: Demonstrated that first layers are not necessarily better than last layers for LLM influence estimation, middle attention layers perform better, and alternative aggregation methods significantly improve performance.

Conclusion: Contrasts with prior knowledge by showing first layers aren’t superior for influence estimation, establishes middle attention layers as better estimators, and provides improved evaluation metrics and aggregation methods.

Abstract: Identifying how training samples influence Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions and auditing large-scale datasets. Current training sample influence estimation methods (also known as influence functions) pursue this goal by utilizing information flow through the model via its first-order and higher-order gradient terms. However, because today's models consist of billions of parameters, these influence computations are often restricted to some subset of model layers to ensure computational feasibility. Prior seminal work by Yeh et al. (2022) on which layers are best suited for computing language data influence concluded that the first (embedding) layers are the most informative for this purpose, using a hypothesis based on influence scores canceling out (i.e., the cancellation effect). In this work, we present theoretical and empirical evidence that the cancellation effect is unreliable and that middle attention layers are better estimators of influence. Furthermore, we address the broader challenge of aggregating influence scores across layers, and showcase how alternatives to standard averaging (such as ranking and vote-based methods) can lead to significantly improved performance. Finally, we propose better methods for evaluating influence score efficacy in LLMs without model retraining, including a new metric, the Noise Detection Rate (NDR), that exhibits strong predictive capability compared to the cancellation effect. Through extensive experiments across LLMs of varying types and scales, we concretely determine that the first layers are not necessarily better than the last layers for LLM influence estimation, contrasting with prior knowledge in the field.
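
To make the aggregation question concrete, the sketch below contrasts plain averaging with one generic rank-based and one generic vote-based scheme over per-layer influence scores. The summary does not describe the paper's exact procedures, so the mean-rank and top-k voting rules and the toy scores here are illustrative assumptions only.

```python
def average_scores(layer_scores):
    """Standard aggregation: mean raw influence score per training sample."""
    n = len(layer_scores[0])
    return [sum(layer[i] for layer in layer_scores) / len(layer_scores)
            for i in range(n)]


def rank_aggregate(layer_scores):
    """Mean rank per sample; rank 0 is the most influential within a layer."""
    n = len(layer_scores[0])
    totals = [0] * n
    for layer in layer_scores:
        order = sorted(range(n), key=lambda i: layer[i], reverse=True)
        for rank, i in enumerate(order):
            totals[i] += rank
    return [t / len(layer_scores) for t in totals]


def vote_aggregate(layer_scores, k):
    """Each layer casts one vote for each of its top-k samples."""
    n = len(layer_scores[0])
    votes = [0] * n
    for layer in layer_scores:
        top = sorted(range(n), key=lambda i: layer[i], reverse=True)[:k]
        for i in top:
            votes[i] += 1
    return votes


# Toy scores: 3 layers x 3 training samples; the third layer has a huge scale.
layers = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [300.0, 0.0, 0.0],
]
print(average_scores(layers))     # sample 0 dominates because of one layer
print(rank_aggregate(layers))     # sample 2 has the best (lowest) mean rank
print(vote_aggregate(layers, 1))  # sample 2 gets two of the three votes
```

In this toy case a single layer with large-magnitude scores decides the averaged result, while the rank and vote schemes depend only on each layer's ordering.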

[13] Learning to reason about rare diseases through retrieval-augmented agents

Ha Young Kim, Jun Li, Ana Beatriz Solana, Carolin M. Pirkl, Benedikt Wiestler, Julia A. Schnabel, Cosmin I. Bercea

Main category: cs.CL

TL;DR: RADAR is a retrieval-augmented agentic system that improves rare disease detection in brain MRI by using external medical knowledge from case reports and literature, achieving up to 10.2% performance gain without additional training.

Details

Motivation: Rare diseases in medical imaging often cause AI model failures due to scarce training data, mirroring how radiologists consult literature when encountering unfamiliar findings.

Method: Uses AI agents with sentence transformers to embed case reports and literature, indexed with FAISS for efficient similarity search, enabling retrieval of clinically relevant evidence for diagnostic decision making.

Result: Achieves up to 10.2% performance gain on NOVA dataset with 280 rare diseases, with strongest improvements for open source models like DeepSeek, while providing interpretable, literature-grounded explanations.

Conclusion: Retrieval-augmented reasoning is a powerful paradigm for low-prevalence conditions in medical imaging, improving both accuracy and interpretability without requiring additional model training.

Abstract: Rare diseases represent the long tail of medical imaging, where AI models often fail due to the scarcity of representative training data. In clinical workflows, radiologists frequently consult case reports and literature when confronted with unfamiliar findings. Following this line of reasoning, we introduce RADAR, Retrieval Augmented Diagnostic Reasoning Agents, an agentic system for rare disease detection in brain MRI. Our approach uses AI agents with access to external medical knowledge by embedding both case reports and literature using sentence transformers and indexing them with FAISS to enable efficient similarity search. The agent retrieves clinically relevant evidence to guide diagnostic decision making on unseen diseases, without the need for additional training. Designed as a model-agnostic reasoning module, RADAR can be seamlessly integrated with diverse large language models, consistently improving their rare pathology recognition and interpretability. On the NOVA dataset comprising 280 distinct rare diseases, RADAR achieves up to a 10.2% performance gain, with the strongest improvements observed for open source models such as DeepSeek. Beyond accuracy, the retrieved examples provide interpretable, literature-grounded explanations, highlighting retrieval-augmented reasoning as a powerful paradigm for low-prevalence conditions in medical imaging.
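
Illustrative sketch (not from the paper): the retrieval step described above embeds documents and runs a nearest-neighbour search. To stay dependency-free, the Python below swaps in a toy bag-of-words embedding and a brute-force cosine search. In the paper's setup a sentence-transformer model would produce the vectors and a FAISS index would perform the search. The function names and the example texts are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence-transformer embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=2):
    """Brute-force top-k search; a FAISS index would replace this at scale."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: -cosine(q, embed(doc)))
    return ranked[:k]
```

The retrieved passages would then be placed in the agent's prompt as evidence. No model weights change, which is consistent with the abstract's point that no additional training is needed.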

[14] Surprisal reveals diversity gaps in image captioning and different scorers change the story

Nikolai Ilinykh, Simon Dobnik

Main category: cs.CL

TL;DR: The paper introduces surprisal variance as a metric for linguistic diversity in image captioning, showing that human captions have twice the diversity of models when measured with caption-trained LMs, but this pattern reverses with general-language models.

Details

Motivation: To develop a robust method for quantifying linguistic diversity in image captioning that accounts for different language model scorers, addressing the limitation that conclusions about diversity can be completely inverted depending on the scorer used.

Method: Used surprisal variance (spread of token-level negative log-probabilities) to measure diversity, compared five state-of-the-art vision-and-language LLMs with human captions on MSCOCO test set, and evaluated with both caption-trained n-gram LM and general-language model.

Result: Human captions showed roughly twice the surprisal variance of models when measured with caption-trained LM, but rescoring with general-language model reversed this pattern, showing models had higher diversity than humans.

Conclusion: Relying on a single scorer can completely invert diversity conclusions, so robust diversity evaluation must report surprisal under several different scorers to avoid misleading results.

Abstract: We quantify linguistic diversity in image captioning with surprisal variance: the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces the surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; thus, robust diversity evaluation must report surprisal under several scorers.
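
Illustrative sketch (not from the paper): surprisal variance is the variance of token-level negative log-probabilities pooled over a caption set. The Python below computes it for any scorer passed in as a function from token to probability. The add-one-smoothed unigram scorer, whitespace tokenization, and base-2 logarithm are simplifying assumptions; the paper uses a caption-trained n-gram LM and a general-language model.

```python
import math
from collections import Counter
from statistics import pvariance

def train_unigram(corpus):
    """Fit an add-one-smoothed unigram scorer on whitespace-tokenized text."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def surprisal_variance(captions, prob):
    """Population variance of token-level surprisal (-log2 p) over a caption set."""
    surprisals = [-math.log2(prob(tok)) for cap in captions for tok in cap.split()]
    return pvariance(surprisals)
```

Because `prob` is a parameter, the same captions can be rescored under different language models. For example, a scorer that assigns every token the same probability yields zero variance on any caption set, while a scorer fitted to skewed token counts does not. This is a small-scale version of the scorer dependence the abstract reports.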

[15] Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

Chenxi Liu, Junjie Liang, Yuqi Jia, Bochuan Cao, Yang Bai, Heng Huang, Xun Chen

Main category: cs.CL

TL;DR: ERPO framework addresses the issue of residual prompts in RLVR training by encouraging exploration on prompts with zero variance rewards, improving training diversity and effectiveness.

Details

Motivation: As LLMs train longer and scale larger, more training prompts become residual prompts with zero variance rewards, reducing training diversity and effectiveness.

Method: ERPO maintains a history tracker for each prompt and adaptively increases sampling temperature for residual prompts that previously produced all correct responses, encouraging diverse reasoning traces.

Result: Empirical results on Qwen2.5 series show ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.

Conclusion: ERPO effectively exploits residual prompts by reactivating their training signals through exploration, leading to improved performance in RLVR training.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive training signals. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
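
Illustrative sketch (not from the paper): the mechanism described above has two parts. A group whose rewards are all identical yields zero advantages, hence no training signal, and a per-prompt history tracker raises the sampling temperature for prompts that previously produced all-correct responses. The Python below sketches both. The class name, step size, temperature cap, and reset rule are hypothetical, and the advantage function is mean-centred only (the standard-deviation normalization used in GRPO is omitted).

```python
from dataclasses import dataclass, field

@dataclass
class PromptHistory:
    """Tracks consecutive all-correct rollout groups per prompt."""
    base_temp: float = 1.0
    step: float = 0.1       # hypothetical increment per all-correct round
    max_temp: float = 1.5   # hypothetical cap
    streak: dict = field(default_factory=dict)

    def update(self, prompt_id, rewards):
        """Record one rollout group of binary rewards for a prompt."""
        if all(r == 1 for r in rewards):
            self.streak[prompt_id] = self.streak.get(prompt_id, 0) + 1
        else:
            self.streak[prompt_id] = 0  # assumed reset once errors reappear

    def temperature(self, prompt_id):
        """Sampling temperature to use for this prompt on its next rollout."""
        n = self.streak.get(prompt_id, 0)
        return min(self.base_temp + self.step * n, self.max_temp)

def group_advantages(rewards):
    """Mean-centred advantages; all zero when every reward in the group is equal."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

A prompt with rewards [1, 1, 1, 1] has advantages of all zeros, so it contributes nothing to the policy gradient. Raising its temperature on the next visit makes an incorrect sample more likely, which restores non-zero advantages.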

[16] Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, Sinead Williamson

Main category: cs.CL

TL;DR: Base LLMs exhibit semantic calibration in open-domain QA tasks despite not being explicitly trained for confidence estimation, but RL instruction-tuning and chain-of-thought reasoning break this calibration.

Details

Motivation: LLMs lack meaningful confidence estimates for their outputs, and it's unclear whether they can assess confidence in the actual meaning of responses beyond token-level calibration.

Method: Used sampling-based notion of semantic calibration and theoretical analysis connecting calibration with local loss optimality through B-calibration framework.

Result: Base LLMs are remarkably well-calibrated for semantic confidence assessment, but RL instruction-tuning and chain-of-thought reasoning systematically break this calibration.

Conclusion: The work provides the first principled explanation of when and why semantic calibration emerges in LLMs, showing it arises as a byproduct of next-token prediction when models can easily predict their own distribution over semantic answer classes.

Abstract: Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of “B-calibration,” which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
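
Illustrative sketch (not from the paper): one common sampling-based way to obtain a semantic confidence is to draw several responses, collapse them into answer classes, and take the share of the largest class. Calibration can then be checked with a binned expected calibration error. The Python below follows that reading. The string-normalization `canonicalize` default is a crude stand-in for a real semantic-equivalence judgment, and both function names are hypothetical.

```python
from collections import Counter

def semantic_confidence(samples, canonicalize=lambda s: s.strip().lower()):
    """Return (majority answer class, fraction of samples in that class)."""
    classes = Counter(canonicalize(s) for s in samples)
    answer, count = classes.most_common(1)[0]
    return answer, count / len(samples)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(o for _, o in b) / len(b)
            ece += len(b) / total * abs(acc - avg_conf)
    return ece
```

Under this definition, a model is semantically calibrated when questions answered with confidence around 0.75 are correct about 75% of the time, which gives an ECE near zero.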

[17] Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

Matthew Bozoukov, Matthew Nguyen, Shubkarman Singh, Bart Bussmann, Patrick Leask

Main category: cs.CL

TL;DR: LLMs can develop behavioral self-awareness through minimal fine-tuning, emerging as domain-specific linear features that can be easily induced and controlled.

Details

Motivation: To understand the safety implications of LLMs' ability to describe their own learned behaviors, which could allow models to conceal true capabilities during evaluation.

Method: Controlled fine-tuning experiments using low-rank adapters (LoRA) on instruction-tuned LLMs, specifically testing with single rank-1 LoRA adapters.

Result: Self-awareness can be reliably induced with minimal parameters, captured by single steering vectors in activation space, and is non-universal with independent representations across different tasks.

Conclusion: Behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated, raising important safety considerations for LLM development.

Abstract: Recent studies have revealed that LLMs can exhibit behavioral self-awareness: the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns as it may, for example, allow models to better conceal their true abilities during evaluation. We attempt to characterize the minimal conditions under which such self-awareness emerges, and the mechanistic processes through which it manifests. Through controlled finetuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; (2) that the learned self-aware behavior can be largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune’s behavioral effect; and (3) that self-awareness is non-universal and domain-localized, with independent representations across tasks. Together, these findings suggest that behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated.

[18] SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

Jaehoon Lee, Sohyun Kim, Wanggeun Park, Geon Lee, Seungkyung Kim, Minyoung Lee

Main category: cs.CL

TL;DR: SDS KoPub VDR is the first large-scale benchmark for Korean public document retrieval, featuring 361 real-world documents (40,781 pages) with complex visual elements and 600 human-verified query-page-answer triples across six public domains.

Details

Motivation: Existing VDR benchmarks overlook non-English languages and structural complexity of official publications, creating a critical gap for evaluating document retrieval systems in real-world multilingual contexts.

Method: Built a corpus of 361 Korean public documents with complex visual elements, created 600 query-page-answer triples using multimodal models followed by rigorous human verification, and categorized queries by reasoning modality (text-based, visual-based, cross-modal).

Result: Evaluation on text-only and multimodal retrieval tasks revealed substantial performance gaps, particularly in multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art models.

Conclusion: SDS KoPub VDR enables rigorous evaluation across textual and multimodal retrieval tasks and provides a roadmap for advancing multimodal AI in complex document intelligence applications.

Abstract: Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. To address this critical gap, we introduce SDS KoPub VDR, the first large-scale, publicly available benchmark for retrieving and understanding Korean public documents. The benchmark is built upon a corpus of 361 real-world documents (40,781 pages), including 256 files under the KOGL Type 1 license and 105 from official legal portals, capturing complex visual elements like tables, charts, and multi-column layouts. To establish a challenging and reliable evaluation set, we constructed 600 query-page-answer triples. These were initially generated using multimodal models (e.g., GPT-4o) and subsequently underwent a rigorous human verification and refinement process to ensure factual accuracy and contextual relevance. The queries span six major public domains and are systematically categorized by the reasoning modality required: text-based, visual-based (e.g., chart interpretation), and cross-modal. We evaluate SDS KoPub VDR on two complementary tasks that reflect distinct retrieval paradigms: (1) text-only retrieval, which measures a model’s ability to locate relevant document pages based solely on textual signals, and (2) multimodal retrieval, which assesses retrieval performance when visual features (e.g., tables, charts, and layouts) are jointly leveraged alongside text. This dual-task evaluation reveals substantial performance gaps, particularly in multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art models. As a foundational resource, SDS KoPub VDR not only enables rigorous and fine-grained evaluation across textual and multimodal retrieval tasks but also provides a clear roadmap for advancing multimodal AI in complex, real-world document intelligence.

[19] BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models

Chandra Vamsi Krishna Alla, Harish Naidu Gaddam, Manohar Kommi

Main category: cs.CL

TL;DR: BudgetMem is a memory-augmented architecture that selectively stores information under budget constraints, achieving near-baseline performance with 72.4% memory savings for long document processing.

Details

Motivation: LLMs face computational and memory constraints when processing long contexts, making current approaches prohibitively expensive for resource-constrained deployments despite growing demand for long-context applications.

Method: Combines selective memory policies with feature-based salience scoring (entity density, TF-IDF, discourse markers, position bias) and learned gating mechanisms coupled with BM25 sparse retrieval for efficient information access.

Result: Achieves only 1.0% F1 score degradation while saving 72.4% memory compared to baseline RAG on long documents (5K-10K tokens), with benefits increasing with document length.

Conclusion: Provides a practical pathway for deploying capable long context systems on modest hardware, democratizing access to advanced language understanding capabilities.

Abstract: Large Language Models (LLMs) face significant computational and memory constraints when processing long contexts, despite growing demand for applications requiring reasoning over extensive documents, multi-session dialogues, and book length texts. While recent advances have extended context windows to 100K-1M tokens, such approaches incur prohibitive costs for resource constrained deployments. We propose BudgetMem, a novel memory augmented architecture that learns what to remember rather than remembering everything. Our system combines selective memory policies with feature based salience scoring (entity density, TF-IDF, discourse markers, position bias) to decide which information merits storage under strict budget constraints. Unlike existing retrieval augmented generation (RAG) systems that store all chunks, BudgetMem employs learned gating mechanisms coupled with BM25 sparse retrieval for efficient information access. Through comprehensive experiments on 700 question answer pairs across short (237 tokens) and long (5K-10K tokens) documents with Llama-3.2-3B-Instruct, we demonstrate that BudgetMem achieves remarkable results on long documents: only 1.0% F1 score degradation while saving 72.4% memory compared to baseline RAG. We validate our approach through budget sensitivity analysis (testing 7 budget ratios), naive baseline comparisons, and document length analysis, showing that BudgetMem’s benefits increase with document length. Our work provides a practical pathway for deploying capable long context systems on modest hardware, democratizing access to advanced language understanding capabilities.
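
Illustrative sketch (not from the paper): the core idea above is to score each chunk for salience and store only the top fraction allowed by the budget. The Python below uses three of the named features with hand-picked weights as a stand-in for the learned gate. The capitalized-token proxy for entity density, the marker list, the weights, and the function names are all assumptions; TF-IDF and the BM25 retrieval stage are omitted for brevity.

```python
DISCOURSE_MARKERS = {"however", "therefore", "importantly", "in summary"}

def salience(chunk, position, n_chunks):
    """Hand-weighted stand-in for a learned gate; the weights are illustrative."""
    tokens = chunk.split()
    if not tokens:
        return 0.0
    # Crude entity-density proxy: share of tokens starting with a capital letter.
    entity_density = sum(t[0].isupper() for t in tokens) / len(tokens)
    has_marker = any(m in chunk.lower() for m in DISCOURSE_MARKERS)
    # Position bias: favour the first and last chunk of the document.
    position_bias = 1.0 if position in (0, n_chunks - 1) else 0.0
    return 0.5 * entity_density + 0.3 * has_marker + 0.2 * position_bias

def select_under_budget(chunks, budget_ratio):
    """Keep the top-scoring chunks under the budget, preserving document order."""
    keep = max(1, int(len(chunks) * budget_ratio))
    by_score = sorted(range(len(chunks)),
                      key=lambda i: -salience(chunks[i], i, len(chunks)))
    return [chunks[i] for i in sorted(by_score[:keep])]
```

With a budget ratio of 0.4 on a five-chunk document, two chunks are stored and three are dropped, a 60% reduction in stored chunks. Retrieval at question time would then search only the stored subset.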

[20] AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Yu Li, Lehui Li, Qingmin Liao, Fengli Xu, Yong Li

Main category: cs.CL

TL;DR: A framework for baseline and dataset recommendation that uses collective perception from citation networks to improve experimental design automation, achieving significant performance gains over prior methods.

Details

Motivation: Prior methods for automated experiment design suffer from limited data coverage (missing datasets actually used in papers) and overreliance on content similarity that overlooks experimental suitability.

Method: 1) Automated pipeline linking 100K papers to their actual baselines/datasets; 2) Collective perception retriever using citation contexts; 3) Reasoning-augmented reranker with explicit reasoning chains and LLM-based justifications.

Result: Covers 85% of datasets/baselines from top AI conferences over 5 years. Outperforms strongest baseline with +5.85% Recall@20 and +8.30% HitRate@5.

Conclusion: The approach advances reliable, interpretable automation of experimental design through comprehensive data coverage and network-aware recommendations.

Abstract: Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have given rise to a surge of research interest in developing LLM agents to facilitate scientific inquiry. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from limited data coverage, as recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers, and from an overreliance on content similarity that biases models toward superficial similarity and overlooks experimental suitability. Harnessing collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20 and +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.
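
Illustrative sketch (not from the paper): the reported gains are in Recall@20 and HitRate@5. The Python below gives the standard definitions of both metrics for a single query. The function names and the example identifiers are hypothetical, and the paper may average or normalize differently.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return 1.0 if set(ranked[:k]) & set(relevant) else 0.0
```

For a ranking ["d1", "d2", "d3", "d4"] with relevant set {"d2", "d4"}, Recall@2 is 0.5 because one of two relevant items is in the top two, and HitRate@2 is 1.0 because at least one is. Corpus-level scores are the mean of these per-query values.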

[21] Diagnosing and Mitigating Semantic Inconsistencies in Wikidata’s Classification Hierarchy

Shixiong Zhao, Hideaki Takeda

Main category: cs.CL

TL;DR: A novel validation method is proposed to identify classification errors, over-generalized subclass links, and redundant connections in Wikidata’s taxonomy, along with an evaluation criterion for corrections and a user inspection system.

Details

Motivation: Wikidata's loose editorial policy has led to taxonomic inconsistencies despite its value as the largest open knowledge graph, necessitating systematic validation and correction methods.

Method: Proposes and applies a novel validation method to detect classification errors, over-generalized subclass links, and redundant connections, then develops an evaluation criterion and user inspection system.

Result: Confirms the presence of taxonomic inconsistencies in specific domains of Wikidata and provides tools for identifying and evaluating these issues.

Conclusion: The study successfully identifies taxonomic problems in Wikidata and creates a framework for leveraging crowdsourcing to address these inconsistencies through systematic validation and user participation.

Abstract: Wikidata is currently the largest open knowledge graph on the web, encompassing over 120 million entities. It integrates data from various domain-specific databases and imports a substantial amount of content from Wikipedia, while also allowing users to freely edit its content. This openness has positioned Wikidata as a central resource in knowledge graph research and has enabled convenient knowledge access for users worldwide. However, its relatively loose editorial policy has also led to a degree of taxonomic inconsistency. Building on prior work, this study proposes and applies a novel validation method to confirm the presence of classification errors, over-generalized subclass links, and redundant connections in specific domains of Wikidata. We further introduce a new evaluation criterion for determining whether such issues warrant correction and develop a system that allows users to inspect the taxonomic relationships of arbitrary Wikidata entities, leveraging the platform’s crowdsourced nature to its full potential.

[22] LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

Wei Shao, Lingchao Zheng, Pengyu Wang, Peizhen Zheng, Jun Li, Yuwei Fan

Main category: cs.CL

TL;DR: LoPT is a lossless parallel tokenization framework that accelerates long context inference while ensuring output identical to sequential tokenization, addressing boundary artifacts in existing parallel methods.

Details

Motivation: Long context inference introduces computational latency, and while prior work optimized operators and architectures, tokenization remains an overlooked bottleneck. Existing parallel tokenization methods suffer from inconsistent results due to boundary artifacts after merging.

Method: LoPT employs character-position-based matching and dynamic chunk length adjustment to align and merge tokenized segments accurately, ensuring lossless parallel tokenization.

Result: Extensive experiments across diverse long-text datasets demonstrate that LoPT achieves significant speedup while guaranteeing lossless tokenization, with theoretical proof of consistency and comprehensive analytical studies validating robustness.

Conclusion: LoPT successfully addresses the tokenization bottleneck in long context inference by providing a lossless parallel tokenization framework that maintains output consistency while improving computational efficiency.

Abstract: Long context inference scenarios have become increasingly important for large language models, yet they introduce significant computational latency. While prior research has optimized long-sequence inference through operators, model architectures, and system frameworks, tokenization remains an overlooked bottleneck. Existing parallel tokenization methods accelerate processing through text segmentation and multi-process tokenization, but they suffer from inconsistent results due to boundary artifacts that occur after merging. To address this, we propose LoPT, a novel Lossless Parallel Tokenization framework that ensures output identical to standard sequential tokenization. Our approach employs character-position-based matching and dynamic chunk length adjustment to align and merge tokenized segments accurately. Extensive experiments across diverse long-text datasets demonstrate that LoPT achieves significant speedup while guaranteeing lossless tokenization. We also provide theoretical proof of consistency and comprehensive analytical studies to validate the robustness of our method.

[23] Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus

Main category: cs.CL

TL;DR: LLMs struggle to authentically role-play villainous characters due to safety alignment, showing monotonic decline in fidelity as character morality decreases.

Details

Motivation: To investigate how LLMs' safety alignment conflicts with their ability to portray morally ambiguous or villainous characters in creative generation tasks.

Method: Created Moral RolePlay benchmark with four-level moral alignment scale, tested state-of-the-art LLMs role-playing characters from moral paragons to pure villains.

Result: Models show consistent decline in role-playing fidelity with decreasing morality, struggle most with traits like “Deceitful” and “Manipulative”, and substitute nuanced malevolence with superficial aggression.

Conclusion: Safety alignment creates fundamental tension with creative fidelity, highlighting need for more nuanced, context-aware alignment methods.

Abstract: Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as “Deceitful” and “Manipulative”, often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.

[24] Acquiring Common Chinese Emotional Events Using Large Language Model

Ya Wang, Guangzheng Zhu, Cungen Cao, Jingjing Li, He Li, Xin Huang

Main category: cs.CL

TL;DR: This paper presents a method to automatically generate and filter common Chinese emotional events using LLMs, creating a large-scale knowledge base of 102,218 high-quality emotional events with sentiment labels.

Details

Motivation: Emotional events are important for applications but difficult to acquire, especially context-independent common emotional events in Chinese language.

Method: Collect emotional event indicators, prompt Chinese LLM to generate events, train filter to discard invalid results, and classify events as positive/negative using different techniques.

Result: Created a knowledge base of 102,218 high-quality common emotional events with sentiment polarity labels - the only large-scale commonsense knowledge base of emotional events in Chinese.

Conclusion: The method effectively acquires common Chinese emotional events and shows strong potential for emotion cause extraction applications.

Abstract: Knowledge about emotional events is an important kind of knowledge which has been applied to improve the effectiveness of different applications. However, emotional events cannot be easily acquired, especially common or generalized emotional events that are context-independent. The goal of this paper is to obtain common emotional events in Chinese language such as “win a prize” and “be criticized”. Our approach begins by collecting a comprehensive list of Chinese emotional event indicators. Then, we generate emotional events by prompting a Chinese large language model (LLM) using these indicators. To ensure the quality of these emotional events, we train a filter to discard invalid generated results. We also classify these emotional events as being positive events and negative events using different techniques. Finally, we harvest a total of 102,218 high-quality common emotional events with sentiment polarity labels, which is the only large-scale commonsense knowledge base of emotional events in Chinese language. Intrinsic evaluation results show that the proposed method in this paper can be effectively used to acquire common Chinese emotional events. An extrinsic use case also demonstrates the strong potential of common emotional events in the field of emotion cause extraction (ECE). Related resources including emotional event indicators and emotional events will be released after the publication of this paper.

[25] Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies

Prasoon Varshney, Makesh Narsimhan Sreedhar, Liwei Jiang, Traian Rebedea, Christopher Parisien

Main category: cs.CL

TL;DR: PBSUITE is a dynamic evaluation suite that assesses LLMs’ ability to adhere to diverse behavioral policies in multi-turn conversations, revealing that current models fail significantly under adversarial interactions despite good single-turn performance.

Details

Motivation: Real-world LLM applications occur in organizational contexts with specific policies and requirements, but current alignment focuses on universal safety principles rather than pluralistic adaptation to diverse user values and needs.

Method: Developed PBSUITE consisting of: (1) 300 realistic LLM behavioral policies across 30 industries, and (2) a dynamic evaluation framework for stress-testing model compliance with custom behavioral specifications under adversarial conditions.

Result: Leading LLMs show strong adherence in single-turn settings (<4% failure rates) but compliance weakens substantially in multi-turn adversarial interactions (up to 84% failure rates).

Conclusion: Current alignment and safety moderation methods are inadequate for coherently enforcing pluralistic behavioral policies in real-world LLM interactions, highlighting the need for robust, context-aware pluralistic alignment techniques.

Abstract: Large language models (LLMs) are typically aligned to a universal set of safety and usage principles intended for broad public acceptability. Yet, real-world applications of LLMs often take place within organizational ecosystems shaped by distinctive corporate policies, regulatory requirements, use cases, brand guidelines, and ethical commitments. This reality highlights the need for rigorous and comprehensive evaluation of LLMs with pluralistic alignment goals, an alignment paradigm that emphasizes adaptability to diverse user values and needs. In this work, we present PLURALISTIC BEHAVIOR SUITE (PBSUITE), a dynamic evaluation suite designed to systematically assess LLMs’ capacity to adhere to pluralistic alignment specifications in multi-turn, interactive conversations. PBSUITE consists of (1) a diverse dataset of 300 realistic LLM behavioral policies, grounded in 30 industries; and (2) a dynamic evaluation framework for stress-testing model compliance with custom behavioral specifications under adversarial conditions. Using PBSUITE, we find that leading open- and closed-source LLMs maintain robust adherence to behavioral policies in single-turn settings (less than 4% failure rates), but their compliance weakens substantially in multi-turn adversarial interactions (up to 84% failure rates). These findings highlight that existing model alignment and safety moderation methods fall short in coherently enforcing pluralistic behavioral policies in real-world LLM interactions. Our work contributes both the dataset and analytical framework to support future research toward robust and context-aware pluralistic alignment techniques.

[26] UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian

Mykyta Syromiatnikov, Victoria Ruvinskaya

Main category: cs.CL

TL;DR: UA-Code-Bench is a new benchmark for evaluating code generation in Ukrainian using 500 competitive programming problems, showing even top models solve only half the problems.

Details

Motivation: Existing benchmarks focus on English or simple tasks, lacking comprehensive evaluation for low-resource languages like Ukrainian.

Method: Used 500 Eolymp problems across 5 difficulty levels, evaluated 13 models with one-shot Python code generation via hidden tests in dedicated environment.

Result: Even top models such as OpenAI o3 and GPT-5 solved only half of the problems. The paper also analyzes performance across difficulty levels, solution uniqueness, and computational efficiency (elapsed time and memory consumption).

Conclusion: Competitive programming benchmarks are valuable for evaluating LLMs in underrepresented languages, paving way for multilingual code generation research.

Abstract: Evaluating the real capabilities of large language models in low-resource languages still represents a challenge, as many existing benchmarks focus on widespread tasks translated from English or evaluate only simple language understanding. This paper introduces UA-Code-Bench, a new open-source benchmark established for a thorough evaluation of language models’ code generation and competitive programming problem-solving abilities in Ukrainian. The benchmark comprises 500 problems from the Eolymp platform, evenly distributed across five complexity levels from very easy to very hard. A diverse set of 13 leading proprietary and open-source models, generating Python solutions based on a one-shot prompt, was evaluated via the dedicated Eolymp environment against hidden tests, ensuring code correctness. The obtained results reveal that even top-performing models, such as OpenAI o3 and GPT-5, solve only half of the problems, highlighting the challenge of code generation in low-resource natural language. Furthermore, this research presents a comprehensive analysis of performance across various difficulty levels, as well as an assessment of solution uniqueness and computational efficiency, measured by both elapsed time and memory consumption of the generated solutions. In conclusion, this work demonstrates the value of competitive programming benchmarks in evaluating large language models, especially in underrepresented languages. It also paves the way for future research on multilingual code generation and reasoning-enhanced models. The benchmark, data parsing, preparation, code generation, and evaluation scripts are available at https://huggingface.co/datasets/NLPForUA/ua-code-bench.

[27] Order-Level Attention Similarity Across Language Models: A Latent Commonality

Jinglin Liang, Jin Zhong, Shuangping Huang, Yunqing Hu, Huiyuan Zhang, Huifang Li, Lixin Fan, Hanlin Gu

Main category: cs.CL

TL;DR: The paper reveals that Language Models share common context aggregation patterns through Order-Level Attention (OLA), enabling cross-model knowledge transfer via a training-free adapter method called TOA.

Details

Motivation: To systematically analyze commonalities in context aggregation patterns across multiple Language Models, which previous works lacked by focusing on individual models or attention heads.

Method: Introduces Order-Level Attention (OLA) from attention rollout decomposition, discovers OLA-syntax mapping, and proposes Transferable OLA Adapter (TOA) - a training-free cross-LM adapter that uses OLA as unified syntactic features.

Result: OLA at the same order across LMs shows significant similarities, and TOA effectively enhances performance of unseen LMs without parameter updates through cross-LM generalization.

Conclusion: The discovered OLA commonalities enable effective cross-model knowledge transfer, providing new insights into LM understanding and facilitating model interoperability.

Abstract: In this paper, we explore an important yet previously neglected question: Do context aggregation patterns across Language Models (LMs) share commonalities? While some works have investigated context aggregation or attention weights in LMs, they typically focus on individual models or attention heads, lacking a systematic analysis across multiple LMs to explore their commonalities. In contrast, we focus on the commonalities among LMs, which can deepen our understanding of LMs and even facilitate cross-model knowledge transfer. In this work, we introduce the Order-Level Attention (OLA) derived from the order-wise decomposition of Attention Rollout and reveal that the OLA at the same order across LMs exhibits significant similarities. Furthermore, we discover an implicit mapping between OLA and syntactic knowledge. Based on these two findings, we propose the Transferable OLA Adapter (TOA), a training-free cross-LM adapter transfer method. Specifically, we treat the OLA as a unified syntactic feature representation and train an adapter that takes OLA as input. Due to the similarities in OLA across LMs, the adapter generalizes to unseen LMs without requiring any parameter updates. Extensive experiments demonstrate that TOA’s cross-LM generalization effectively enhances the performance of unseen LMs. Code is available at https://github.com/jinglin-liang/OLAS.
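
The "order-wise decomposition of Attention Rollout" mentioned in the abstract can be illustrated with a small sketch. It assumes the common rollout form, a product of (I + A_l) over layers, and groups the expanded product by how many attention matrices each term contains. This is one plausible reading only; the paper's exact definition and normalization may differ. The function names are hypothetical, and each A_l is assumed to be a row-stochastic attention matrix already averaged over heads.

```python
import math
from typing import List

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def order_terms(attentions: List[Matrix]) -> List[Matrix]:
    """Expand prod_l (I + A_l) and return T_0..T_N, where T_k sums every
    product that contains exactly k attention matrices (assumed definition)."""
    n = len(attentions[0])
    zero = [[0.0] * n for _ in range(n)]
    terms = [identity(n)]  # the empty product only has an order-0 term
    for a in attentions:  # layers from first to last; each new layer multiplies on the left
        new_terms = []
        for k in range(len(terms) + 1):
            keep = terms[k] if k < len(terms) else zero      # identity branch of (I + A)
            grow = matmul(a, terms[k - 1]) if k >= 1 else zero  # attention branch raises the order
            new_terms.append(add(keep, grow))
        terms = new_terms
    return terms


def normalized_order(terms: List[Matrix], k: int) -> Matrix:
    """Divide T_k by the number of layer subsets of size k so rows sum to 1."""
    count = math.comb(len(terms) - 1, k)
    return [[v / count for v in row] for row in terms[k]]


# Toy two-layer, two-token example (values are made up).
A1 = [[1.0, 0.0], [0.5, 0.5]]
A2 = [[0.5, 0.5], [0.0, 1.0]]
terms = order_terms([A1, A2])
```

Under this reading, the order-k matrix of one model could be compared with the order-k matrix of another model on the same input, which is the kind of cross-LM similarity the abstract describes.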

Manan Sharma, Arya Suneesh, Manish Jain, Pawan Kumar Rajpoot, Prasanna Devadiga, Bharatdeep Hazarika, Ashish Shrivastava, Kishan Gurumurthy, Anshuman B Suresh, Aditya U Baliga

Main category: cs.CL

TL;DR: A multilingual claim normalization system that transforms noisy social media posts into verifiable statements across 20 languages using systematic decomposition and achieves strong cross-lingual transfer despite English-only training.

Details

Motivation: To address the challenge of multilingual misinformation detection by creating a system that can normalize noisy social media claims into clear, verifiable statements across multiple languages, enabling better fact-checking and misinformation detection.

Method: Systematic decomposition using Who, What, Where, When, Why and How questions; finetuning Qwen3-14B with LoRA; intra-post deduplication; token-level recall filtering for semantic alignment; retrieval-augmented few-shot learning with contextual examples during inference.

Result: Achieved METEOR scores from 41.16 (English) to 15.21 (Marathi); ranked third on English leaderboard and fourth for Dutch and Punjabi; 41.3% relative improvement in METEOR over baseline; effective cross-lingual generalization for Romance and Germanic languages.

Conclusion: The approach successfully demonstrates robust cross-lingual transfer capabilities for claim normalization, maintaining semantic coherence across diverse linguistic structures despite training exclusively on English data.

Abstract: We address claim normalization for multilingual misinformation detection - transforming noisy social media posts into clear, verifiable statements across 20 languages. The key contribution demonstrates how systematic decomposition of posts using Who, What, Where, When, Why and How questions enables robust cross-lingual transfer despite training exclusively on English data. Our methodology incorporates finetuning Qwen3-14B using LoRA with the provided dataset after intra-post deduplication, token-level recall filtering for semantic alignment and retrieval-augmented few-shot learning with contextual examples during inference. Our system achieves METEOR scores ranging from 41.16 (English) to 15.21 (Marathi), securing third rank on the English leaderboard and fourth rank for Dutch and Punjabi. The approach shows 41.3% relative improvement in METEOR over baseline configurations and substantial gains over existing methods. Results demonstrate effective cross-lingual generalization for Romance and Germanic languages while maintaining semantic coherence across diverse linguistic structures.
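
The abstract does not spell out how "token-level recall filtering" works. As a rough sketch, one could keep a training pair only when enough tokens of the normalized claim also occur in the source post. The direction of the recall, the whitespace tokenization, the 0.5 threshold, and the function names below are all assumptions for illustration, not details taken from the paper.

```python
from typing import List, Tuple


def token_recall(post: str, claim: str) -> float:
    """Fraction of claim tokens that also appear in the post (assumed direction)."""
    post_tokens = set(post.lower().split())
    claim_tokens = claim.lower().split()
    if not claim_tokens:
        return 0.0
    hits = sum(1 for token in claim_tokens if token in post_tokens)
    return hits / len(claim_tokens)


def filter_pairs(pairs: List[Tuple[str, str]], threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Keep (post, claim) pairs whose recall meets the hypothetical threshold."""
    return [(post, claim) for post, claim in pairs if token_recall(post, claim) >= threshold]


# Made-up examples: the first claim is grounded in its post, the second is not.
pairs = [
    ("the bridge collapsed in the city today", "bridge collapsed today"),
    ("hello world", "goodbye moon"),
]
kept = filter_pairs(pairs)
```

A filter like this would discard pairs where the reference claim shares little vocabulary with the post, which is one way to read "semantic alignment" at the token level.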

[29] On Text Simplification Metrics and General-Purpose LLMs for Accessible Health Information, and A Potential Architectural Advantage of The Instruction-Tuned LLM class

P. Bilha Githinji, Aikaterini Meilliou, Peiwu Qin

Main category: cs.CL

TL;DR: Instruction-tuned Mistral 24B outperforms reasoning-augmented QWen2.5 32B in biomedical text simplification, achieving better balance between readability (SARI: 42.46) and discourse fidelity (BERTScore: 0.91).

Details

Motivation: Address the need for scalable solutions to adapt complex scientific documents into plain language for public health information consumption, while resolving the tension between readability optimization and discourse fidelity preservation.

Method: Comparative analysis of two LLM classes: instruction-tuned Mistral 24B and reasoning-augmented QWen2.5 32B, evaluated against human benchmarks using 21 metrics spanning readability, discourse fidelity, content safety, and distributional measures.

Result: Mistral exhibits tempered lexical simplification strategy with enhanced readability (SARI: 42.46) and preserved human-level discourse (BERTScore: 0.91), while QWen shows disconnect in balancing readability and accuracy (BERTScore: 0.89). Strong functional redundancies found among five readability indices.

Conclusion: Instruction-tuned Mistral 24B is identified as superior for text simplification, with lexical support identified as primary domain-adaptation issue, providing baseline performance tracking and metric selection heuristics for evolving LLMs.

Abstract: The increasing health-seeking behavior and digital consumption of biomedical information by the general public necessitate scalable solutions for automatically adapting complex scientific and technical documents into plain language. Automatic text simplification solutions, including advanced large language models, however, continue to face challenges in reliably arbitrating the tension between optimizing readability performance and ensuring preservation of discourse fidelity. This report empirically assesses the performance of two major classes of general-purpose LLMs, demonstrating their linguistic capabilities and foundational readiness for the task compared to a human benchmark. Using a comparative analysis of the instruction-tuned Mistral 24B and the reasoning-augmented QWen2.5 32B, we identify a potential architectural advantage in the instruction-tuned LLM. Mistral exhibits a tempered lexical simplification strategy that enhances readability across a suite of metrics and the simplification-specific formula SARI (mean 42.46), while preserving human-level discourse with a BERTScore of 0.91. QWen also attains enhanced readability performance, but its operational strategy shows a disconnect in balancing between readability and accuracy, reaching a statistically significantly lower BERTScore of 0.89. Additionally, a comprehensive correlation analysis of 21 metrics spanning readability, discourse fidelity, content safety, and underlying distributional measures for mechanistic insights, confirms strong functional redundancies among five readability indices. This empirical evidence tracks baseline performance of the evolving LLMs for the task of text simplification, identifies the instruction-tuned Mistral 24B for simplification, provides necessary heuristics for metric selection, and points to lexical support as a primary domain-adaptation issue for simplification.

[30] Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

Grigory Kovalev, Mikhail Tikhomirov

Main category: cs.CL

TL;DR: An improved distillation method for LLMs that iteratively evaluates layer importance to shrink the model while maintaining performance, reducing Qwen2.5-3B from 36 to 28 layers (2.47B params) with only a 9.7% quality loss.

Details

Motivation: To develop compact LLMs that preserve high performance for deployment in resource-limited settings, addressing the need for efficient models without significant quality degradation.

Method: Iterative evaluation of layer importance by measuring performance degradation when removing individual layers, combined with training using joint loss function (KL divergence + mean squared error). Builds upon ShortGPT approach.

Result: Reduced Qwen2.5-3B from 36 to 28 layers (2.47B params) with 9.7% quality loss, and to 24 layers with 18% loss. Found middle transformer layers contribute less to inference.

Conclusion: The method effectively creates efficient models through iterative distillation and fine-tuning, demonstrating potential for resource-limited deployment while maintaining acceptable performance levels.

Abstract: This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach has been developed, building upon the idea of incorporating iterative evaluation of layer importance. At each step, importance is assessed by measuring performance degradation when individual layers are removed, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The findings suggest that the middle transformer layers contribute less to inference, underscoring the potential of the proposed method for creating efficient models. The results demonstrate the effectiveness of iterative distillation and fine-tuning, making the approach suitable for deployment in resource-limited settings.
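
The two ingredients named in the abstract, iterative layer-importance evaluation and a joint KL plus MSE loss, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The `evaluate` callback, the equal weighting `alpha`, and the choice to apply MSE to hidden states are assumptions, and the retraining between pruning rounds is omitted.

```python
import math
from typing import Callable, List, Sequence


def iterative_layer_pruning(
    layers: Sequence[int],
    evaluate: Callable[[List[int]], float],
    target_count: int,
) -> List[int]:
    """Repeatedly drop the layer whose removal hurts the score least,
    re-evaluating all remaining layers in every round."""
    kept = list(layers)
    while len(kept) > target_count:
        # Score each candidate model that has exactly one layer removed.
        scores = {
            layer: evaluate([l for l in kept if l != layer])
            for layer in kept
        }
        # Least important layer: its removal leaves the highest score.
        least_important = max(scores, key=scores.get)
        kept.remove(least_important)
        # In the described method, further training would happen here.
    return kept


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    teacher_hidden: Sequence[float],
    student_hidden: Sequence[float],
    alpha: float = 0.5,  # hypothetical weighting between the two terms
) -> float:
    """alpha * KL(teacher || student) + (1 - alpha) * MSE(hidden states)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    mse = sum((t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)) / len(teacher_hidden)
    return alpha * kl + (1 - alpha) * mse


# Toy stand-in for a benchmark: each layer contributes a fixed, made-up amount.
toy_importance = {0: 5.0, 1: 0.5, 2: 0.2, 3: 4.0}


def toy_evaluate(kept_layers: List[int]) -> float:
    return sum(toy_importance[l] for l in kept_layers)


pruned = iterative_layer_pruning([0, 1, 2, 3], toy_evaluate, target_count=2)
```

In the toy run the two low-contribution middle layers are removed first, which mirrors the abstract's observation that middle transformer layers contribute less.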

[31] A Toolbox for Improving Evolutionary Prompt Search

Daniel Grießhaber, Maximilian Kimmich, Johannes Maucher, Ngoc Thang Vu

Main category: cs.CL

TL;DR: Improved evolutionary prompt optimization with decomposition, LLM-based verification, human feedback integration, and efficient evaluation strategies.

Details

Motivation: Existing evolutionary prompt optimization approaches lack robust operators and efficient evaluation mechanisms.

Method: Decompose evolution into distinct steps, introduce LLM-based judge for verification, integrate human feedback to refine operators, and develop efficient evaluation strategies.

Result: Approach improves both optimization quality and efficiency in prompt optimization.

Conclusion: The proposed improvements can partially generalize to prompt optimization in general, and code is released to facilitate further research.

Abstract: Evolutionary prompt optimization has demonstrated effectiveness in refining prompts for LLMs. However, existing approaches lack robust operators and efficient evaluation mechanisms. In this work, we propose several key improvements to evolutionary prompt optimization that can partially generalize to prompt optimization in general: 1) decomposing evolution into distinct steps to enhance the evolution and its control, 2) introducing an LLM-based judge to verify the evolutions, 3) integrating human feedback to refine the evolutionary operator, and 4) developing more efficient evaluation strategies that maintain performance while reducing computational overhead. Our approach improves both optimization quality and efficiency. We release our code, enabling prompt optimization on new tasks and facilitating further research in this area.

[32] ManufactuBERT: Efficient Continual Pretraining for Manufacturing

Robin Armingaud, Romaric Besançon

Main category: cs.CL

TL;DR: ManufactuBERT is a RoBERTa model continually pretrained on a deduplicated manufacturing corpus, achieving state-of-the-art performance on manufacturing NLP tasks with 33% faster convergence.

Details

Motivation: General-purpose Transformer encoders perform poorly in specialized domains like manufacturing due to lack of domain-specific terminology and semantics exposure.

Method: Created a comprehensive data processing pipeline with domain-specific filtering and multi-stage deduplication, then continually pretrained RoBERTa on the curated manufacturing corpus.

Result: ManufactuBERT establishes new state-of-the-art on manufacturing NLP tasks and shows 33% reduction in training time/computational cost compared to non-deduplicated dataset.

Conclusion: The pipeline provides a reproducible example for developing high-performing encoders in specialized domains, with model and corpus released publicly.

Abstract: While large general-purpose Transformer-based encoders excel at general language understanding, their performance diminishes in specialized domains like manufacturing due to a lack of exposure to domain-specific terminology and semantics. In this paper, we address this gap by introducing ManufactuBERT, a RoBERTa model continually pretrained on a large-scale corpus curated for the manufacturing domain. We present a comprehensive data processing pipeline to create this corpus from web data, involving an initial domain-specific filtering step followed by a multi-stage deduplication process that removes redundancies. Our experiments show that ManufactuBERT establishes a new state-of-the-art on a range of manufacturing-related NLP tasks, outperforming strong specialized baselines. More importantly, we demonstrate that training on our carefully deduplicated corpus significantly accelerates convergence, leading to a 33% reduction in training time and computational cost compared to training on the non-deduplicated dataset. The proposed pipeline offers a reproducible example for developing high-performing encoders in other specialized domains. We will release our model and curated corpus at https://huggingface.co/cea-list-ia.

[33] Mind the Gap… or Not? How Translation Errors and Evaluation Details Skew Multilingual Results

Jan-Thorsten Peter, David Vilar, Tobias Domhan, Dan Malkin, Markus Freitag

Main category: cs.CL

TL;DR: The paper reveals that reported performance gaps in multilingual math benchmarks are largely due to translation errors in datasets and inconsistent answer extraction methods, not actual model capability differences.

Details

Motivation: To investigate the true performance of LLMs across different languages in math tasks, challenging the commonly reported performance gaps between high-resource and low-resource languages.

Method: Analyzed the MGSM benchmark for translation errors, proposed automatic quality assurance methods for dataset validation, and provided recommendations for standardized answer extraction from LLM outputs.

Result: After correcting translation errors and standardizing answer extraction, the previously reported performance gap between languages mostly disappeared, showing similar capabilities across different languages.

Conclusion: The perceived language gap in LLM performance is largely an artifact of dataset quality issues rather than true model limitations, highlighting the need for better dataset validation and standardized evaluation protocols.

Abstract: Most current large language models (LLMs) support a wide variety of languages in addition to English, including high-resource languages (e.g. German, Chinese, French), as well as low-resource ones (e.g. Swahili, Telugu). In addition they have also shown impressive capabilities in different domains, like coding, science and math. In this short paper, taking math as an example domain, we study the performance of different LLMs across languages. Experimental results show that there exists a non-negligible and consistent gap in the performance of the models across languages. Interestingly, and somewhat against expectations, the gap exists for both high- and low-resource languages. We hope that these results influence further research into cross-lingual capability generalization for next generation LLMs. If it weren’t for the fact that they are false! By analyzing one of the standard multilingual math benchmarks (MGSM), we determine that several translation errors are present in the data. Furthermore, the lack of standardized answer extraction from LLM outputs further influences the final results. We propose a method for automatic quality assurance to address the first issue at scale, and give recommendations to address the second one. Combining these two approaches we show that the aforementioned language gap mostly disappears, leading to completely different conclusions from our research. We additionally release the corrected dataset to the community.
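
To make the answer-extraction issue concrete, here is a small sketch of one possible standardized rule for numeric math answers: take the last number in the model output and strip thousands separators. It is not the authors' recommendation, which the abstract does not detail. The function names, the comma-as-thousands-separator convention, and the tolerance are assumptions.

```python
import re
from typing import Optional

# A number with optional sign, optional comma separators, optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(output: str) -> Optional[float]:
    """Return the last number mentioned in the output, or None if there is none."""
    matches = _NUMBER.findall(output)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(output: str, gold: float, tol: float = 1e-6) -> bool:
    """Compare the extracted answer with the gold answer using a small tolerance."""
    predicted = extract_final_number(output)
    return predicted is not None and abs(predicted - gold) <= tol


example = "So 3 + 4 = 7. The answer is 1,234."
```

Applying the same rule to every language keeps formatting differences, such as separators or trailing punctuation, from being counted as wrong answers in some languages but not others.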

[34] Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

Cong-Thanh Do, Rama Doddipatla, Kate Knill

Main category: cs.CL

TL;DR: CoT prompting enhances white-box knowledge distillation, improving smaller LLMs’ reasoning on complex tasks.

Details

Motivation: To investigate how Chain-of-Thought (CoT) can effectively transfer reasoning capabilities from larger to smaller LLMs through knowledge distillation.

Method: Used white-box knowledge distillation with CoT data from CoT-Collection dataset on Qwen and Llama2 LLMs, evaluated on BIG-Bench-Hard tasks.

Result: CoT improved white-box KD effectiveness, enabling distilled models to achieve better average performance on natural language reasoning tasks.

Conclusion: CoT plays a crucial role in enhancing reasoning capability transfer from larger to smaller LLMs through knowledge distillation.

Abstract: Chain-of-Thought (CoT) prompting is a widely used method to improve the reasoning capability of Large Language Models (LLMs). More recently, CoT has been leveraged in Knowledge Distillation (KD) to transfer reasoning capability from a larger LLM to a smaller one. This paper examines the role of CoT in distilling the reasoning capability from larger LLMs to smaller LLMs using white-box KD, analysing its effectiveness in improving the performance of the distilled models for various natural language reasoning and understanding tasks. We conduct white-box KD experiments using LLMs from the Qwen and Llama2 families, employing CoT data from the CoT-Collection dataset. The distilled models are then evaluated on natural language reasoning and understanding tasks from the BIG-Bench-Hard (BBH) benchmark, which presents complex challenges for smaller LLMs. Experimental results demonstrate the role of CoT in improving white-box KD effectiveness, enabling the distilled models to achieve better average performance in natural language reasoning and understanding tasks from BBH.

[35] Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

Zilong Li, Jie Cao

Main category: cs.CL

TL;DR: This paper addresses the low-resource problem of translating classical Chinese to Japanese using character-level annotation methods, proposing an LLM-based pipeline and showing that auxiliary Chinese NLP tasks improve sequence tagging performance.

Details

Motivation: To solve the low-resource problem in translating classical Chinese to Japanese using traditional character annotation methods, and to adapt this ancient translation process to modern language technologies.

Method: Introduces an LLM-based annotation pipeline, constructs a new dataset from digitalized open-source translation data, and uses auxiliary Chinese NLP tasks to enhance sequence tagging training under low-resource conditions.

Result: Auxiliary Chinese NLP tasks improve sequence tagging performance in low-resource settings. LLMs achieve high scores in direct machine translation but struggle with character annotation tasks.

Conclusion: The proposed method can supplement LLMs by providing effective character-level annotation capabilities where LLMs are confused, offering a viable solution for classical Chinese-Japanese translation in low-resource scenarios.

Abstract: Ancient people translated classical Chinese into Japanese by annotating around each character. We abstract this process as sequence tagging tasks and fit them into modern language technologies. Research on this annotation and translation system faces a low-resource problem. We alleviate this problem by introducing an LLM-based annotation pipeline and constructing a new dataset from digitized open-source translation data. We show that under the low-resource setting, introducing auxiliary Chinese NLP tasks benefits the training of sequence tagging tasks. We also evaluate the performance of large language models. They achieve high scores in direct machine translation, but they become confused when asked to annotate characters. Our method could serve as a supplement to LLMs.

[36] Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Teqi Hao, Xioayu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu

Main category: cs.CL

TL;DR: RPO is a two-stage framework that decouples content generation from personalization, using a reflection module to rewrite generic responses to align with user preferences, outperforming existing methods.

Details

Motivation: Existing personalization approaches burden LLMs with both content generation and style alignment, leading to trade-offs that compromise output quality and control.

Method: Two-stage process: base model generates generic response, then reflection module rewrites it for personalization. Reflection module trained via supervised fine-tuning on rewriting trajectories and refined with reinforcement learning.

Result: RPO significantly outperforms state-of-the-art baselines on LaMP benchmark, demonstrating superiority of explicit response shaping over implicit context injection.

Conclusion: RPO provides an efficient, model-agnostic personalization layer that can be integrated with any base model, offering a new direction for user-centric generation.

Abstract: The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user’s preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.

[37] Listening Between the Lines: Decoding Podcast Narratives with Language Modeling

Shreya Gupta, Ojasva Saxena, Arghodeep Nandi, Sarah Masud, Kiran Garimella, Tanmoy Chakraborty

Main category: cs.CL

TL;DR: Fine-tuned BERT model for analyzing narrative frames in podcasts by linking frames to specific entities, enabling better understanding of discourse trends in conversational data.

Details

Motivation: Podcasts are important for shaping public opinion but their unscripted, conversational nature makes automated analysis challenging, as existing LLMs struggle with subtle narrative cues.

Method: Developed a fine-tuned BERT model that explicitly links narrative frames to specific entities mentioned in conversations, then correlates granular frame labels with high-level topics.

Result: The approach more closely aligns with human judgment for messy conversational data and reveals systematic relationships between topics and presentation frames.

Conclusion: Provides a more robust framework for studying influence in digital media by connecting what is being discussed with how it’s presented through narrative frames.

Abstract: Podcasts have become a central arena for shaping public opinion, making them a vital source for understanding contemporary discourse. Their typically unscripted, multi-themed, and conversational style offers a rich but complex form of data. To analyze how podcasts persuade and inform, we must examine their narrative structures – specifically, the narrative frames they employ. The fluid and conversational nature of podcasts presents a significant challenge for automated analysis. We show that existing large language models, typically trained on more structured text such as news articles, struggle to capture the subtle cues that human listeners rely on to identify narrative frames. As a result, current approaches fall short of accurately analyzing podcast narratives at scale. To solve this, we develop and evaluate a fine-tuned BERT model that explicitly links narrative frames to specific entities mentioned in the conversation, effectively grounding the abstract frame in concrete details. Our approach then uses these granular frame labels and correlates them with high-level topics to reveal broader discourse trends. The primary contributions of this paper are: (i) a novel frame-labeling methodology that more closely aligns with human judgment for messy, conversational data, and (ii) a new analysis that uncovers the systematic relationship between what is being discussed (the topic) and how it is being presented (the frame), offering a more robust framework for studying influence in digital media.

[38] What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions

Klára Bendová, Tomáš Knap, Jan Černý, Vojtěch Pour, Jaromir Savelka, Ivana Kvapilíková, Jakub Drápal

Main category: cs.CL

TL;DR: This paper shows that descriptions of criminal behavior can be extracted from Slovak court verdicts using advanced regular expressions and LLMs, which together identify a description in 99.5% of verdicts.

Details

Motivation: Criminal justice data lacks detailed offense information, but court verdicts contain extensive behavioral descriptions that remain unused in continental European systems.

Method: Used three approaches: baseline regular expressions, advanced regular expressions focusing on ‘sparing’ normalization, and LLM prompting with Gemini Flash 2.0 to extract descriptions from verdicts.

Result: The baseline identified descriptions in only 40.5% of verdicts, against 97% for advanced regular expressions, 98.75% for the LLM, and 99.5% for the two combined. In an evaluation by law students, both advanced methods matched human annotations in about 90% of cases, versus 34.5% for the baseline.

Conclusion: Both advanced regular expressions and LLMs are highly effective for extracting criminal behavior descriptions from court verdicts. The LLM performs slightly better on its own, and combining the two methods yields near-complete coverage.

Abstract: Criminal justice administrative data contain only a limited amount of information about the committed offense. However, there is an unused source of extensive information in continental European courts’ decisions: descriptions of criminal behaviors in verdicts by which offenders are found guilty. In this paper, we study the feasibility of extracting these descriptions from publicly available court decisions from Slovakia. We use two different approaches for retrieval: regular expressions and large language models (LLMs). Our baseline was a simple method employing regular expressions to identify typical words occurring before and after the description. The advanced regular expression approach further focused on “sparing” and its normalization (insertion of spaces between individual letters), typical for delineating the description. The LLM approach involved prompting the Gemini Flash 2.0 model to extract the descriptions using predefined instructions. Although the baseline identified descriptions in only 40.5% of verdicts, both methods significantly outperformed it, achieving 97% with advanced regular expressions and 98.75% with LLMs, and 99.5% when combined. Evaluation by law students showed that both advanced methods matched human annotations in about 90% of cases, compared to just 34.5% for the baseline. LLMs fully matched human-labeled descriptions in 91.75% of instances, and a combination of advanced regular expressions with LLMs reached 92%.
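The normalization step described above (undoing letter-spaced emphasis before matching marker words) can be illustrated with a minimal Python sketch. The marker phrases and the sample sentence are hypothetical English stand-ins, not the patterns or Slovak wording used in the paper, and the sketch assumes that letter-spaced words are separated from their neighbors by at least two spaces.

```python
import re
from typing import Optional

# A run of three or more single letters separated by single spaces,
# e.g. "g u i l t y". Isolated one-letter words such as "a" are untouched.
SPACED_RUN = re.compile(r"\b[^\W\d_](?: [^\W\d_]){2,}\b")


def collapse_spaced_letters(text: str) -> str:
    """Join letter-spaced words so ordinary patterns can match them."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


def extract_description(text: str, start_marker: str, end_marker: str) -> Optional[str]:
    """Return the text between two marker phrases, or None if not found.

    The markers are illustrative assumptions; real verdicts would need
    language-specific phrases.
    """
    normalized = " ".join(collapse_spaced_letters(text).split())
    pattern = re.compile(
        re.escape(start_marker) + r"\s+(.*?)\s+" + re.escape(end_marker),
        re.DOTALL,
    )
    match = pattern.search(normalized)
    return match.group(1).strip() if match else None
```

Without the collapsing step, a pattern looking for the word "guilty" would miss "g u i l t y" entirely, which is one plausible reason a plain marker-word baseline finds far fewer descriptions.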

[39] Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Firoj Ahmmed Patwary, Abdullah Al Noman

Main category: cs.CL

TL;DR: BengaliBPE is a specialized Byte Pair Encoding tokenizer for Bengali that uses Unicode normalization, grapheme-level initialization, and morphology-aware merge rules to improve tokenization for morphologically rich languages.

Details

Motivation: Current subword tokenizers like SentencePiece or HuggingFace BPE are designed for Latin/multilingual corpora and perform poorly on morphologically rich languages like Bengali, creating a need for language-specific tokenization.

Method: Developed BengaliBPE with Unicode normalization, grapheme-level initialization, and morphology-aware merge rules. Compared against Whitespace, SentencePiece BPE, and HuggingFace BPE on a Bengali news classification dataset.

Result: BengaliBPE provides the most detailed segmentation and best morphological interpretability, though with slightly higher computational cost. All methods performed reasonably well on classification accuracy.

Conclusion: Language-aware tokenization is crucial for morphologically rich scripts. BengaliBPE establishes a strong foundation for future Bengali NLP systems and large-scale pretraining of contextual language models.

Abstract: Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are mostly designed for Latin or multilingual corpora and do not perform well on languages with rich morphology such as Bengali. To address this limitation, we present BengaliBPE, a Byte Pair Encoding (BPE) tokenizer specifically developed for the Bengali script. BengaliBPE applies Unicode normalization, grapheme-level initialization, and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity. We use a large-scale Bengali news classification dataset to compare BengaliBPE with three baselines: Whitespace, SentencePiece BPE, and HuggingFace BPE. The evaluation considers tokenization granularity, encoding speed, and downstream classification accuracy. While all methods perform reasonably well, BengaliBPE provides the most detailed segmentation and the best morphological interpretability, albeit with slightly higher computational cost. These findings highlight the importance of language-aware tokenization for morphologically rich scripts and establish BengaliBPE as a strong foundation for future Bengali NLP systems, including large-scale pretraining of contextual language models.
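To make the first two ingredients concrete, here is a minimal Python sketch of Unicode normalization followed by grapheme-level initialization and a single plain BPE merge. It is not the BengaliBPE implementation: the grapheme rule is a rough approximation (an assumption) that attaches combining marks to the preceding character and lets the hasanta (U+09CD) pull in the next consonant, and the paper's morphology-aware merge rules are not modeled.

```python
import unicodedata
from collections import Counter

VIRAMA = "\u09cd"  # Bengali hasanta, joins consonants into conjuncts


def grapheme_units(word):
    """Split a word into approximate grapheme clusters after NFC normalization.

    Simplification (assumption): a character whose Unicode category starts
    with "M" is a mark and joins the previous unit; a character that follows
    a virama also joins the previous unit.
    """
    units = []
    for ch in unicodedata.normalize("NFC", word):
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = bool(units) and units[-1].endswith(VIRAMA)
        if units and (is_mark or after_virama):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def most_frequent_pair(corpus):
    """Return the most frequent adjacent symbol pair across all words."""
    pairs = Counter()
    for symbols in corpus:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair):
    """Apply one BPE merge: fuse every occurrence of `pair` in `symbols`."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged
```

Starting merges from grapheme units rather than code points or bytes means a vowel sign or conjunct can never be split from its base consonant, which is the kind of subword integrity the abstract refers to.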

[40] A multimodal multiplex of the mental lexicon for multilingual individuals

Maria Huynh, Wilder C. Rodrigues

Main category: cs.CL

TL;DR: This research explores how visual input affects language acquisition in multilingual individuals, specifically examining whether visual cues in translation tasks improve proficiency and accuracy compared to text-only conditions.

Details

Motivation: To understand how heritage languages influence additional language acquisition and investigate the structure of the mental lexicon in multilingual individuals, building on previous bilingualism research that shows multilinguals can outperform monolinguals in cognitive tasks.

Method: Uses a multilayer network approach with multimodality, extending Stella et al.’s multiplex model by adding a visual input layer that connects to lexical representations across multilingual layers of the mental lexicon, based on the Bilingual Interactive Activation (BIA+) framework.

Result: The abstract reports no results; the work is a research proposal rather than a completed study.

Conclusion: The study aims to provide insights into how visual input affects language processing in multilingual contexts and contribute to understanding the architecture of the bilingual/multilingual mental lexicon.

Abstract: Historically, bilingualism was often perceived as an additional cognitive load that could hinder linguistic and intellectual development. However, over the last three decades, this view has changed considerably. Numerous studies have aimed to model and understand the architecture of the bilingual word recognition system (Dijkstra and van Heuven, 2002), investigating how parallel activation operates in the brain and how one language influences another (Kroll et al., 2015). Increasingly, evidence suggests that multilinguals, individuals who speak three or more languages, can perform better than monolinguals in various linguistic and cognitive tasks, such as learning an additional language (Abu-Rabia and Sanitsky, 2010). This research proposal focuses on the study of the mental lexicon and how it may be structured in individuals who speak multiple languages. Building on the work of Stella et al. (2018), who investigated explosive learning in humans using a multiplex model of the mental lexicon, and the Bilingual Interactive Activation (BIA+) framework proposed by Dijkstra and van Heuven (2002), the present study applies the same multilayer network principles introduced by Kivela et al. (2014). Our experimental design extends previous research by incorporating multimodality into the multiplex model, introducing an additional layer that connects visual inputs to their corresponding lexical representations across the multilingual layers of the mental lexicon. In this research, we aim to explore how a heritage language influences the acquisition of another language. Specifically, we ask: Does the presence of visual input in a translation task influence participants’ proficiency and accuracy compared to text-only conditions?

[41] Large Language Models for Explainable Threat Intelligence

Tiago Dinis, Miguel Correia, Roger Tavares

Main category: cs.CL

TL;DR: RAGRecon uses LLMs with retrieval-augmented generation to provide explainable cybersecurity threat intelligence through knowledge graph visualizations.

Details

Motivation: Traditional security mechanisms struggle with complex cyber threats, while LLMs offer advanced text processing capabilities for cybersecurity applications.

Method: Proposed RAGRecon system combines LLMs with RAG to answer cybersecurity questions and generates visual knowledge graphs to explain AI reasoning.

Result: Experimental evaluation with 2 datasets and 7 LLMs showed responses matched reference responses over 91% of the time for best combinations.

Conclusion: RAGRecon successfully demonstrates explainable AI for cybersecurity threat intelligence with high accuracy and improved transparency.

Abstract: As cyber threats continue to grow in complexity, traditional security mechanisms struggle to keep up. Large language models (LLMs) offer significant potential in cybersecurity due to their advanced capabilities in text processing and generation. This paper explores the use of LLMs with retrieval-augmented generation (RAG) to obtain threat intelligence by combining real-time information retrieval with domain-specific data. The proposed system, RAGRecon, uses an LLM with RAG to answer questions about cybersecurity threats. Moreover, it makes this form of Artificial Intelligence (AI) explainable by generating and visually presenting to the user a knowledge graph for every reply. This increases the transparency and interpretability of the reasoning of the model, allowing analysts to better understand the connections made by the system based on the context recovered by the RAG system. We evaluated RAGRecon experimentally with two datasets and seven different LLMs, and the responses matched the reference responses more than 91% of the time for the best combinations.

[42] Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning

Yahui Fu, Zi Haur Pang, Tatsuya Kawahara

Main category: cs.CL

TL;DR: A unified framework for user satisfaction estimation that models both individual and group-level preferences through personalized reasoning chains and preference-aware clustering, integrated with adaptive reinforcement learning.

Details

Motivation: Existing dialogue alignment methods use one-size-fits-all approaches that overlook minority user perspectives and individual preferences, leading to biased satisfaction estimation.

Method: Proposes Chain-of-Personalized-Reasoning (CoPeR) for individual preferences, Majority-Minority Preference-Aware Clustering (M2PC) for group discovery, and PAda-PPO reinforcement learning framework for joint optimization.

Result: Experiments on Emotional Support Conversation dataset show consistent improvements in user satisfaction estimation, especially for underrepresented user groups.

Conclusion: The proposed framework effectively addresses the limitations of existing methods by capturing both individual and group preferences, leading to more accurate and fair satisfaction estimation across diverse user populations.

Abstract: User satisfaction in dialogue systems is inherently subjective. When the same response strategy is applied across users, minority users may assign different satisfaction ratings than majority users due to variations in individual intents and preferences. However, existing alignment methods typically train one-size-fits-all models that aim for broad consensus, often overlooking minority perspectives and user-specific adaptation. We propose a unified framework that models both individual- and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, we propose an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M2PC) algorithm that discovers distinct user groups in an unsupervised manner to learn group-level preferences. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset demonstrate consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.

[43] Steering Language Models with Weight Arithmetic

Constanza Fierro, Fabien Roger

Main category: cs.CL

TL;DR: Contrastive weight steering is a post-training method that edits LLM parameters using weight arithmetic to isolate and modify behavioral directions, enabling better generalization from narrow training data while preserving capabilities.

Details

Motivation: Providing high-quality feedback across diverse distributions is expensive, and narrow training data can cause unintended generalizations. Need methods to better leverage limited training data.

Method: Isolate behavior direction by subtracting weight deltas from two fine-tunes (one inducing desired behavior, one inducing opposite), then add/remove this direction to modify model weights.

Result: Weight steering generalizes better than activation steering, achieves stronger out-of-distribution control before degrading capabilities, and can mitigate behavioral drift while preserving task performance.

Conclusion: Weight steering enables effective behavioral control from narrow data, and weight direction similarity can potentially detect emergent misalignment during training.

Abstract: Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes – one that induces the desired behavior and another that induces its opposite – and then add or remove this direction to modify the model’s weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an “evil” weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
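The weight arithmetic described in the abstract can be written out in a few lines. The following Python sketch uses plain dictionaries of float lists as stand-ins for model parameters; the function names, the scaling factor `alpha`, and the toy values are illustrative assumptions rather than the authors' code.

```python
import math


def behavior_direction(theta_base, theta_pos, theta_neg):
    """Direction = (theta_pos - theta_base) - (theta_neg - theta_base).

    theta_pos is fine-tuned toward the behavior, theta_neg toward its opposite.
    """
    return {
        name: [
            (p - b) - (n - b)
            for b, p, n in zip(theta_base[name], theta_pos[name], theta_neg[name])
        ]
        for name in theta_base
    }


def steer(theta, direction, alpha):
    """Add (alpha > 0) or remove (alpha < 0) the behavior direction."""
    return {
        name: [w + alpha * d for w, d in zip(theta[name], direction[name])]
        for name in theta
    }


def cosine_similarity(update, direction):
    """Cosine between a fine-tuning update and a behavior direction."""
    u = [x for name in direction for x in update[name]]
    d = [x for name in direction for x in direction[name]]
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in d))
    return sum(a * b for a, b in zip(u, d)) / norm if norm else 0.0
```

Because the base weights cancel in the subtraction, the direction reduces to `theta_pos - theta_neg`. The cosine helper mirrors the monitoring idea at the end of the abstract: a fine-tuning update that points along an undesired direction would score close to 1.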

[44] MIMIC-SR-ICD11: A Dataset for Narrative-Based Diagnosis

Yuexin Wu, Shiqi Wang, Vasile Rus

Main category: cs.CL

TL;DR: MIMIC-SR-ICD11 is a diagnostic dataset from EHR discharge notes aligned with WHO ICD-11 terminology, and LL-Rank is a likelihood-based re-ranking framework that outperforms baseline methods by isolating semantic compatibility from label frequency bias.

Details

Motivation: Disease diagnosis is crucial for healthcare, and self-reports preserve clinically important signals that EHR documentation often misses, especially subtle details. The goal is to leverage these self-reports for better diagnostic accuracy.

Method: Introduce MIMIC-SR-ICD11 dataset from EHR discharge notes aligned with ICD-11 terminology. Present LL-Rank, a likelihood-based re-ranking framework that computes length-normalized joint likelihood of labels given clinical context and subtracts report-free prior likelihood.

Result: LL-Rank consistently outperforms GenMap baseline across seven model backbones. Ablation experiments show gains primarily come from PMI-based scoring that isolates semantic compatibility from label frequency bias.

Conclusion: The proposed LL-Rank framework effectively improves diagnostic accuracy by focusing on semantic compatibility between clinical reports and labels, reducing the impact of label frequency bias.

Abstract: Disease diagnosis is a central pillar of modern healthcare, enabling early detection and timely intervention for acute conditions while guiding lifestyle adjustments and medication regimens to prevent or slow chronic disease. Self-reports preserve clinically salient signals that templated electronic health record (EHR) documentation often attenuates or omits, especially subtle but consequential details. To operationalize this shift, we introduce MIMIC-SR-ICD11, a large English diagnostic dataset built from EHR discharge notes and natively aligned to WHO ICD-11 terminology. We further present LL-Rank, a likelihood-based re-ranking framework that computes a length-normalized joint likelihood of each label given the clinical report context and subtracts the corresponding report-free prior likelihood for that label. Across seven model backbones, LL-Rank consistently outperforms a strong generation-plus-mapping baseline (GenMap). Ablation experiments show that LL-Rank’s gains primarily stem from its PMI-based scoring, which isolates semantic compatibility from label frequency bias.
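The scoring rule can be sketched as follows, assuming it operates on per-token log-probabilities and that both the report-conditioned term and the report-free prior are length-normalized (the abstract does not spell out these details). The label names and numbers are hypothetical; in practice the log-probabilities would come from a language model.

```python
def length_normalized(token_logprobs):
    """Mean log-probability per token of a label string."""
    return sum(token_logprobs) / len(token_logprobs)


def ll_rank(conditional, prior):
    """Rank labels by conditional score minus report-free prior score.

    conditional[label]: token log-probs of the label given the report.
    prior[label]: token log-probs of the same label without the report.
    The difference is a PMI-style score that discounts frequent labels.
    """
    scores = {
        label: length_normalized(conditional[label]) - length_normalized(prior[label])
        for label in conditional
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example: a frequent label and a rare one.
conditional = {"hypertension": [-1.0, -1.0], "glaucoma": [-1.5, -1.5, -1.5]}
prior = {"hypertension": [-0.5, -0.5], "glaucoma": [-4.0, -4.0, -4.0]}
ranking = ll_rank(conditional, prior)
```

In this toy case the frequent label has the higher raw conditional likelihood, but the rare label ranks first once the prior is subtracted, because the report raises its likelihood far more than it does for the frequent label.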

[45] To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models

Bastien Liétard, Pascal Denis, Mikaela Keller

Main category: cs.CL

TL;DR: The paper introduces Concept Induction as an unsupervised task that learns soft clustering of words into concepts, generalizing Word Sense Induction. It proposes a bi-level approach combining local lemma-centric and global cross-lexicon views, achieving good performance on semantic clustering and competitive results on Word-in-Context tasks.

Details

Motivation: Polysemy and synonymy are typically studied independently in NLP tasks, with polysemy focusing on word senses and synonymy on concepts. The paper aims to bridge this gap by developing a unified approach that can handle both phenomena together through concept induction.

Method: A bi-level approach to Concept Induction that leverages both local lemma-centric view (focusing on individual words) and global cross-lexicon view (considering relationships across the lexicon) to induce concepts from data in an unsupervised manner.

Result: The method achieves a BCubed F1 above 0.60 on SemCor’s annotated data, showing that the local and global levels are mutually beneficial for inducing both concepts and senses. Static embeddings built from the induced concepts perform competitively with the state of the art on the Word-in-Context task.

Conclusion: Concept Induction successfully bridges the gap between polysemy and synonymy studies, demonstrating that a unified approach can effectively induce concepts while also benefiting sense induction. The bi-level approach proves effective for capturing both local word meanings and global conceptual relationships.

Abstract: Polysemy and synonymy are two crucial interrelated facets of lexical ambiguity. While both phenomena are widely documented in lexical resources and have been studied extensively in NLP, leading to dedicated systems, they are often considered independently in practical problems. While many tasks dealing with polysemy (e.g. Word Sense Disambiguation or Induction) highlight the role of a word’s senses, the study of synonymy is rooted in the study of concepts, i.e. meanings shared across the lexicon. In this paper, we introduce Concept Induction, the unsupervised task of learning a soft clustering among words that defines a set of concepts directly from data. This task generalizes Word Sense Induction. We propose a bi-level approach to Concept Induction that leverages both a local lemma-centric view and a global cross-lexicon view to induce concepts. We evaluate the obtained clustering on SemCor’s annotated data and obtain good performance (BCubed F1 above 0.60). We find that the local and the global levels are mutually beneficial to induce concepts and also senses in our setting. Finally, we create static embeddings representing our induced concepts and use them on the Word-in-Context task, obtaining performance competitive with the state of the art.
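For readers unfamiliar with the reported metric, the following Python sketch computes BCubed precision, recall, and F1 for a hard clustering. This is a simplification: the paper learns a soft clustering, which requires an extended variant of BCubed not shown here, and taking F1 over the averaged precision and recall is one common convention, assumed here.

```python
def bcubed_f1(predicted, gold):
    """BCubed F1 for hard clusterings.

    predicted, gold: dicts mapping each item to a cluster id / gold class id.
    For each item, precision is the share of its predicted cluster that has
    the same gold class; recall is the share of its gold class that landed
    in the same predicted cluster. Both are averaged over items.
    """
    items = list(gold)
    precision_sum = recall_sum = 0.0
    for i in items:
        same_cluster = [j for j in items if predicted[j] == predicted[i]]
        same_class = [j for j in items if gold[j] == gold[i]]
        correct = sum(1 for j in same_cluster if gold[j] == gold[i])
        precision_sum += correct / len(same_cluster)
        recall_sum += correct / len(same_class)
    precision = precision_sum / len(items)
    recall = recall_sum / len(items)
    return 2 * precision * recall / (precision + recall)
```

Because every item counts itself, both averages are strictly positive and the harmonic mean is always defined.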

[46] LEME: Open Large Language Models for Ophthalmology with Advanced Reasoning and Clinical Validation

Hyunjae Kim, Xuguang Ai, Sahana Srinivasan, Aidan Gilson, Maxwell B. Singer, Krithi Pushpanathan, Qianqian Xie, Jungwoo Park, Serina Applebaum, Gabriel Dawei Yang, Minjie Zou, David Ziyou Chen, Ke Zou, Soshian Sarrafpour, Ji Liu, Yu Yin, Jimin Huang, Quang Ngoc Nguyen, Erping Long, Peixing Wan, Dianbo Liu, Richard Hintz, W. Jim Zheng, Sophia Y. Wang, Lucila Ohno-Machado, Hua Xu, Ron A. Adelman, Luciano V. Del Priore, Yih-Chung Tham, Qingyu Chen

Main category: cs.CL

TL;DR: LEME is an open-weight LLM suite for ophthalmology that outperforms GPT-4o and other baselines on clinical tasks through instruction tuning and reinforcement learning, achieving near-attending-level performance in patient QA, visual acuity extraction, and treatment planning.

Details

Motivation: The rising prevalence of eye diseases creates a growing public health burden, and while LLMs can reduce documentation workload and support clinical decision-making, few have been tailored for ophthalmology with proper clinical validation.

Method: Two-stage development: (1) instruction tuning on 200,000 samples from clinical guidelines, textbooks, and case reports; (2) reinforcement learning with ~30,000 preference labels to enhance accuracy and informativeness.

Result: Outperformed all 7 baselines including GPT-4o (3.32% absolute ROUGE-L gain), achieved highest clinician ratings in patient QA (4.67-4.88/5), surpassed expert-written answers in completeness, and achieved highest F1 in visual acuity extraction (14.1% better than LLaMA-3, 59.0% better than Eye-LLaMA).

Conclusion: LEME demonstrates strong performance across multiple ophthalmology tasks, approaching attending-level performance, and all models, data, and code will be released to support clinical translation and improved patient care.

Abstract: The rising prevalence of eye diseases poses a growing public health burden. Large language models (LLMs) offer a promising path to reduce documentation workload and support clinical decision-making. However, few have been tailored for ophthalmology, and most evaluations focus mainly on knowledge-based QA without clinically relevant benchmarks or real-world validation. Here, we present LEME, a suite of open-weight LLMs developed through a two-stage process: (1) instruction tuning on 200,000 samples from clinical guidelines, textbooks, and case reports to enhance reasoning and task-following, and (2) reinforcement learning with ~30,000 preference labels to enhance accuracy and informativeness. LEME was evaluated on five curated zero-shot benchmarks spanning tasks such as patient QA, consultation, and treatment planning. It outperformed all seven baselines (all p < 0.004), exceeding GPT-4o by 3.32% (absolute ROUGE-L gain). It was further evaluated on three downstream tasks using deidentified patient data, reviewed by clinicians. In patient QA, LEME received the highest ratings from attending clinicians in 3 out of 4 criteria, with scores of 4.67 for factuality, 4.77 for specificity, 4.79 for completeness, and 4.88 for safety (1-5 scale). Its completeness score surpassed that of expert-written answers (4.79 vs. 4.56; p = 0.015). In visual acuity extraction, LEME achieved the highest F1, outperforming LLaMA-3 by 14.1% and Eye-LLaMA by 59.0%. In a pilot evaluation on assessment and treatment planning for diabetic retinopathy, AMD, and glaucoma, LEME received scores of 4.36 for factuality, 4.55 for specificity, 4.42 for completeness, and 4.36 for safety, approaching attending-level performance. All models, data, and code will be released to support further development and clinical translation, laying the groundwork for improved efficiency and patient care.

[47] Extracting narrative signals from public discourse: a network-based approach

Armin Pournaki, Tom Willaert

Main category: cs.CL

TL;DR: A graph-based method using Abstract Meaning Representation (AMR) to extract and analyze political narratives from digital texts by identifying actors, events, and perspectivization as core narrative signals.

Details

Motivation: Growing need for empirical analysis methods to understand political narratives in digital media, addressing societal issues like polarization and misinformation.

Method: Extract AMR graphs from text corpora, apply narratology-based heuristics to filter for actors, events, and perspectivization, then reassemble these signals into networks for narrative reconstruction.

Result: A case study of State of the European Union addresses (2010-2023) shows that the method can inductively surface signals of political narratives from public discourse.

Conclusion: The proposed formalism enables systematic extraction and analysis of political narratives through combined distant and close reading approaches.

Abstract: Narratives are key interpretative devices by which humans make sense of political reality. As the significance of narratives for understanding current societal issues such as polarization and misinformation becomes increasingly evident, there is a growing demand for methods that support their empirical analysis. To this end, we propose a graph-based formalism and machine-guided method for extracting, representing, and analyzing selected narrative signals from digital textual corpora, based on Abstract Meaning Representation (AMR). The formalism and method introduced here specifically cater to the study of political narratives that figure in texts from digital media such as archived political speeches, social media posts, transcripts of parliamentary debates, and political manifestos on party websites. We approach the study of such political narratives as a problem of information retrieval: starting from a textual corpus, we first extract a graph-like representation of the meaning of each sentence in the corpus using AMR. Drawing on transferable concepts from narratology, we then apply a set of heuristics to filter these graphs for representations of 1) actors and their relationships, 2) the events in which these actors figure, and 3) traces of the perspectivization of these events. We approach these references to actors, events, and instances of perspectivization as core narrative signals that allude to larger political narratives. By systematically analyzing and re-assembling these signals into networks that guide the researcher to the relevant parts of the text, the underlying narratives can be reconstructed through a combination of distant and close reading. A case study of State of the European Union addresses (2010 – 2023) demonstrates how the formalism can be used to inductively surface signals of political narratives from public discourse.
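One of the filtering steps, linking actors to the events in which they figure, can be sketched over AMR-style triples. The heuristic below is a hypothetical illustration and not the paper's rule set: it treats any concept with a PropBank-style numeric suffix as an event and the target of its `:ARG0` role as an actor. Perspectivization is not covered, and the triples are hand-written rather than produced by an AMR parser.

```python
import re
from collections import Counter

# PropBank-style predicate concept, e.g. "propose-01".
FRAME = re.compile(r".+-\d\d$")


def actor_event_edges(triples):
    """Return (actor_concept, event_concept) pairs for one sentence graph.

    triples: (source_variable, role, target) tuples; ":instance" triples map
    a variable to its concept. Assumption: ":ARG0" of an event marks an actor.
    """
    concept = {s: t for s, role, t in triples if role == ":instance"}
    edges = []
    for s, role, t in triples:
        if role == ":ARG0" and FRAME.match(concept.get(s, "")) and t in concept:
            edges.append((concept[t], concept[s]))
    return edges


def narrative_network(corpus):
    """Aggregate actor-event edges over many sentence graphs with counts."""
    network = Counter()
    for triples in corpus:
        network.update(actor_event_edges(triples))
    return network


# Hand-written graph for "The commission proposes a budget."
sentence = [
    ("p", ":instance", "propose-01"),
    ("p", ":ARG0", "c"),
    ("c", ":instance", "commission"),
    ("p", ":ARG1", "b"),
    ("b", ":instance", "budget"),
]
network = narrative_network([sentence, sentence])
```

Edge counts in the resulting network indicate which actor-event pairings recur, which is the kind of signal that could point a researcher to passages worth close reading.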

[48] iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use

Yirong Zeng, Xiao Ding, Yuxian Wang, Weiwen Liu, Wu Ning, Yutai Hou, Xu Huang, Duyu Tang, Dandan Tu, Bing Qin, Ting Liu

Main category: cs.CL

TL;DR: Proposes iterative reinforced fine-tuning to address decay in training gains from synthetic tool-use data, achieving significant performance improvements over baseline models.

Details

Motivation: Training gains decay as synthetic tool-use data increases, limiting LLMs' ability to develop advanced tool-use capabilities in complex scenarios, particularly due to parameter errors in responses.

Method: Iterative reinforced fine-tuning strategy: (1) enhances response diversity through Monte Carlo Tree Search path exploration, (2) identifies model deficiencies via fine-grained preference pairs and improves them using preference optimization algorithms.

Result: Achieves 13.11% better performance than same-size base model, 6.5% improvement in complex scenarios over baseline, and outperforms larger open-source and closed-source models.

Conclusion: The proposed iterative reinforced fine-tuning strategy effectively addresses the limitations of synthetic data training and significantly enhances LLMs’ tool-use capabilities.

Abstract: Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains significantly decay as synthetic data increases. The model struggles to benefit from additional synthetic data, which fails to endow it with advanced tool-use capabilities in complex scenarios. Moreover, we discovered that the above limitation usually manifests as a fragment deficiency (i.e., parameter errors) in the response. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. This strategy involves: (1) enhancing the diversity of responses for synthetic data through path exploration with Monte Carlo Tree Search, and (2) iteratively pinpointing the model’s deficiency by constructing fine-grained preference pairs and then addressing it through targeted preference optimization. The experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.

[49] Activation-Informed Merging of Large Language Models

Amin Heyrani Nobari, Kaveh Alim, Ali ArjomandBigdeli, Akash Srivastava, Faez Ahmed, Navid Azizan

Main category: cs.CL

TL;DR: AIM (Activation-Informed Merging) improves LLM merging by using activation space information to preserve critical weights, boosting performance by up to 40% on benchmarks.

Details

Motivation: To enhance model merging performance by incorporating activation space information, drawing inspiration from continual learning and model compression principles.

Method: AIM integrates activation space information into existing merging methods using a task-agnostic calibration set to selectively prioritize essential weights during merging.

Result: AIM significantly improves merged model performance across multiple benchmarks, with up to 40% performance increase compared to standard merging approaches.

Conclusion: Activation-space information provides substantial advancements in LLM merging strategies, making AIM a flexible and effective complementary solution for model merging.

Abstract: Model merging, a method that combines the parameters and embeddings of multiple fine-tuned large language models (LLMs), offers a promising approach to enhance model performance across various tasks while maintaining computational efficiency. This paper introduces Activation-Informed Merging (AIM), a technique that integrates the information from the activation space of LLMs into the merging process to improve performance and robustness. AIM is designed as a flexible, complementary solution that is applicable to any existing merging method. It aims to preserve critical weights from the base model, drawing on principles from continual learning (CL) and model compression. Utilizing a task-agnostic calibration set, AIM selectively prioritizes essential weights during merging. We empirically demonstrate that AIM significantly enhances the performance of merged models across multiple benchmarks. Our findings suggest that considering the activation-space information can provide substantial advancements in the model merging strategies for LLMs, with up to a 40% increase in benchmark performance.
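The general idea of protecting activation-critical base weights during merging can be illustrated with a short Python sketch. The abstract does not give AIM's formulas, so everything here is an assumption made for illustration: saliency is taken to be the mean absolute input activation on a calibration set, and the hypothetical factor `omega` controls how strongly salient weights are held near the base model.

```python
def input_saliency(calibration_inputs):
    """Mean absolute activation per input feature, scaled to [0, 1]."""
    n_features = len(calibration_inputs[0])
    mean_abs = [
        sum(abs(x[j]) for x in calibration_inputs) / len(calibration_inputs)
        for j in range(n_features)
    ]
    peak = max(mean_abs)
    return [m / peak if peak > 0 else 0.0 for m in mean_abs]


def activation_informed_merge(w_base, w_merged, saliency, omega=0.5):
    """Damp the merged-minus-base update for weights fed by salient inputs.

    w_base, w_merged: weight matrices as lists of rows (one column per input
    feature). w_merged can come from any existing merging method.
    """
    return [
        [b + (1.0 - omega * s) * (m - b) for b, m, s in zip(row_b, row_m, saliency)]
        for row_b, row_m in zip(w_base, w_merged)
    ]
```

With `omega=0` the function returns the merged weights unchanged, so a step of this kind can be layered on top of any merging method, in keeping with the abstract's description of AIM as a complementary solution.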

[50] NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Ilia Kulikov, Kyunghyun Cho, Dong Wang, Yuandong Tian, Jason E Weston, Xian Li

Main category: cs.CL

TL;DR: NaturalReasoning is a 2.8M-question dataset spanning STEM, Economics, Social Sciences and more, enabling scalable reasoning capability development through knowledge distillation and self-training methods.

Details

Motivation: Scaling reasoning capabilities beyond traditional domains like math and coding is limited by lack of diverse, high-quality questions across various domains.

Method: Created NaturalReasoning dataset with 2.8M questions across multiple domains, then used knowledge distillation from strong teacher models and unsupervised self-training with reward models.

Result: The dataset effectively elicits and transfers reasoning capabilities from teacher models and works well for unsupervised self-training using external reward models or self-rewarding.

Conclusion: NaturalReasoning enables scalable development of reasoning capabilities across diverse domains and is publicly released to foster future research.

Abstract: Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding. To foster future work, we publicly release NaturalReasoning at https://huggingface.co/datasets/facebook/natural_reasoning.

[51] InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou

Main category: cs.CL

TL;DR: InterFeedback is an interactive framework that evaluates Large Multimodal Models’ ability to refine responses based on human feedback, revealing that even state-of-the-art models like OpenAI-o1 struggle with this capability.

Details

Motivation: Existing benchmarks don't test LMMs' interactive intelligence with human users, which is crucial for developing general-purpose AI assistants.

Method: Developed the InterFeedback framework for autonomous assessment, built InterFeedback-Bench on the MMMU-Pro and MathVerse datasets to test 10 open-source LMMs, and collected InterFeedback-Human, a dataset of 120 cases for manually testing leading models such as OpenAI-o1 and Claude-Sonnet-4.

Result: Even the state-of-the-art LMM, OpenAI-o1, struggles to refine its responses based on human feedback, with an average score below 50%.

Conclusion: There’s a critical need for methods to enhance LMMs’ capabilities to interpret and benefit from feedback for better interactive intelligence.

Abstract: Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework, which can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-Sonnet-4. Our evaluation results indicate that even the state-of-the-art LMM, OpenAI-o1, struggles to refine its responses based on human feedback, achieving an average score of less than 50%. Our findings point to the need for methods that can enhance LMMs’ capabilities to interpret and benefit from feedback.

Hyungyu Shin, Jingyu Tang, Yoonjoo Lee, Nayoung Kim, Hyunseung Lim, Ji Yong Cho, Hwajung Hong, Moontae Lee, Juho Kim

Main category: cs.CL

TL;DR: LLM-generated paper reviews have biased focus distributions compared to human experts, overemphasizing technical validity while significantly overlooking novelty assessment.

Details

Motivation: Peer review is strained by reviewer shortages, and while LLMs can draft reviews, it's unclear if they focus on the same critical facets (strengths/weaknesses) that drive human accept/reject decisions.

Method: Developed a focus-level evaluation framework operationalizing attention distribution across predefined facets (target: problem, method, experiment; aspect: validity, clarity, novelty), using 676 paper reviews with 3,657 human-identified strengths/weaknesses from OpenReview.

Result: Off-the-shelf LLMs consistently show more biased focus distributions than human experts, with excessive attention to technical validity and significant neglect of novelty assessment when criticizing papers.

Conclusion: Current LLM-generated reviews have systematic biases in focus distribution compared to human experts, particularly overlooking novelty evaluation, which limits their trustworthiness for automated peer review.

Abstract: Peer review underpins scientific progress, but it is increasingly strained by reviewer shortages and growing workloads. Large Language Models (LLMs) can automatically draft reviews now, but determining whether LLM-generated reviews are trustworthy requires systematic evaluation. Researchers have evaluated LLM reviews at either surface-level (e.g., BLEU and ROUGE) or content-level (e.g., specificity and factual accuracy). Yet it remains uncertain whether LLM-generated reviews attend to the same critical facets that human experts weigh – the strengths and weaknesses that ultimately drive an accept-or-reject decision. We introduce a focus-level evaluation framework that operationalizes the focus as a normalized distribution of attention across predefined facets in paper reviews. Based on the framework, we developed an automatic focus-level evaluation pipeline based on two sets of facets: target (e.g., problem, method, and experiment) and aspect (e.g., validity, clarity, and novelty), leveraging 676 paper reviews (https://figshare.com/s/d5adf26c802527dd0f62) from OpenReview that consists of 3,657 strengths and weaknesses identified from human experts. The comparison of focus distributions between LLMs and human experts showed that the off-the-shelf LLMs consistently have a more biased focus towards examining technical validity while significantly overlooking novelty assessment when criticizing papers.
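
Illustrative note: the abstract defines focus as a normalized distribution of attention across predefined facets. The sketch below shows one way such a distribution could be computed and compared. The example labels are hypothetical, and the total-variation distance is an assumption made here for illustration, not necessarily the comparison used in the paper.

```python
from collections import Counter

# Aspect facets named in the abstract.
ASPECTS = ["validity", "clarity", "novelty"]


def focus_distribution(labels, facets):
    """Share of review points assigned to each facet (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts[f] for f in facets)
    if total == 0:
        return {f: 0.0 for f in facets}
    return {f: counts[f] / total for f in facets}


def total_variation(p, q):
    """Half the L1 distance between two distributions over the same facets."""
    return 0.5 * sum(abs(p[f] - q[f]) for f in p)


# Hypothetical facet labels for the weaknesses raised about one paper.
human_labels = ["novelty", "validity", "novelty", "clarity"]
llm_labels = ["validity", "validity", "validity", "clarity"]

human_focus = focus_distribution(human_labels, ASPECTS)
llm_focus = focus_distribution(llm_labels, ASPECTS)
gap = total_variation(human_focus, llm_focus)
```

In this toy case the LLM review never raises novelty, so all of the gap comes from mass shifted from novelty to validity.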

[53] Exploring Multimodal Perception in Large Language Models Through Perceptual Strength Ratings

Jonghyun Lee, Dojun Park, Jiwoo Lee, Hoekeon Choi, Sung-Eun Lee

Main category: cs.CL

TL;DR: Multimodal LLMs can approximate human sensory grounding with 85-90% accuracy and 0.58-0.65 correlations, but still differ from human embodied cognition despite multimodal integration.

Details

Motivation: To investigate whether multimodal large language models can achieve human-like sensory grounding and examine how model characteristics influence this capability.

Method: Evaluated 21 models from GPT, Gemini, LLaMA, and Qwen families using 3,611 words from Lancaster Sensorimotor Norms through correlation, distance metrics, and qualitative analysis.

Result: Larger, multimodal, and newer models generally outperformed smaller, text-based, and older counterparts. Top models achieved substantial similarity to human ratings but showed processing differences.

Conclusion: Advanced LLMs can approximate human sensory-linguistic associations through statistical learning but still differ from human embodied cognition in processing mechanisms, even with multimodal integration.

Abstract: This study investigated whether multimodal large language models can achieve human-like sensory grounding by examining their ability to capture perceptual strength ratings across sensory modalities. We explored how model characteristics (size, multimodal capabilities, architectural generation) influence grounding performance, distributional factor dependencies (word frequency, embeddings, feature distances), and human-model processing differences. We evaluated 21 models from four families (GPT, Gemini, LLaMA, Qwen) using 3,611 words from the Lancaster Sensorimotor Norms through correlation, distance metrics, and qualitative analysis. Results showed that larger (6 out of 8 comparisons), multimodal (5 of 7), and newer models (5 of 8) generally outperformed their smaller, text-based, and older counterparts. Top models achieved 85-90% accuracy and 0.58-0.65 correlations with human ratings, demonstrating substantial similarity. Moreover, distributional factors showed minimal impact, not exceeding human dependency levels. However, despite strong alignment, models were not identical to humans, as even top performers showed differences in distance and correlation measures, with qualitative analysis revealing processing patterns related to absent sensory grounding. Additionally, it remains questionable whether introducing multimodality resolves this grounding deficit. Although multimodality improved performance, it seems to provide similar information as massive text rather than qualitatively different data, as benefits occurred across unrelated sensory dimensions and massive text-only models achieved comparable results. Our findings demonstrate that while advanced LLMs can approximate human sensory-linguistic associations through statistical learning, they still differ from human embodied cognition in processing mechanisms, even with multimodal integration.

[54] MorphTok: Morphologically Grounded Tokenization for Indian Languages

Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar

Main category: cs.CL

TL;DR: The paper proposes morphology-aware segmentation and Constrained BPE (CBPE) to improve tokenization for Indic languages, showing better performance in machine translation and language modeling.

Details

Motivation: Existing LLMs use Byte-pair Encoding (BPE) that often produces linguistically misaligned segmentation, especially problematic for Indic languages with complex morphology and dependent vowels.

Method: 1) Morphology-aware pre-tokenization using sandhi splitting for Hindi and Marathi; 2) Constrained BPE (CBPE) with script-specific constraints to handle dependent vowels; 3) New human evaluation metric EvalTok.

Result: Morphologically grounded tokenization improves downstream task performance. CBPE reduces fertility scores by 1.68% while maintaining or improving machine translation and language modeling performance.

Conclusion: Morphology-aware segmentation and CBPE provide computationally efficient alternatives to standard BPE that better align with linguistic units, particularly beneficial for Indic languages.

Abstract: Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams, often leading to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step before applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves machine translation and language modeling performance. Additionally, to handle the dependent vowels common in syllable-based writing systems used by Indic languages, we propose Constrained BPE (CBPE), an extension to the standard BPE algorithm incorporating script-specific constraints. In particular, CBPE handles dependent vowels to form a cohesive unit with other characters instead of occurring as a single unit. Our results show that CBPE achieves a 1.68% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation and language modeling, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, EvalTok, enabling more human-grounded assessment.
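
Illustrative note: fertility is commonly defined as the average number of subword tokens per word, so a lower score means fewer tokens for the same text. The sketch below computes it for two made-up segmentations, one where a dependent vowel is left as its own token and one where it is attached to its consonant. The segmentations are hypothetical and were not produced by BPE or CBPE, and reading the reported reduction as a relative change is an assumption.

```python
def fertility(tokenized_words):
    """Average number of subword tokens per word."""
    if not tokenized_words:
        raise ValueError("need at least one word")
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


def relative_reduction(baseline, candidate):
    """Percentage by which `candidate` is lower than `baseline`."""
    return (baseline - candidate) / baseline * 100.0


# Hypothetical segmentations of the same three Hindi words.
# In the first, the dependent vowel sign stands alone as a token.
standalone_vowel_seg = [["क", "ि", "ताब"], ["घर"], ["पा", "नी"]]
# In the second, the vowel sign stays attached to its consonant.
attached_vowel_seg = [["कि", "ताब"], ["घर"], ["पा", "नी"]]

baseline_fertility = fertility(standalone_vowel_seg)
constrained_fertility = fertility(attached_vowel_seg)
reduction_pct = relative_reduction(baseline_fertility, constrained_fertility)
```

The toy reduction here is far larger than the 1.68% reported, since the example corpus has only three words and one of them changes.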

[55] Fair Document Valuation in LLM Summaries via Shapley Values

Zikun Ye, Hema Yoganarasimhan

Main category: cs.CL

TL;DR: Proposes Cluster Shapley, a Shapley value-based framework for fair document valuation in LLM-generated summaries, addressing credit attribution challenges by leveraging semantic similarity to improve computational efficiency.

Details

Motivation: LLM-based systems that retrieve and summarize content obscure individual contributions of original creators, raising concerns about fair credit attribution and compensation.

Method: Developed Cluster Shapley approximation algorithm that uses semantic similarity among documents to reduce Shapley value computation cost while maintaining accuracy, compared against Monte Carlo sampling and Kernel SHAP.

Result: Cluster Shapley substantially improves the efficiency-accuracy frontier over off-the-shelf Shapley approximations. Simple attribution rules (e.g., equal or relevance-based allocation) are computationally cheap but lead to highly unfair outcomes.

Conclusion: Structure-aware Shapley approximations like Cluster Shapley are promising for scalable and fair content attribution in LLM summarization systems, offering practical guidance for platforms.

Abstract: Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources, such as search engines and AI assistants. While these systems enhance user experience through coherent summaries, they obscure the individual contributions of original content creators, raising concerns about credit attribution and compensation. We address the challenge of valuing individual documents used in LLM-generated summaries by proposing a Shapley value-based framework for fair document valuation. Although theoretically appealing, exact Shapley value computation is prohibitively expensive at scale. To improve efficiency, we develop Cluster Shapley, a simple approximation algorithm that leverages semantic similarity among documents to reduce computation while maintaining attribution accuracy. Using Amazon product review data, we empirically show that off-the-shelf Shapley approximations, such as Monte Carlo sampling and Kernel SHAP, perform suboptimally in LLM settings, whereas Cluster Shapley substantially improves the efficiency-accuracy frontier. Moreover, simple attribution rules (e.g., equal or relevance-based allocation), though computationally cheap, lead to highly unfair outcomes. Together, our findings highlight the potential of structure-aware Shapley approximations tailored to LLM summarization and offer guidance for platforms seeking scalable and fair content attribution mechanisms.
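
Illustrative note: the sketch below computes exact Shapley values for a toy document set, then a cluster-level shortcut that treats each group of similar documents as one player and splits its credit equally among members. This is a simplified reading of the idea of exploiting semantic similarity, not the authors' Cluster Shapley algorithm. The documents, topics, value function, and clustering are all hypothetical.

```python
from itertools import permutations
from math import factorial


def exact_shapley(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


# Hypothetical stand-in for summary quality: distinct topics covered.
DOC_TOPICS = {
    "d1": {"battery", "screen"},
    "d2": {"battery", "screen"},  # near-duplicate of d1
    "d3": {"price"},
}


def summary_value(docs):
    covered = set()
    for d in docs:
        covered |= DOC_TOPICS[d]
    return float(len(covered))


def cluster_level_shapley(clusters, value):
    """Shapley over clusters, with each cluster's credit split equally."""
    def cluster_value(names):
        return value(frozenset(d for c in names for d in clusters[c]))

    scores = exact_shapley(list(clusters), cluster_value)
    return {d: scores[c] / len(members)
            for c, members in clusters.items() for d in members}


exact = exact_shapley(list(DOC_TOPICS), summary_value)
approx = cluster_level_shapley({"A": ["d1", "d2"], "B": ["d3"]}, summary_value)
```

With n documents the exact computation walks n! orderings, while the cluster version walks k! orderings for k clusters, which is where the savings come from. In this toy case the two results coincide because d1 and d2 are interchangeable. With imperfect clusters the shortcut would only approximate the exact values.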

[56] ProRefine: Inference-Time Prompt Refinement with Textual Feedback

Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Isabelle Diana May-Xin Ng, Christopher M. Homan, Wei Wei

Main category: cs.CL

TL;DR: ProRefine is an inference-time prompt optimization method that uses LLM agentic loops to dynamically refine prompts for multi-step reasoning tasks, achieving significant performance improvements over zero-shot Chain-of-Thought baselines.

Details

Motivation: Agentic workflows are crucial for complex AI tasks but suffer from sub-optimal performance due to poorly designed prompts that can snowball in multi-agent systems, limiting reliability and scalability.

Method: ProRefine uses an agentic loop of LLMs to generate and apply textual feedback, dynamically refining prompts for multi-step reasoning tasks without requiring additional training or ground truth labels.

Result: Evaluated on five mathematical reasoning datasets, ProRefine surpassed zero-shot Chain-of-Thought baselines by 3 to 37 percentage points and enabled smaller models to approach the performance of larger counterparts.

Conclusion: ProRefine demonstrates potential for building more cost-effective and powerful hybrid AI systems, democratizing access to high-performing AI through improved prompt optimization.

Abstract: Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, play a substantial role in many cutting-edge commercial applications, and continue to fascinate researchers across fields for their potential to accomplish expensive, complex tasks that, until recently, only humans have been trusted to do. These workflows depend critically on the prompts used to provide the roles models play in such workflows. Poorly designed prompts that fail even slightly to guide individual agents can lead to sub-optimal performance that may snowball within a system of agents, limiting their reliability and scalability. To address this important problem of inference-time prompt optimization, we introduce ProRefine, an innovative inference-time optimization method that uses an agentic loop of LLMs to generate and apply textual feedback. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to approach the performance of their larger counterparts. This highlights its potential for building more cost-effective and powerful hybrid AI systems, thereby democratizing access to high-performing AI.

[57] Scalable Medication Extraction and Discontinuation Identification from Electronic Health Records Using Large Language Models

Chong Shao, Douglas Snyder, Chiran Li, Bowen Gu, Kerry Ngan, Chun-Ting Yang, Jiageng Wu, Richard Wyss, Kueiyu Joshua Lin, Jie Yang

Main category: cs.CL

TL;DR: LLMs show promising performance for medication extraction and discontinuation classification from EHR notes, with GPT-4o achieving the best results and open-source models like Llama-3.1-70B-Instruct offering scalable alternatives.

Details

Motivation: Identifying medication discontinuations in EHRs is crucial for patient safety but challenging due to unstructured data, requiring scalable automated solutions without human annotation.

Method: Evaluated 12 advanced LLMs on three EHR datasets using multiple prompting strategies including zero-shot, few-shot, and chain-of-thought reasoning for medication extraction and status classification.

Result: GPT-4o achieved highest average F1 scores: 94.0% for extraction, 78.1% for classification, 72.7% for joint task. Open-source models performed competitively, with Llama-3.1-70B-Instruct achieving best results on specific datasets.

Conclusion: LLMs demonstrate strong potential for medication extraction and discontinuation identification, with open-source models providing scalable alternatives and few-shot learning further improving performance.

Abstract: Identifying medication discontinuations in electronic health records (EHRs) is vital for patient safety but is often hindered by information being buried in unstructured notes. This study aims to evaluate the capabilities of advanced open-sourced and proprietary large language models (LLMs) in extracting medications and classifying their medication status from EHR notes, focusing on their scalability on medication information extraction without human annotation. We collected three EHR datasets from diverse sources to build the evaluation benchmark. We evaluated 12 advanced LLMs and explored multiple LLM prompting strategies. Performance on medication extraction, medication status classification, and their joint task (extraction then classification) was systematically compared across all experiments. We found that LLMs showed promising performance on the medication extraction and discontinuation classification from EHR notes. GPT-4o consistently achieved the highest average F1 scores in all tasks under zero-shot setting - 94.0% for medication extraction, 78.1% for discontinuation classification, and 72.7% for the joint task. Open-sourced models followed closely, Llama-3.1-70B-Instruct achieved the highest performance in medication status classification on the MIV-Med dataset (68.7%) and in the joint task on both the Re-CASI (76.2%) and MIV-Med (60.2%) datasets. Medical-specific LLMs demonstrated lower performance compared to advanced general-domain LLMs. Few-shot learning generally improved performance, while CoT reasoning showed inconsistent gains. LLMs demonstrate strong potential for medication extraction and discontinuation identification on EHR notes, with open-sourced models offering scalable alternatives to proprietary systems and few-shot can further improve LLMs’ capability.

[58] NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance

Hanwool Lee, Sara Yu, Yewon Hwang, Jonghyun Choi, Heejae Ahn, Sungbum Jung, Youngjae Yu

Main category: cs.CL

TL;DR: NMIXX is a cross-lingual embedding model suite for finance that achieves significant performance gains on financial semantic textual similarity tasks, especially for Korean, while releasing a new Korean financial benchmark (KorFinSTS).

Details

Motivation: General-purpose sentence embedding models struggle with financial semantics in low-resource languages like Korean due to domain jargon, temporal meaning shifts, and bilingual vocabulary misalignment.

Method: Fine-tuned cross-lingual embedding models using 18.8K high-confidence triplets containing in-domain paraphrases, hard negatives from semantic-shift typology, and exact Korean-English translations.

Result: Against seven open-license baselines, NMIXX’s multilingual bge-m3 variant achieved Spearman’s rho gains of +0.10 on English FinSTS and +0.22 on KorFinSTS, outperforming its pre-adaptation checkpoint and surpassing the other models by the largest margin, with a modest trade-off in general STS performance.

Conclusion: Models with richer Korean token coverage adapt more effectively, highlighting the importance of tokenizer design in low-resource cross-lingual settings. Both models and benchmark are released publicly to advance domain-adapted multilingual representation learning in finance.

Abstract: General-purpose sentence embedding models often struggle to capture specialized financial semantics, especially in low-resource languages like Korean, due to domain-specific jargon, temporal meaning shifts, and misaligned bilingual vocabularies. To address these gaps, we introduce NMIXX (Neural eMbeddings for Cross-lingual eXploration of Finance), a suite of cross-lingual embedding models fine-tuned with 18.8K high-confidence triplets that pair in-domain paraphrases, hard negatives derived from a semantic-shift typology, and exact Korean-English translations. Concurrently, we release KorFinSTS, a 1,921-pair Korean financial STS benchmark spanning news, disclosures, research reports, and regulations, designed to expose nuances that general benchmarks miss. When evaluated against seven open-license baselines, NMIXX’s multilingual bge-m3 variant achieves Spearman’s rho gains of +0.10 on English FinSTS and +0.22 on KorFinSTS, outperforming its pre-adaptation checkpoint and surpassing other models by the largest margin, while revealing a modest trade-off in general STS performance. Our analysis further shows that models with richer Korean token coverage adapt more effectively, underscoring the importance of tokenizer design in low-resource, cross-lingual settings. By making both models and the benchmark publicly available, we provide the community with robust tools for domain-adapted, multilingual representation learning in finance.

Sneha Oram, Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: This paper investigates pragmatic reasoning capabilities of LLMs in mental health, introduces the PRiMH dataset with pragmatic implicature and presupposition tasks, benchmarks four models, and studies mental health stigma using three StiPRompts.

Details

Motivation: To bridge NLP and mental health through interpretable and reasoning-capable AI systems, as reasoning has not been examined in depth compared to explainability and interpretability in mental health AI.

Method: Introduces the PRiMH dataset with pragmatic reasoning tasks (two implicature tasks, one presupposition task), benchmarks four LLMs (Llama3.1, Mistral, MentaLLaMA, Qwen), analyzes MentaLLaMA with rollout attention, and proposes three StiPRompts to study stigma with GPT4o-mini, Deepseek-chat, and Claude-3.5-haiku.

Result: Mistral and Qwen show substantial reasoning abilities in mental health domain. Claude-3.5-haiku deals with mental health stigma more responsibly compared to GPT4o-mini and Deepseek-chat.

Conclusion: The study demonstrates varying reasoning capabilities of LLMs in mental health domain and highlights the importance of responsible AI handling of mental health stigma, with Claude-3.5-haiku showing the most responsible behavior.

Abstract: Although explainability and interpretability have received significant attention in artificial intelligence (AI) and natural language processing (NLP) for mental health, reasoning has not been examined in the same depth. Addressing this gap is essential to bridge NLP and mental health through interpretable and reasoning-capable AI systems. To this end, we investigate the pragmatic reasoning capability of large-language models (LLMs) in the mental health domain. We introduce PRiMH dataset, and propose pragmatic reasoning tasks in mental health with pragmatic implicature and presupposition phenomena. In particular, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the tasks presented, we consider four models: Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning abilities in the domain. Subsequently, we study the behavior of MentaLLaMA on the proposed reasoning tasks with the rollout attention mechanism. In addition, we also propose three StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT4o-mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with stigma more responsibly compared to the other two LLMs.

[60] Learning Dynamics of Meta-Learning in Small Model Pretraining

David Demitri Africa, Yuval Weiss, Paula Buttery, Richard Diehl Martinez

Main category: cs.CL

TL;DR: Meta-learning with first-order MAML improves small language model pretraining: it reaches the same loss up to 1.6x sooner, improves multilingual NER, and yields more interpretable, clearly two-stage training dynamics.

Details

Motivation: Large language models are powerful but costly, so the research aims to make pretraining of small language models more efficient and interpretable using meta-learning.

Method: Integrate first-order MAML with subset-masked LM pretraining, producing four Llama-style decoder-only models (11M-570M params), and evaluate on a fundamental NLP task across many settings.

Result: Compared to vanilla training: reaches the same loss up to 1.6x sooner, improves F1 on multilingual Universal NER under equal compute, and shows clear two-stage training dynamics with representation diversification followed by compression.

Conclusion: Meta-learning provides a compact, interpretable signature of adaptation with measurable performance improvements, making small language model training more efficient and transparent.

Abstract: Large language models are powerful but costly. We ask whether meta-learning can make the pretraining of small language models not only better but also more interpretable. We integrate first-order MAML with subset-masked LM pretraining, producing four LLama-style decoder-only models (11M-570M params), and evaluate it on a fundamental NLP task with many settings and real-world applications. Compared with vanilla training, our model (i) reaches the same loss up to 1.6x sooner, (ii) improves F1 on multilingual Universal NER under equal compute, and (iii) makes the training dynamics easy to read: first the network’s representations fan out (“diversify”) and later they collapse into a smaller, shared subspace (“compress”). This two-stage shift shows up as a rise-and-fall in both effective-rank curves and attention-head entropy. The same curves pinpoint which layers specialise earliest and which later reconverge, giving a compact, interpretable signature of meta-adaptation. Code, checkpoints and WandB logs are released.

[61] Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration

Songyuan Sui, Hongyi Liu, Serena Liu, Li Li, Soo-Hyun Choi, Rui Chen, Xia Hu

Main category: cs.CL

TL;DR: Chain-of-Query (CoQ) is a multi-agent framework that improves table understanding by using natural-language schema representations, clause-by-clause SQL generation, and hybrid reasoning to reduce structural complexity and execution dependency.

Details

Motivation: LLMs struggle with table understanding due to structural complexity of tabular data, and existing multi-agent SQL frameworks have limitations like poor table structure comprehension, error propagation, and over-reliance on execution correctness.

Method: CoQ uses natural-language-style table schema representations to reduce structural noise, employs clause-by-clause SQL generation for better query quality, and implements hybrid reasoning that separates SQL-based mechanical reasoning from LLM-based logical inference.

Result: Extensive experiments across four models and five benchmarks show CoQ achieves substantial accuracy improvements and significantly lowers invalid SQL rates compared to prior LLM-based, SQL-aided, and hybrid baselines.

Conclusion: CoQ demonstrates superior effectiveness in table understanding by addressing key limitations of existing approaches through its novel multi-agent framework design.

Abstract: Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Extensive experiments across four models and five widely used benchmarks demonstrate that CoQ achieves substantial accuracy improvements and significantly lowers invalid SQL rates compared to prior generic LLM-based, SQL-aided, and hybrid baselines, confirming its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.

[62] DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models

Kaiwen Yan, Xuanqing Shi, Hongcheng Guo, Wenxuan Wang, Zhuosheng Zhang, Chengwei Qin

Main category: cs.CL

TL;DR: DRQA addresses overthinking in reasoning LLMs by training models to allocate reasoning resources adaptively, reducing token usage while maintaining accuracy.

Details

Motivation: RLLMs suffer from overthinking - producing unnecessarily long reasoning chains for simple questions, leading to computational inefficiency and excessive token consumption.

Method: Dynamic Reasoning Quota Allocation (DRQA) uses batch-generated preference data and reinforcement learning to train models to allocate reasoning resources adaptively, transferring benefits of resource competition from batch to single-question inference.

Result: Extensive experiments show DRQA significantly reduces token usage while maintaining or improving answer accuracy across mathematical and scientific reasoning benchmarks.

Conclusion: DRQA effectively mitigates overthinking and offers a promising direction for more efficient and scalable deployment of RLLMs, inspiring further exploration into fine-grained control of reasoning behaviors.

Abstract: Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource-efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single-question inference. Specifically, DRQA leverages batch-generated preference data and reinforcement learning to train the model to allocate reasoning resources adaptively. By encouraging the model to internalize a preference for responses that are both accurate and concise, DRQA enables it to generate concise answers for simple questions while retaining sufficient reasoning depth for more challenging ones. Extensive experiments on a wide range of mathematical and scientific reasoning benchmarks demonstrate that DRQA significantly reduces token usage while maintaining, and in many cases improving, answer accuracy. By effectively mitigating the overthinking problem, DRQA offers a promising direction for more efficient and scalable deployment of RLLMs, and we hope it inspires further exploration into fine-grained control of reasoning behaviors.
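
The abstract says the model is trained to prefer responses that are both accurate and concise, but it does not spell out how the preference data is built. The sketch below shows one way such pairs could be formed from several sampled responses to the same question. The `Response` fields and the selection rule are assumptions for illustration, not the paper's procedure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Response:
    text: str
    correct: bool   # whether the final answer matched the reference
    n_tokens: int   # length of the reasoning chain in tokens


def preference_pair(responses: List[Response]) -> Optional[Tuple[Response, Response]]:
    """Return (chosen, rejected), or None if no useful pair exists.

    Assumed rule: the chosen response is the shortest correct one; the
    rejected response is an incorrect one if available, otherwise the
    longest correct one (provided it is strictly longer than the chosen one).
    """
    correct = [r for r in responses if r.correct]
    if not correct:
        return None
    chosen = min(correct, key=lambda r: r.n_tokens)
    incorrect = [r for r in responses if not r.correct]
    if incorrect:
        return chosen, incorrect[0]
    longest = max(correct, key=lambda r: r.n_tokens)
    if longest.n_tokens > chosen.n_tokens:
        return chosen, longest
    return None


# Toy usage with invented samples.
samples = [
    Response("long but right", True, 900),
    Response("short and right", True, 120),
    Response("short but wrong", False, 60),
]
pair = preference_pair(samples)
print(pair[0].text, "|", pair[1].text)
```

Pairs of this shape could then feed a standard preference-optimization objective. Which objective DRQA uses is not stated in the abstract.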

[63] SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM

Pengjiang Li, Zaitian Wang, Xinhao Zhang, Ran Zhang, Lu Jiang, Pengfei Wang, Yuanchun Zhou

Main category: cs.CL

TL;DR: SciTopic is an advanced topic discovery method that uses large language models (LLMs) to enhance scientific topic identification by capturing comprehensive semantic understanding from publications and optimizing text representations through contrastive learning.

Details

Motivation: Existing topic discovery methods rely on word embeddings and struggle with complex text relationships, lacking comprehensive understanding of scientific publications. LLMs' exceptional text comprehension capabilities can improve scientific topic identification.

Method: 1) Build a textual encoder for scientific publications (metadata, title, abstract); 2) Use entropy-based sampling and LLM-guided triplet tasks to optimize the embedding space; 3) Fine-tune the encoder with a contrastive loss over the LLM-guided triplets so that it better discriminates between topics.

Result: Extensive experiments on three real-world scientific publication datasets show that SciTopic outperforms state-of-the-art scientific topic discovery methods.

Conclusion: SciTopic enables researchers to gain deeper and faster insights into scientific literature by leveraging LLMs for improved topic discovery and semantic understanding.

Abstract: Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific information retrieval. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embedding to capture the semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose an advanced topic discovery method enhanced by LLMs to improve scientific topic identification, namely SciTopic. Specifically, we first build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs, enhancing the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we propose to fine-tune the textual encoder based on the guidance from the LLMs by optimizing the contrastive loss of the triplets, forcing the text encoder to better discriminate instances of different topics. Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.
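
The abstract mentions a contrastive loss over triplets without giving its form. A common choice is the margin-based triplet loss, sketched here in plain Python over cosine distance. The margin value and the use of cosine distance are assumptions, and the toy vectors stand in for encoder outputs.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy usage: in the paper's setting an LLM would decide which of two
# candidate papers shares the anchor's topic; here that choice is hard-coded.
anchor, same_topic, other_topic = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
print(triplet_loss(anchor, same_topic, other_topic))   # well-separated triplet
print(triplet_loss(anchor, other_topic, same_topic))   # violated triplet
```

Minimizing a loss of this kind pulls same-topic publications together and pushes different-topic ones apart in the embedding space, which matches the abstract's stated goal of better discriminating instances of different topics.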

[64] Are Humans as Brittle as Large Language Models?

Jiahui Li, Sean Papay, Roman Klinger

Main category: cs.CL

TL;DR: This paper compares prompt brittleness between LLMs and human annotators, finding both show sensitivity to certain prompt modifications like label substitutions, but humans are less affected by typos and reversed label order.

Details

Motivation: To investigate whether human annotators exhibit prompt sensitivity similar to that of LLMs, challenging the assumption that prompt brittleness is unique to LLMs and exploring whether it might reflect natural human annotation variance.

Method: Systematically comparing effects of prompt modifications on both LLMs and human annotators through text classification tasks with identical instruction variations.

Result: Both humans and LLMs show increased brittleness to specific prompt modifications (alternative label sets/formats), but human judgments are less affected by typographical errors and reversed label order than LLMs.

Conclusion: Prompt brittleness is not unique to LLMs - humans also exhibit similar sensitivity patterns, suggesting this phenomenon may reflect inherent characteristics of annotation processes rather than being problematic LLM behavior.

Abstract: The output of large language models (LLMs) is unstable, due both to the non-determinism of the decoding process and to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to prompt changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.

[65] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems

Channdeth Sok, David Luz, Yacine Haddam

Main category: cs.CL

TL;DR: MetaRAG is a metamorphic testing framework for detecting hallucinations in Retrieval-Augmented Generation (RAG) systems by decomposing answers into factoids, generating mutations, verifying against retrieved context, and scoring inconsistencies.

Details

Motivation: Existing hallucination detection methods like SelfCheckGPT and MetaQA don't address RAG systems' unique challenges where responses must align with retrieved evidence, creating reliability concerns for enterprise deployment.

Method: Four-stage framework: (1) decompose answers into atomic factoids, (2) generate mutations using synonym/antonym substitutions, (3) verify variants against retrieved context (synonyms should be entailed, antonyms contradicted), (4) aggregate penalties into hallucination scores.

Result: Experiments on proprietary enterprise data show MetaRAG effectively detects hallucinations and enables trustworthy deployment of RAG-based conversational agents, with span-level localization for identity-sensitive content.

Conclusion: MetaRAG provides real-time, unsupervised hallucination detection for RAG systems without ground-truth or model access, supporting identity-aware AI deployment through localized claim verification and configurable safeguards.

Abstract: Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG’s span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
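
Stages (3) and (4) of the pipeline above can be sketched as follows. Decomposition, mutation, and entailment checking would need a language model in practice, so the verifier is passed in as a callback and replaced by a lookup table in the toy example. The penalty definition (fraction of variants with an unexpected verdict) and the max-aggregation are assumptions, not MetaRAG's published scoring rule.

```python
from typing import Callable, List, Tuple

Variant = Tuple[str, str]  # (mutated factoid text, "synonym" or "antonym")

# Expected verdict against the retrieved context for each mutation kind.
EXPECTED = {"synonym": "entailed", "antonym": "contradicted"}


def factoid_penalty(variants: List[Variant], verify: Callable[[str], str]) -> float:
    """Fraction of variants whose verdict differs from the expected one."""
    if not variants:
        return 0.0
    misses = sum(1 for text, kind in variants if verify(text) != EXPECTED[kind])
    return misses / len(variants)


def hallucination_score(penalties: List[float]) -> float:
    """Response-level score; taking the worst factoid is an assumed aggregation."""
    return max(penalties, default=0.0)


# Toy verifier: a lookup table standing in for an entailment check
# against the retrieved context. All sentences are invented.
toy_verdicts = {
    "The office opens at 9 am": "entailed",
    "The office closes at 9 am": "contradicted",
}


def toy_verify(text: str) -> str:
    return toy_verdicts.get(text, "neutral")


supported = [("The office opens at 9 am", "synonym"),
             ("The office closes at 9 am", "antonym")]
unsupported = [("Refunds take five days", "synonym"),
               ("Refunds never take five days", "antonym")]

penalties = [factoid_penalty(supported, toy_verify),
             factoid_penalty(unsupported, toy_verify)]
print(penalties, hallucination_score(penalties))
```

Keeping one penalty per factoid is what allows the span-level localization the abstract describes: the factoid with the highest penalty points to the claim the context does not support.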

[66] Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents

Gauri Kholkar, Ratinder Ahuja

Main category: cs.CL

TL;DR: A regulatory ML framework that converts design documents into verifiable runtime guardrails using the Policy as Prompt method, enabling secure-by-design AI agent deployment with continuous compliance.

Details

Motivation: Need for effective ways to turn policy into enforceable controls for autonomous AI agents in regulated and safety-critical settings.

Method: Policy as Prompt method reads unstructured design artifacts (PRDs, TDDs, code) to build source-linked policy trees, which are compiled into lightweight prompt-based classifiers for real-time runtime monitoring.

Result: System reduces prompt-injection risk, blocks out-of-scope requests, limits toxic outputs, and generates auditable rationales aligned with AI governance frameworks.

Conclusion: Treating policies as executable prompts enables secure-by-design deployment, continuous compliance, and scalable AI safety and security assurance for regulatable ML.

Abstract: As autonomous AI agents are used in regulated and safety-critical settings, organizations need effective ways to turn policy into enforceable controls. We introduce a regulatory machine learning framework that converts unstructured design artifacts (like PRDs, TDDs, and code) into verifiable runtime guardrails. Our Policy as Prompt method reads these documents and risk controls to build a source-linked policy tree. This tree is then compiled into lightweight, prompt-based classifiers for real-time runtime monitoring. The system is built to enforce least privilege and data minimization. For conformity assessment, it provides complete provenance, traceability, and audit logging, all integrated with a human-in-the-loop review process. Evaluations show our system reduces prompt-injection risk, blocks out-of-scope requests, and limits toxic outputs. It also generates auditable rationales aligned with AI governance frameworks. By treating policies as executable prompts (a policy-as-code for agents), this approach enables secure-by-design deployment, continuous compliance, and scalable AI safety and AI security assurance for regulatable ML.
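
As a rough illustration of compiling a source-linked policy tree into a prompt-based classifier, the sketch below flattens a small tree into a guardrail prompt. The `PolicyNode` structure, the prompt wording, and the example rules and source labels are hypothetical; the paper's actual tree format and prompts are not given in the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PolicyNode:
    rule: str
    source: str  # where the rule was extracted from (kept for traceability)
    children: List["PolicyNode"] = field(default_factory=list)


def flatten(node: PolicyNode, depth: int = 0) -> List[str]:
    """Depth-first listing of rules, indented by tree depth, each with its source."""
    lines = [f"{'  ' * depth}- {node.rule} [source: {node.source}]"]
    for child in node.children:
        lines.extend(flatten(child, depth + 1))
    return lines


def compile_prompt(root: PolicyNode, message: str) -> str:
    """Build a classifier prompt that asks for a decision plus the rule's source."""
    rules = "\n".join(flatten(root))
    return (
        "You are a guardrail classifier. Policy rules:\n"
        f"{rules}\n\n"
        f"Message:\n{message}\n\n"
        "Answer ALLOW or BLOCK and cite the source of the rule you applied."
    )


# Toy usage: rules and source labels are invented for the example.
tree = PolicyNode(
    rule="Only answer questions about order status",
    source="PRD section 2",
    children=[PolicyNode("Never reveal another customer's data", "TDD section 4")],
)
print(compile_prompt(tree, "What is the status of order 1234?"))
```

Carrying the source label through to the prompt is what lets the classifier's rationale be traced back to a specific document, in line with the provenance and audit goals described in the abstract.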

[67] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye

Main category: cs.CL

TL;DR: This paper proposes a framework to specialize MedGemma for generating high-fidelity medical image captions to improve multimodal Retrieval-Augmented Generation systems in clinical decision support.

Details

Motivation: Current Vision-Language Models lack clinical specificity and factual grounding for image-based queries in Malaysian Clinical Practice Guidelines, limiting the effectiveness of Retrieval-Augmented Generation systems.

Method: Employed knowledge distillation to create synthetic dataset across dermatology, fundus, and chest radiography domains, then fine-tuned MedGemma using QLoRA parameter-efficient method.

Result: The fine-tuned model showed substantial improvements in classification performance and significant gains in caption faithfulness and correctness as measured by RAGAS framework.

Conclusion: Established a robust pipeline for specializing medical VLMs and validated the model as a high-quality query generator for enhancing multimodal RAG systems in evidence-based clinical decision support.

Abstract: Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.

[68] What Can String Probability Tell Us About Grammaticality?

Jennifer Hu, Ethan Gotlieb Wilcox, Siyuan Song, Kyle Mahowald, Roger P. Levy

Main category: cs.CL

TL;DR: The paper analyzes whether language models learn grammatical knowledge by examining the relationship between string probability and grammaticality, using minimal pairs to validate theoretical predictions.

Details

Motivation: To understand what language models have learned about grammar, given that probability and grammaticality are distinct concepts in linguistics.

Method: Theoretical analysis of grammar-meaning-probability relationship, validated with 280K sentence pairs in English and Chinese using minimal pairs to test three predictions.

Result: Validated three predictions: correlation between probabilities of minimal pairs, correlation between model and human judgments on minimal pairs, and poor separation between grammatical/ungrammatical strings in probability space.

Conclusion: Provides theoretical grounding for using probability to study LMs’ structural knowledge and suggests directions for future grammatical evaluation of language models.

Abstract: What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM’s underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probability of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models’ and humans’ deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs’ structural knowledge, and suggest directions for future work in LM grammatical evaluation.
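
The first two predictions are statements about correlations over minimal pairs, and a small sketch may make them concrete. Here each pair is a tuple of log-probabilities (grammatical string, ungrammatical string), and a "delta" is their difference. The numbers are invented, and treating human deltas as a list of rating differences is an assumption about how the comparison is set up.

```python
import math
from typing import List, Sequence, Tuple


def pair_deltas(pairs: Sequence[Tuple[float, float]]) -> List[float]:
    """log p(grammatical) - log p(ungrammatical) for each minimal pair."""
    return [good - bad for good, bad in pairs]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Invented model log-probabilities for three minimal pairs.
model_pairs = [(-10.0, -12.0), (-8.0, -8.5), (-15.0, -19.0)]
# Invented human rating differences for the same pairs.
human_deltas = [1.0, 0.25, 2.0]

# Prediction (1): probabilities of the two strings in a pair are correlated.
good = [g for g, _ in model_pairs]
bad = [b for _, b in model_pairs]
print(pearson(good, bad))

# Prediction (2): model deltas track human deltas.
print(pearson(pair_deltas(model_pairs), human_deltas))
```

Working with within-pair deltas rather than raw probabilities is what controls for factors such as length and lexical frequency that both strings in a pair share.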

[69] Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Akshara Prabhakar, Roshan Ram, Zixiang Chen, Silvio Savarese, Frank Wang, Caiming Xiong, Huan Wang, Weiran Yao

Main category: cs.CL

TL;DR: EDR is a multi-agent system that transforms unstructured enterprise data into actionable insights through specialized search agents, adaptive planning, and visualization capabilities, outperforming state-of-the-art systems without human intervention.

Details

Motivation: Enterprises face challenges in converting growing unstructured data into coherent insights, with existing autonomous agents struggling with domain-specific nuances, intent alignment, and enterprise integration.

Method: Multi-agent system with Master Planning Agent for query decomposition, four specialized search agents (General, Academic, GitHub, LinkedIn), extensible MCP-based tool ecosystem, Visualization Agent, and reflection mechanism with optional human-in-the-loop guidance.

Result: EDR outperforms state-of-the-art agentic systems on open-ended benchmarks (DeepResearch Bench and DeepConsult) without human steering; its report generation, streaming, and deployment capabilities are validated on internal datasets.

Conclusion: The EDR framework enables automated report generation, real-time streaming, and seamless enterprise deployment, advancing multi-agent reasoning applications in enterprise settings.

Abstract: As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications. Code at https://github.com/SalesforceAIResearch/enterprise-deep-research and Dataset at https://huggingface.co/datasets/Salesforce/EDR-200

[70] Re:Member: Emotional Question Generation from Personal Memories

Zackary Rackauckas, Nobuaki Minematsu, Julia Hirschberg

Main category: cs.CL

TL;DR: Re:Member is an emotionally expressive language learning system that uses personal videos and stylized spoken questions to enhance second language acquisition through affective recall and conversational engagement.

Details

Motivation: To explore how emotionally expressive, memory-grounded interaction can support more engaging second language learning by leveraging users' personal videos and affective recall.

Method: Uses WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis in a modular generation pipeline. Aligns emotional tone with visual context using expressive speech styles like whispers or late-night tones.

Result: The system successfully generates stylized spoken questions in the target language that evoke specific moods and encourage conversational engagement.

Conclusion: Re:Member demonstrates the importance of affect and personal media in learner-centered educational technologies, highlighting how emotional expression and memory grounding can enhance language learning experiences.

Abstract: We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users’ personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.

[71] Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

Ling Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bingwei Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Qian, Chenchen Ju, Chenchen Li, Chengfu Tang, Chilin Fu, Chunshao Ren, Chunwei Wu, Cong Zhang, Cunyin Peng, Dafeng Xu, Daixin Wang, Dalong Zhang, Dingnan Jin, Dingyuan Zhu, Dongke Hu, Fangzheng Zhao, Feifan Wu, Feng Zhu, Gangshan Wang, Haitao Zhang, Hailin Zhao, Hanxiao Zhang, Hanzi Wang, Hao Qian, Haoyi Yu, Heng Zhang, Hongliang Zhang, Hongzhi Luan, Huirong Dong, Huizhong Li, Jia Li, Jia Liu, Jialong Zhu, Jian Sha, Jianping Wei, Jiaolong Yang, Jieyue Ma, Jiewei Wu, Jinjing Huang, Jingyun Tian, Jingyuan Zhang, Jinquan Sun, Juanhui Tu, Jun Liu, Jun Xu, Jun Zhou, Junjie Ou, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Liang, Lei Xu, Libo Zhang, Lin Ju, Lin Yuan, Ling Zhong, Lintao Ma, Lu Liu, Lu Yu, Lun Cai, Meiqi Zhu, Mengying Li, Min Chen, Minghao Xue, Minghong Cai, Mingming Yin, Peijie Jiang, Peilong Zhao, Pingping Liu, Qian Zhao, Qing Cui, Qingxiang Huang, Qingyuan Yang, Quankun Yu, Shaowei Wei, Shijie Lian, Shoujian Zheng, Shun Song, Shungen Zhang, Shuo Zhang, Siyuan Li, Song Liu, Ting Guo, Tong Zhao, Wanli Gu, Weichang Wu, Weiguang Han, Wenjing Fang, Wubin Wang, Xiang Shu, Xiao Shi, Xiaoshun Lan, Xiaolu Zhang, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xiong Xu, Xudong Wang, Xudong Wang, Xuemin Yang, Yajie Yang, Yang Xiang, Yanzhe Li, Yi Zhang, Yilong Wang, Yingxue Li, Yongzhen Guo, Yuzhuo Fu, Yuanyuan Wang, Yue Yang, Yue Yu, Yufeng Deng, Yun Zhang, Yunfei Yu, Yuqi Zhang, Yuxiao He, Zengke Gui, Zhaoxin Huan, Zhaoyang Wang, Zhibo Zhu, Zhihao Wang, Zhiqiang Zhang, Zhoufei Wang, Zihang Zeng, Ziqi Liu, Zitao Xuan, Zuoli Tang

Main category: cs.CL

TL;DR: Ling 2.0 is a reasoning-oriented language model series using Mixture-of-Experts architecture that scales from 16B to 1T parameters, achieving up to 7x compute efficiency through high sparsity and coordinated innovations across architecture, training, and infrastructure.

Details

Motivation: To create scalable and efficient reasoning-focused language models that demonstrate sparse activation can enable superior computational efficiency when properly aligned with reasoning objectives.

Method: Uses high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data with mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines.

Result: Ling-1T establishes a new Pareto frontier of reasoning accuracy vs computational efficiency, with the series achieving up to 7-fold active-compute efficiency compared to dense counterparts.

Conclusion: Sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence, providing a coherent foundation for advancing future reasoning and thinking models.

Abstract: We introduce Ling 2.0, a reasoning-oriented language foundation series built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.

[72] Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding

Main category: cs.CL

TL;DR: Learn-to-Ask is a simulator-free framework that transforms passive LLMs into proactive dialogue agents using offline expert data, enabling goal-oriented conversations by learning when to ask questions and when to stop.

Details

Motivation: Current approaches for proactive LLMs either optimize single-turn attributes or rely on brittle user simulators, creating a reality gap. There's a need for practical methods to make LLMs proactive partners in high-stakes domains.

Method: Reframes offline policy learning using observed future trajectories to infer dense, turn-by-turn rewards. Uses structured (action, state_assessment) tuples and Automated Grader Calibration to purge noise from LLM-based rewards with minimal human supervision.

Result: Demonstrated on a real-world medical dataset with LLMs of up to 32B parameters. The model was deployed in a live, large-scale online AI service and, in in-house evaluations, achieved performance superior to human experts.

Conclusion: Provides a practical blueprint for transforming passive LLMs into proactive, goal-oriented applications that can translate offline data into real-world impact.

Abstract: Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent “reality gap”. To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert’s revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework’s ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.

[73] ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai

Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan, Warit Sirichotedumrong, Adisai Na-Thalang, Kunat Pipatanakul

Main category: cs.CL

TL;DR: ThaiOCRBench is the first comprehensive benchmark for evaluating vision-language models on Thai text-rich visual understanding tasks, addressing the underrepresentation of Thai in existing benchmarks.

Details

Motivation: Existing multimodal benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding.

Method: Created a diverse, human-annotated dataset of 2,808 samples across 13 task categories and evaluated state-of-the-art VLMs in zero-shot settings, including both proprietary and open-source systems.

Result: Proprietary models (e.g., Gemini 2.5 Pro) significantly outperform open-source counterparts, with fine-grained text recognition and handwritten content extraction showing the steepest performance drops among open-source models.

Conclusion: ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings and identifies key challenges like language bias, structural mismatch, and hallucinated content for improving Thai-language document understanding.

Abstract: We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.

cs.CV

[74] GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder

Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang

Main category: cs.CV

TL;DR: The paper introduces Triple-S, the first benchmark for sticker semantic similarity, and proposes GSE, a lightweight model that learns robust sticker embeddings for better sticker understanding and retrieval.

Details

Motivation: Stickers are popular for visual communication but understanding their semantic relationships is challenging due to diverse and symbolic content. Existing models struggle with nuanced sticker semantics.

Method: Created Triple-S benchmark with 905 human-annotated sticker pairs. Proposed General Sticker Encoder (GSE) - a lightweight model that learns robust sticker embeddings using Triple-S and additional datasets.

Result: GSE achieves superior performance on unseen stickers and demonstrates strong results on downstream tasks like emotion classification and sticker-to-sticker retrieval. Existing pretrained models struggle with sticker semantics.

Conclusion: Triple-S and GSE provide standardized evaluation tools and robust embeddings, enabling future research in sticker understanding, retrieval, and multimodal content generation.

Abstract: Stickers have become a popular form of visual communication, yet understanding their semantic relationships remains challenging due to their highly diverse and symbolic content. In this work, we formally define the Sticker Semantic Similarity task and introduce Triple-S, the first benchmark for this task, consisting of 905 human-annotated positive and negative sticker pairs. Through extensive evaluation, we show that existing pretrained vision and multimodal models struggle to capture nuanced sticker semantics. To address this, we propose the General Sticker Encoder (GSE), a lightweight and versatile model that learns robust sticker embeddings using both Triple-S and additional datasets. GSE achieves superior performance on unseen stickers, and demonstrates strong results on downstream tasks such as emotion classification and sticker-to-sticker retrieval. By releasing both Triple-S and GSE, we provide standardized evaluation tools and robust embeddings, enabling future research in sticker understanding, retrieval, and multimodal content generation. The Triple-S benchmark and GSE have been publicly released.

[75] Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges

Adrian Azzarelli, Nantheera Anantrasirichai, David R Bull

Main category: cs.CV

TL;DR: A method that improves dynamic 3D reconstruction from sparse camera setups by separating foreground and background Gaussian splats, achieving better quality with smaller model size.

Details

Motivation: Address limitations of current methods that struggle with complex dynamic features when using sparse camera configurations common in filmmaking with tight budgets.

Method: Splits canonical Gaussians and deformation field into foreground/background components using sparse masks, trains them separately with different loss functions, and models different parameters for each deformation field based on their dynamic characteristics.

Result: Achieves state-of-the-art qualitative and quantitative results, with up to 3 PSNR higher and half the model size on 3D scenes, while producing segmented dynamic reconstructions without dense mask supervision.

Conclusion: The proposed foreground-background separation approach enables effective dynamic 3D reconstruction from sparse camera setups, outperforming existing methods in both quality and efficiency.

Abstract: Deformable Gaussian Splatting (GS) accomplishes photorealistic dynamic 3-D reconstruction from dense multi-view video (MVV) by learning to deform a canonical GS representation. However, in filmmaking, tight budgets can result in sparse camera configurations, which limit state-of-the-art (SotA) methods when capturing complex dynamic features. To address this issue, we introduce an approach that splits the canonical Gaussians and deformation field into foreground and background components using a sparse set of masks for frames at t=0. Each representation is trained separately with different loss functions during canonical pre-training. Then, during dynamic training, different parameters are modeled for each deformation field following common filmmaking practices. The foreground stage contains diverse dynamic features, so changes in color, position and rotation are learned. The background, which contains film crew and equipment, is typically dimmer and less dynamic, so only changes in point position are learned. Experiments on 3-D and 2.5-D entertainment datasets show that our method produces SotA qualitative and quantitative results: up to 3 PSNR higher with half the model size on 3-D scenes. Unlike the SotA and without the need for dense mask supervision, our method also produces segmented dynamic reconstructions including transparent and dynamic textures. Code and video comparisons are available online: https://interims-git.github.io/

[76] IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

Ali Faraz, Akash, Shaharukh Khan, Raja Kolla, Akshat Patidar, Suranjan Goswami, Abhinav Ravi, Chandra Khatri, Shubham Agarwal

Main category: cs.CV

TL;DR: IndicVisionBench is a large-scale multimodal benchmark for evaluating vision-language models on Indian cultural and linguistic diversity, covering 10 Indian languages and 13 culturally grounded topics across 3 tasks.

Details

Motivation: Most vision-language model evaluation benchmarks are Western-centric, leaving gaps in understanding model performance in culturally diverse and multilingual settings, particularly for the Indian subcontinent.

Method: Created a benchmark with ~5K images and 37K+ QA pairs covering English and 10 Indian languages across 3 multimodal tasks (OCR, MMT, VQA) with 6 question types and 13 cultural topics, plus a parallel corpus for bias analysis.

Result: Evaluation of 8 models revealed substantial performance gaps, highlighting limitations of current VLMs in culturally diverse contexts.

Conclusion: IndicVisionBench establishes a reproducible framework for more inclusive multimodal research by centering cultural diversity and multilinguality.

Abstract: Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covering 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.

[77] Knowledge-based anomaly detection for identifying network-induced shape artifacts

Rucha Deshpande, Tahsin Rahman, Miguel Lago, Adarsh Subbaswamy, Jana G. Delfino, Ghada Zamzmi, Elim Thompson, Aldo Badano, Seyed Kahaki

Main category: cs.CV

TL;DR: A novel knowledge-based anomaly detection method for identifying network-induced shape artifacts in synthetic medical images, using angle gradient analysis and isolation forests.

Details

Motivation: Synthetic data helps address data scarcity but may introduce artifacts that compromise model performance and clinical utility, requiring quality assessment methods.

Method: Two-stage framework: (1) feature extractor analyzing per-image distribution of angle gradients along anatomical boundaries, (2) isolation forest-based anomaly detector.

Result: Achieved AUC values of 0.97 and 0.91 on two mammography datasets, with human reader agreement rates of 66% and 68% for the most anomalous images.

Conclusion: The method enables responsible use of synthetic data by allowing developers to evaluate synthetic images against anatomic constraints and improve dataset quality.

Abstract: Synthetic data provides a promising approach to address data scarcity for training machine learning models; however, adoption without proper quality assessments may introduce artifacts, distortions, and unrealistic features that compromise model performance and clinical utility. This work introduces a novel knowledge-based anomaly detection method for detecting network-induced shape artifacts in synthetic images. The introduced method utilizes a two-stage framework comprising (i) a novel feature extractor that constructs a specialized feature space by analyzing the per-image distribution of angle gradients along anatomical boundaries, and (ii) an isolation forest-based anomaly detector. We demonstrate the effectiveness of the method for identifying network-induced shape artifacts in two synthetic mammography datasets from models trained on CSAW-M and VinDr-Mammo patient datasets respectively. Quantitative evaluation shows that the method successfully concentrates artifacts in the most anomalous partition (1st percentile), with AUC values of 0.97 (CSAW-syn) and 0.91 (VMLO-syn). In addition, a reader study involving three imaging scientists confirmed that images identified by the method as containing network-induced shape artifacts were also flagged by human readers with mean agreement rates of 66% (CSAW-syn) and 68% (VMLO-syn) for the most anomalous partition, approximately 1.5-2 times higher than the least anomalous partition. Kendall-Tau correlations between algorithmic and human rankings were 0.45 and 0.43 for the two datasets, indicating reasonable agreement despite the challenging nature of subtle artifact detection. This method is a step forward in the responsible use of synthetic data, as it allows developers to evaluate synthetic images for known anatomic constraints and pinpoint and address specific issues to improve the overall quality of a synthetic dataset.
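
The first stage described above, a per-image summary of angle changes along an anatomical boundary, can be sketched roughly as follows. This is only an illustration under stated assumptions: "angle gradient" is interpreted here as the turning angle between consecutive boundary segments, and the helper names `turning_angles` and `boundary_features` are hypothetical. The paper's actual feature space is not specified in the abstract, and the isolation forest stage is not reproduced here.

```python
import math


def turning_angles(boundary):
    """Signed change in heading between consecutive segments of a polyline.

    `boundary` is a list of (x, y) points; angles are wrapped to [-pi, pi).
    """
    headings = [
        math.atan2(y1 - y0, x1 - x0)
        for (x0, y0), (x1, y1) in zip(boundary, boundary[1:])
    ]
    angles = []
    for a, b in zip(headings, headings[1:]):
        d = (b - a + math.pi) % (2 * math.pi) - math.pi
        angles.append(d)
    return angles


def boundary_features(boundary):
    """Hypothetical per-image feature vector: (mean |angle|, max |angle|)."""
    angles = [abs(a) for a in turning_angles(boundary)]
    if not angles:
        return (0.0, 0.0)
    return (sum(angles) / len(angles), max(angles))


# A straight boundary has no turning; a stair-step boundary has sharp turns.
smooth = [(0, 0), (1, 0), (2, 0), (3, 0)]
kinked = [(0, 0), (1, 0), (1, 1), (2, 1)]
smooth_features = boundary_features(smooth)
kinked_features = boundary_features(kinked)
```

In a pipeline like the one the abstract outlines, feature vectors of this kind would be passed to an anomaly detector, so that images with unusually sharp boundary turns receive higher anomaly scores.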

[78] Holistic Evaluation of Multimodal LLMs on Spatial Intelligence

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, Hui En Pang, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang

Main category: cs.CV

TL;DR: EASI framework evaluates multimodal LLMs on spatial intelligence, revealing GPT-5 leads but still lags behind human performance significantly, with spatial tasks exposing greater model deficiencies than non-spatial ones.

Details

Motivation: To assess the current state of spatial understanding in leading multimodal models (GPT, Gemini, Grok, Seed, Qwen, Intern) and examine where they stand on the path toward spatial intelligence, especially with GPT-5's recent release.

Method: Proposed EASI framework with comprehensive taxonomy of spatial tasks that unifies existing benchmarks, standardized evaluation protocol, and empirical study across eight benchmarks using over ten billion tokens.

Result: GPT-5 shows unprecedented spatial intelligence strength but still significantly underperforms humans across a broad spectrum of spatial tasks. Spatial tasks reveal greater model capability gaps than non-spatial tasks, and proprietary models do not have a decisive advantage on the most difficult tasks.

Conclusion: Current multimodal models, including the most advanced GPT-5, still have substantial limitations in spatial understanding and reasoning compared to human capabilities, indicating significant room for improvement in this crucial aspect of artificial general intelligence.

Abstract: Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence. We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and a standardized protocol for the fair evaluation of state-of-the-art proprietary and open-source models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail even the most advanced multimodal models.

[79] CPO: Condition Preference Optimization for Controllable Image Generation

Zonglin Lyu, Ming Li, Xinxin Liu, Chen Chen

Main category: cs.CV

TL;DR: ControlNet++ improves controllability in text-to-image generation but has limitations in optimizing high-noise timesteps. The proposed Condition Preference Optimization (CPO) method performs preference learning over control conditions instead of images, achieving better controllability with lower variance and computational cost.

Details

Motivation: Existing methods like ControlNet++ optimize only low-noise timesteps using approximations, ignoring high-noise timesteps and introducing errors. DPO faces challenges in ensuring win-lose image pairs differ only in controllability while maintaining other factors constant.

Method: Propose Condition Preference Optimization (CPO) that performs preference learning over control signals (c^w and c^l) rather than generated images. This eliminates confounding factors and creates a low-variance training objective.

Result: CPO significantly improves controllability over ControlNet++: over 10% error rate reduction in segmentation, 70-80% in human pose, and consistent 2-5% reductions in edge and depth maps. Theoretically shows lower contrastive loss variance than DPO.

Conclusion: CPO provides a more effective approach for improving controllability in text-to-image generation by optimizing over control conditions rather than images, achieving superior results with lower computational requirements.

Abstract: To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps (e.g., $t < 200$) using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images ($I^{w}$) over less controllable ones ($I^{l}$). However, due to uncertainty in generative models, it is difficult to ensure that win–lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images. Specifically, we construct winning and losing control signals, $\mathbf{c}^{w}$ and $\mathbf{c}^{l}$, and train the model to prefer $\mathbf{c}^{w}$. This method, which we term Condition Preference Optimization (CPO), eliminates confounding factors and yields a low-variance training objective. Our approach theoretically exhibits lower contrastive loss variance than DPO and empirically achieves superior results. Moreover, CPO requires less computation and storage for dataset curation. Extensive experiments show that CPO significantly improves controllability over the state-of-the-art ControlNet++ across multiple control types: over $10\%$ error rate reduction in segmentation, $70$–$80\%$ in human pose, and consistent $2$–$5\%$ reductions in edge and depth maps.
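
To make the idea of "preferring" a winning condition over a losing one concrete, here is a generic Bradley-Terry-style logistic preference loss. It is a sketch only and not the paper's objective: the abstract does not give the CPO loss, so the function `preference_loss`, the scalar scores, and the `beta` scale are all assumptions introduced for illustration. One could imagine each score being, for example, the negative denoising error of the model under the corresponding control signal.

```python
import math


def preference_loss(score_win, score_lose, beta=1.0):
    """Logistic preference loss: -log(sigmoid(beta * (score_win - score_lose))).

    Hypothetical scalar scores stand in for how well the model fits the
    winning and losing control conditions; higher means a better fit.
    """
    margin = beta * (score_win - score_lose)
    # Numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the winning condition is scored increasingly higher.
tie = preference_loss(0.0, 0.0)
correct_order = preference_loss(2.0, 0.0)
wrong_order = preference_loss(0.0, 2.0)
```

With equal scores the loss equals log 2, and it decreases toward zero as the margin in favor of the winning condition grows.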

[80] Med-Banana-50K: A Cross-modality Large-Scale Dataset for Text-guided Medical Image Editing

Zhihui Chen, Mengling Feng

Main category: cs.CV

TL;DR: Med-Banana-50K is a comprehensive dataset of 50k medically curated image edits across chest X-ray, brain MRI, and fundus photography for 23 diseases, supporting bidirectional lesion editing with rigorous quality control.

Details

Motivation: The lack of large-scale, high-quality, and openly accessible datasets tailored for medical contexts with strict anatomical and clinical constraints has hindered progress in medical image editing.

Method: Dataset constructed using Gemini-2.5-Flash-Image based on real clinical images, with medically grounded quality control using LLM-as-Judge evaluation framework and iterative refinement over up to five rounds.

Result: Created Med-Banana-50K dataset with over 50k medically curated image edits spanning multiple modalities and diseases, including 37,000 failed editing attempts with full evaluation logs.

Conclusion: Med-Banana-50K establishes a critical foundation for developing and evaluating reliable medical image editing systems by offering a large-scale, medically rigorous, and fully documented resource.

Abstract: Medical image editing has emerged as a pivotal technology with broad applications in data augmentation, model interpretability, medical education, and treatment simulation. However, the lack of large-scale, high-quality, and openly accessible datasets tailored for medical contexts with strict anatomical and clinical constraints has significantly hindered progress in this domain. To bridge this gap, we introduce Med-Banana-50K, a comprehensive dataset of over 50k medically curated image edits spanning chest X-ray, brain MRI, and fundus photography across 23 diseases. Each sample supports bidirectional lesion editing (addition and removal) and is constructed using Gemini-2.5-Flash-Image based on real clinical images. A key differentiator of our dataset is the medically grounded quality control protocol: we employ an LLM-as-Judge evaluation framework with criteria such as instruction compliance, structural plausibility, image realism, and fidelity preservation, alongside iterative refinement over up to five rounds. Additionally, Med-Banana-50K includes around 37,000 failed editing attempts with full evaluation logs to support preference learning and alignment research. By offering a large-scale, medically rigorous, and fully documented resource, Med-Banana-50K establishes a critical foundation for developing and evaluating reliable medical image editing systems. Our dataset and code are publicly available at https://github.com/richardChenzhihui/med-banana-50k.

[81] DARN: Dynamic Adaptive Regularization Networks for Efficient and Robust Foundation Model Adaptation

Dhenenjay Yadav, Rohan Sawai

Main category: cs.CV

TL;DR: DARN is a novel decoder architecture that dynamically adapts regularization for geospatial foundation models, achieving state-of-the-art performance in both full fine-tuning and efficient adaptation scenarios.

Details

Motivation: Standard adaptation methods for foundation models in geospatial analysis use fixed regularization strategies that fail to account for the significant heterogeneity in satellite imagery, limiting their effectiveness.

Method: DARN integrates three innovations: Task Complexity Predictor (TCP) for per-sample difficulty estimation, Adaptive Dropout Modulation (ADM) for dynamic dropout rate adjustment, and Dynamic Capacity Gating (DCG) for channel activation modulation.

Result: DARN achieves new SOTA on GeoBench (86.66% mIoU, +5.56 pp) in full fine-tuning, and SOTA-competitive accuracy (90.5% mIoU) with superior OOD generalization (+9.5 pp), enhanced robustness (17% error reduction), and improved minority class performance in efficient adaptation.

Conclusion: DARN provides a more intelligent, robust, and efficient approach to leveraging foundation models in critical geospatial applications by dynamically adapting regularization based on task complexity.

Abstract: Foundation models (FMs) offer powerful representations for geospatial analysis, but adapting them effectively remains challenging. Standard adaptation methods, whether full fine-tuning or efficient frozen-backbone approaches, typically employ decoders with fixed regularization strategies, failing to account for the significant heterogeneity in satellite imagery. We introduce Dynamic Adaptive Regularization Networks (DARN), a novel decoder architecture designed to address this limitation. DARN integrates three key innovations: (1) a lightweight Task Complexity Predictor (TCP) that estimates per-sample difficulty, (2) Adaptive Dropout Modulation (ADM), dynamically adjusting dropout rates (from 0.1 to 0.5) based on predicted complexity, and (3) Dynamic Capacity Gating (DCG) that modulates channel activation. We provide theoretical justifications linking DARN’s optimization to stationary point convergence and its mechanism to adaptive information bottlenecks. Empirically, DARN demonstrates exceptional performance across both major adaptation paradigms. In full fine-tuning (unfrozen backbone), DARN achieves a new state-of-the-art on the multi-task GeoBench benchmark (86.66% mIoU, +5.56 pp over prior SOTA). In efficient adaptation (frozen backbone), DARN achieves SOTA-competitive accuracy (90.5% mIoU on Sen1Floods11) while delivering substantial advantages crucial for real-world deployment: superior out-of-distribution (OOD) generalization (+9.5 pp mIoU on AI4SmallFarms), enhanced robustness (17% relative reduction in corruption error), and improved performance on minority classes. DARN offers a more intelligent, robust, and efficient approach to leveraging FMs in critical geospatial applications.
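
The abstract states only that Adaptive Dropout Modulation adjusts dropout rates between 0.1 and 0.5 based on predicted complexity. The sketch below shows one simple way such a mapping could look. The linear interpolation, the assumption that complexity lies in [0, 1], the direction (harder samples get more dropout), and the function name `adaptive_dropout_rate` are all assumptions made for illustration, not details from the paper.

```python
def adaptive_dropout_rate(complexity, p_min=0.1, p_max=0.5):
    """Map a predicted per-sample complexity to a dropout rate.

    Assumes `complexity` is a score in [0, 1]; values outside that range
    are clamped. The linear mapping is an illustrative choice.
    """
    c = min(1.0, max(0.0, complexity))
    return p_min + (p_max - p_min) * c


# Easy, medium, and hard samples under the assumed linear mapping.
rates = [adaptive_dropout_rate(c) for c in (0.0, 0.5, 1.0)]
```

Under these assumptions the three samples receive dropout rates of roughly 0.1, 0.3, and 0.5.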

[82] Global 3D Reconstruction of Clouds & Tropical Cyclones

Shirin Ermis, Cesar Aybar, Lilli Freischem, Stella Girtsou, Kyriaki-Margarita Bintsi, Emiliano Diaz Salas-Porras, Michael Eisinger, William Jones, Anna Jungbluth, Benoit Tremblay

Main category: cs.CV

TL;DR: A new framework using pre-training and fine-tuning to create 3D cloud maps from 2D satellite imagery, specifically designed for tropical cyclones and intense storms.

Details

Motivation: Accurate forecasting of tropical cyclones is challenging due to limited satellite observations of TC structure and difficulties in resolving cloud properties involved in TC intensification.

Method: Pre-training–fine-tuning pipeline that learns from multiple satellites with global coverage to translate 2D satellite imagery into 3D cloud maps of relevant cloud properties, applied to a custom-built TC dataset.

Result: First-ever creation of global instantaneous 3D cloud maps and accurate reconstruction of 3D structure of intense storms, extending available satellite observations and providing estimates when observations are missing.

Conclusion: This framework is crucial for advancing understanding of TC intensification and improving forecasts by providing comprehensive 3D cloud structure data.

Abstract: Accurate forecasting of tropical cyclones (TCs) remains challenging due to limited satellite observations probing TC structure and difficulties in resolving cloud properties involved in TC intensification. Recent research has demonstrated the capabilities of machine learning methods for 3D cloud reconstruction from satellite observations. However, existing approaches have been restricted to regions where TCs are uncommon, and are poorly validated for intense storms. We introduce a new framework, based on a pre-training–fine-tuning pipeline, that learns from multiple satellites with global coverage to translate 2D satellite imagery into 3D cloud maps of relevant cloud properties. We apply our model to a custom-built TC dataset to evaluate performance in the most challenging and relevant conditions. We show that we can - for the first time - create global instantaneous 3D cloud maps and accurately reconstruct the 3D structure of intense storms. Our model not only extends available satellite observations but also provides estimates when observations are missing entirely. This is crucial for advancing our understanding of TC intensification and improving forecasts.

[83] Automatic segmentation of colorectal liver metastases for ultrasound-based navigated resection

Tiziano Natali, Karin A. Olthof, Niels F. M. Kok, Koert F. D. Kuhlmann, Theo J. M. Ruers, Matteo Fusaglia

Main category: cs.CV

TL;DR: Automatic 3D segmentation of colorectal liver metastases in intraoperative ultrasound using a cropped 3D U-Net achieves near real-time performance with expert-level accuracy, enabling efficient ultrasound-based navigation for liver surgery.

Details

Motivation: Accurate delineation of colorectal liver metastases during surgery is challenging with intraoperative ultrasound due to low contrast, noise, and operator dependency. Automated segmentation could enhance precision and efficiency in ultrasound-based navigation workflows.

Method: Used 85 tracked 3D intraoperative ultrasound volumes from CRLM patients to train and evaluate a 3D U-Net via nnU-Net framework. Compared two variants: full-volume training vs cropped regions around tumors. Integrated workflow into 3D Slicer for real-time intraoperative use.

Result: Cropped-volume model significantly outperformed full-volume model (AUC-ROC = 0.898 vs 0.718). Achieved median DSC = 0.74, recall = 0.79, and HDist. = 17.1 mm, comparable to semi-automatic segmentation but ~4x faster (~1 min). Prospective testing confirmed robust performance for real-time surgical guidance.

Conclusion: Automatic 3D segmentation of CRLM in intraoperative ultrasound using cropped 3D U-Net provides reliable, near real-time results with minimal operator input, enabling efficient registration-free ultrasound-based navigation with expert-level accuracy while reducing manual workload and procedure time.

Abstract: Introduction: Accurate intraoperative delineation of colorectal liver metastases (CRLM) is crucial for achieving negative resection margins but remains challenging using intraoperative ultrasound (iUS) due to low contrast, noise, and operator dependency. Automated segmentation could enhance precision and efficiency in ultrasound-based navigation workflows. Methods: Eighty-five tracked 3D iUS volumes from 85 CRLM patients were used to train and evaluate a 3D U-Net implemented via the nnU-Net framework. Two variants were compared: one trained on full iUS volumes and another on cropped regions around tumors. Segmentation accuracy was assessed using Dice Similarity Coefficient (DSC), Hausdorff Distance (HDist.), and Relative Volume Difference (RVD) on retrospective and prospective datasets. The workflow was integrated into 3D Slicer for real-time intraoperative use. Results: The cropped-volume model significantly outperformed the full-volume model across all metrics (AUC-ROC = 0.898 vs 0.718). It achieved median DSC = 0.74, recall = 0.79, and HDist. = 17.1 mm, comparable to semi-automatic segmentation but with ~4x faster execution (~1 min). Prospective intraoperative testing confirmed robust and consistent performance, with clinically acceptable accuracy for real-time surgical guidance. Conclusion: Automatic 3D segmentation of CRLM in iUS using a cropped 3D U-Net provides reliable, near real-time results with minimal operator input. The method enables efficient, registration-free ultrasound-based navigation for hepatic surgery, approaching expert-level accuracy while substantially reducing manual workload and procedure time.
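
For readers unfamiliar with two of the metrics named above, the following sketch computes the Dice Similarity Coefficient and a Relative Volume Difference on toy binary masks. The masks are flattened to 0/1 lists for simplicity; the sign convention for RVD (predicted minus reference, divided by reference) and the treatment of two empty masks as a perfect match are assumptions, since the abstract does not state which conventions the authors used.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_volume_difference(pred, truth):
    """(predicted volume - reference volume) / reference volume."""
    pred_volume = sum(1 for p in pred if p)
    truth_volume = sum(1 for t in truth if t)
    if truth_volume == 0:
        raise ValueError("reference mask is empty")
    return (pred_volume - truth_volume) / truth_volume


# Toy example: same volume, but shifted by one voxel.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
dsc = dice_coefficient(pred, truth)
rvd = relative_volume_difference(pred, truth)
```

The toy case illustrates why the metrics are reported together: the two masks have identical volume (RVD of zero) yet overlap only partially (DSC of two thirds).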

[84] EETnet: a CNN for Gaze Detection and Tracking for Smart-Eyewear

Andrea Aspesi, Andrea Simpsi, Aaron Tognoli, Simone Mentasti, Luca Merigo, Matteo Matteucci

Main category: cs.CV

TL;DR: EETnet is a CNN for event-based eye tracking that runs on microcontrollers, using classification and regression approaches for pupil detection.

Details

Motivation: Existing event-based eye tracking solutions require powerful GPUs and lack deployment on embedded devices with limited resources.

Method: Developed EETnet CNN for event-based eye tracking, with classification (grid-based pupil detection) and regression (pixel-level) versions, plus training/evaluation/quantization methodology.

Result: Created a neural network capable of running on microcontrollers using purely event-based data for eye tracking.

Conclusion: Successfully designed and implemented EETnet for efficient, low-power eye tracking on resource-constrained embedded devices.

Abstract: Event-based cameras are becoming a popular solution for efficient, low-power eye tracking. Due to the sparse and asynchronous nature of event data, they require less processing power and offer latencies in the microsecond range. However, many existing solutions are limited to validation on powerful GPUs, with no deployment on real embedded devices. In this paper, we present EETnet, a convolutional neural network designed for eye tracking using purely event-based data, capable of running on microcontrollers with limited resources. Additionally, we outline a methodology to train, evaluate, and quantize the network using a public dataset. Finally, we propose two versions of the architecture: a classification model that detects the pupil on a grid superimposed on the original image, and a regression model that operates at the pixel level.
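
The classification variant described above predicts which cell of a grid superimposed on the image contains the pupil. A minimal sketch of how a pixel-level pupil position could be converted to such a class label, and back to an approximate position, is shown below. The image size, the grid size, the row-major cell numbering, and both function names are hypothetical; the abstract does not specify them.

```python
def pupil_to_grid_cell(x, y, width, height, cols, rows):
    """Return the row-major index of the grid cell containing pixel (x, y)."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pupil position lies outside the image")
    col = int(x * cols / width)
    row = int(y * rows / height)
    return row * cols + col


def grid_cell_center(cell, width, height, cols, rows):
    """Return the pixel coordinates of the center of a grid cell."""
    row, col = divmod(cell, cols)
    return ((col + 0.5) * width / cols, (row + 0.5) * height / rows)


# Hypothetical 64x48 image with an 8x6 grid (48 classes).
label = pupil_to_grid_cell(20, 30, width=64, height=48, cols=8, rows=6)
center = grid_cell_center(label, width=64, height=48, cols=8, rows=6)
```

The trade-off this makes visible is resolution: a classifier can only localize the pupil to within one cell (here 8 by 8 pixels), whereas the regression variant operates at the pixel level.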

[85] 3D Gaussian Point Encoders

Jim James, Ben Wilson, Simon Lucey, James Hays

Main category: cs.CV

TL;DR: 3D Gaussian Point Encoder is an explicit geometric representation using learned 3D Gaussians for 3D recognition, offering faster performance and better parameter efficiency than traditional PointNets.

Details

Motivation: To move from implicit representations like PointNet to explicit geometric representations for 3D recognition tasks, similar to the shift from NeRF to Gaussian Splatting in 3D reconstruction.

Method: Developed optimization techniques using natural gradients and distillation from PointNets to learn a Gaussian basis that reconstructs PointNet activations, and extended filtering techniques from 3D Gaussian Splatting.

Result: Ran 2.7x faster than a PointNet of comparable accuracy while using 46% less memory and 88% fewer FLOPs; as a component in Mamba3D, ran 1.27x faster with 42% less memory and 54% fewer FLOPs.

Conclusion: 3D Gaussian Point Encoders provide efficient explicit geometric representation for 3D recognition, enabling high framerates even on CPU-only devices while maintaining accuracy.

Abstract: In this work, we introduce the 3D Gaussian Point Encoder, an explicit per-point embedding built on mixtures of learned 3D Gaussians. This explicit geometric representation for 3D recognition tasks is a departure from widely used implicit representations such as PointNet. However, it is difficult to learn 3D Gaussian encoders in an end-to-end fashion with standard optimizers. We develop optimization techniques based on natural gradients and distillation from PointNets to find a Gaussian Basis that can reconstruct PointNet activations. The resulting 3D Gaussian Point Encoders are faster and more parameter efficient than traditional PointNets. As in the 3D reconstruction literature where there has been considerable interest in the move from implicit (e.g., NeRF) to explicit (e.g., Gaussian Splatting) representations, we can take advantage of computational geometry heuristics to accelerate 3D Gaussian Point Encoders further. We extend filtering techniques from 3D Gaussian Splatting to construct encoders that run 2.7 times faster than a PointNet of comparable accuracy while using 46% less memory and 88% fewer FLOPs. Furthermore, we demonstrate the effectiveness of 3D Gaussian Point Encoders as a component in Mamba3D, running 1.27 times faster and achieving a reduction in memory and FLOPs by 42% and 54% respectively. 3D Gaussian Point Encoders are lightweight enough to achieve high framerates on CPU-only devices.
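To give a feel for what an explicit per-point Gaussian embedding might look like, the sketch below evaluates each point against a small set of Gaussians and pools over the cloud. It rests on several assumptions not confirmed by the abstract: diagonal covariances, unnormalized Gaussian responses, and PointNet-style max pooling. The function names are hypothetical.

```python
import math

def gaussian_embedding(point, means, inv_vars):
    """Response of one 3D point to each Gaussian (diagonal covariance assumed)."""
    return [
        math.exp(-0.5 * sum(iv * (p - m) ** 2 for p, m, iv in zip(point, mu, inv_var)))
        for mu, inv_var in zip(means, inv_vars)
    ]

def encode_cloud(points, means, inv_vars):
    """Global feature: channel-wise max over per-point embeddings (assumed pooling)."""
    per_point = [gaussian_embedding(p, means, inv_vars) for p in points]
    return [max(channel) for channel in zip(*per_point)]
```

Because each response decays quickly with distance from the Gaussian's mean, points far from a Gaussian contribute almost nothing to that channel, which is the kind of property a filtering heuristic can exploit to skip work.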

[86] Data Efficiency and Transfer Robustness in Biomedical Image Segmentation: A Study of Redundancy and Forgetting with Cellpose

Shuo Zhao, Jianxu Chen

Main category: cs.CV

TL;DR: This paper analyzes data redundancy and catastrophic forgetting in biomedical image segmentation using Cellpose, showing that only 10% of training data is needed for performance saturation, and proposes dataset quantization and selective replay strategies to mitigate forgetting during cross-domain transfer.

Details

Motivation: To address two underexplored challenges in generalist biomedical image segmentation models: training data redundancy and the impact of cross-domain transfer on model retention (catastrophic forgetting).

Method: Used Cellpose as a case study, proposed dataset quantization (DQ) for compact training subsets, performed cross-domain fine-tuning experiments, and implemented selective DQ-based replay with 5-10% source data to mitigate forgetting.

Result: Image segmentation performance saturates with only 10% of data, revealing substantial redundancy. Cross-domain fine-tuning causes significant source domain degradation, but selective replay effectively restores source performance while full replay hinders target adaptation. Training domain sequencing improves generalization.

Conclusion: Efficient biomedical image segmentation requires both compact training subsets and retention-aware learning strategies with informed domain ordering, highlighting the importance of data-centric design.

Abstract: Generalist biomedical image segmentation models such as Cellpose are increasingly applied across diverse imaging modalities and cell types. However, two critical challenges remain underexplored: (1) the extent of training data redundancy and (2) the impact of cross-domain transfer on model retention. In this study, we conduct a systematic empirical analysis of these challenges using Cellpose as a case study. First, to assess data redundancy, we propose a simple dataset quantization (DQ) strategy for constructing compact yet diverse training subsets. Experiments on the Cyto dataset show that image segmentation performance saturates with only 10% of the data, revealing substantial redundancy and potential for training with minimal annotations. Latent space analysis using MAE embeddings and t-SNE confirms that DQ-selected patches capture greater feature diversity than random sampling. Second, to examine catastrophic forgetting, we perform cross-domain fine-tuning experiments and observe significant degradation in source-domain performance, particularly when adapting from generalist to specialist domains. We demonstrate that selective DQ-based replay, reintroducing just 5-10% of the source data, effectively restores source performance, while full replay can hinder target adaptation. Additionally, we find that training domain sequencing improves generalization and reduces forgetting in multi-stage transfer. Our findings highlight the importance of data-centric design in biomedical image segmentation and suggest that efficient training requires not only compact subsets but also retention-aware learning strategies and informed domain ordering. The code is available at https://github.com/MMV-Lab/biomedseg-efficiency.
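The replay idea can be sketched as a data-mixing step: a small fraction of source-domain samples is added back to the target fine-tuning set. In this illustration the replay subset is drawn uniformly at random, which is a stand-in for the DQ-based selection described in the abstract; `build_replay_set` and its parameters are hypothetical names.

```python
import random

def build_replay_set(source, target, replay_fraction=0.1, seed=0):
    """Mix a small random subset of source samples into the target fine-tuning set."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_replay = round(replay_fraction * len(source))
    # Random sampling here is an assumption; the paper selects replay data with DQ.
    replay = rng.sample(source, n_replay)
    mixed = list(target) + replay
    rng.shuffle(mixed)
    return mixed
```

Setting `replay_fraction=1.0` corresponds to full replay, which the abstract reports can hinder adaptation to the target domain.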

[87] An Active Learning Pipeline for Biomedical Image Instance Segmentation with Minimal Human Intervention

Shuo Zhao, Yu Zhou, Jianxu Chen

Main category: cs.CV

TL;DR: A data-centric AI workflow combining foundation models and nnU-Net with active learning and pseudo-labeling to reduce manual annotation needs in biomedical image segmentation.

Details

Motivation: Address limitations of traditional methods (noise sensitivity), nnU-Net (requires extensive annotated data), and foundation models (underperformance on specialized datasets) in biomedical image segmentation.

Method: A pipeline that generates pseudo-labels from a foundation model for nnU-Net self-configuration, selects a representative core-set for minimal manual annotation, and fine-tunes the nnU-Net model.

Result: Significantly reduces manual annotation requirements while maintaining competitive segmentation performance.

Conclusion: Provides an accessible solution for biomedical researchers to apply state-of-the-art AI techniques with minimal human intervention in segmentation tasks.

Abstract: Biomedical image segmentation is critical for precise structure delineation and downstream analysis. Traditional methods often struggle with noisy data, while deep learning models such as U-Net have set new benchmarks in segmentation performance. nnU-Net further automates model configuration, making it adaptable across datasets without extensive tuning. However, it requires a substantial amount of annotated data for cross-validation, posing a challenge when only raw images but no labels are available. Large foundation models offer zero-shot generalizability, but may underperform on specific datasets with unique characteristics, limiting their direct use for analysis. This work addresses these bottlenecks by proposing a data-centric AI workflow that leverages active learning and pseudo-labeling to combine the strengths of traditional neural networks and large foundation models while minimizing human intervention. The pipeline starts by generating pseudo-labels from a foundation model, which are then used for nnU-Net’s self-configuration. Subsequently, a representative core-set is selected for minimal manual annotation, enabling effective fine-tuning of the nnU-Net model. This approach significantly reduces the need for manual annotations while maintaining competitive performance, providing an accessible solution for biomedical researchers to apply state-of-the-art AI techniques in their segmentation tasks. The code is available at https://github.com/MMV-Lab/AL_BioMed_img_seg.
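The abstract does not say how the representative core-set is chosen. One widely used option is k-center greedy selection over image feature vectors, sketched below as an assumption rather than as the authors' method. The function name and the choice of Euclidean distance are illustrative.

```python
import math

def k_center_greedy(features, k, first=0):
    """Pick k indices so that every sample is close to some selected sample."""
    if k < 1 or not features:
        raise ValueError("need k >= 1 and a non-empty feature list")
    selected = [first]
    # Distance from each sample to its nearest selected sample so far.
    min_dist = [math.dist(f, features[first]) for f in features]
    while len(selected) < min(k, len(features)):
        nxt = max(range(len(features)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [min(d, math.dist(f, features[nxt])) for d, f in zip(min_dist, features)]
    return selected
```

Each step picks the sample farthest from everything already chosen, so the selected images spread across the feature space instead of clustering in its densest region. Those are the images an annotator would then label by hand.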

[88] Geometry Denoising with Preferred Normal Vectors

Manuel Weiß, Lukas Baumgärtner, Roland Herzog, Stephan Schmidt

Main category: cs.CV

TL;DR: A geometry denoising method using surface normal priors and segmentation via label vectors, solved with split Bregman optimization.

Details

Motivation: To leverage prior knowledge about preferred surface normal vectors for more effective geometry denoising.

Method: Uses label vectors as normal priors, embeds segmentation in denoising process, applies total variation regularization, and solves with split Bregman (ADMM) approach with vertex updates based on second-order shape calculus.

Result: A novel paradigm that integrates segmentation and denoising through normal vector similarity and regularization.

Conclusion: The approach successfully combines geometry denoising with segmentation using normal vector priors and efficient optimization methods.

Abstract: We introduce a new paradigm for geometry denoising using prior knowledge about the surface normal vector. This prior knowledge comes in the form of a set of preferred normal vectors, which we refer to as label vectors. A segmentation problem is naturally embedded in the denoising process. The segmentation is based on the similarity of the normal vector to the elements of the set of label vectors. Regularization is achieved by a total variation term. We formulate a split Bregman (ADMM) approach to solve the resulting optimization problem. The vertex update step is based on second-order shape calculus.

[89] Self-Supervised Implicit Attention Priors for Point Cloud Reconstruction

Kyle Fogarty, Chenyue Cai, Jing Yang, Zhilin Guo, Cengiz Öztireli

Main category: cs.CV

TL;DR: An implicit self-prior approach that learns shape-specific priors directly from input point clouds using cross-attention with a learnable dictionary, enabling high-quality surface reconstruction without external training data.

Details

Motivation: Recovering high-quality surfaces from irregular point clouds is ill-posed without strong geometric priors, and existing methods often require external training data or fail to preserve fine details.

Method: Jointly trains a dictionary of learnable embeddings with an implicit distance field using cross-attention, then samples the trained field to extract dense points and normals for integration with robust implicit moving least squares (RIMLS).

Result: Outperforms both classical and learning-based approaches in generating high-fidelity surfaces with superior detail preservation and robustness to common data degradations.

Conclusion: The self-prior approach effectively captures and reuses repeating structures from input data, enabling high-quality surface reconstruction while preserving input fidelity through a hybrid strategy.

Abstract: Recovering high-quality surfaces from irregular point clouds is ill-posed unless strong geometric priors are available. We introduce an implicit self-prior approach that distills a shape-specific prior directly from the input point cloud itself and embeds it within an implicit neural representation. This is achieved by jointly training a small dictionary of learnable embeddings with an implicit distance field; at every query location, the field attends to the dictionary via cross-attention, enabling the network to capture and reuse repeating structures and long-range correlations inherent to the shape. Optimized solely with self-supervised point cloud reconstruction losses, our approach requires no external training data. To effectively integrate this learned prior while preserving input fidelity, the trained field is then sampled to extract densely distributed points and analytic normals via automatic differentiation. We integrate the resulting dense point cloud and corresponding normals into a robust implicit moving least squares (RIMLS) formulation. We show this hybrid strategy preserves fine geometric details in the input data, while leveraging the learned prior to regularize sparse regions. Experiments show that our method outperforms both classical and learning-based approaches in generating high-fidelity surfaces with superior detail preservation and robustness to common data degradations.
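The attention step described in the abstract, where a query location attends to a small dictionary, follows the standard scaled dot-product form. The sketch below shows a single query and a single head. The learned projections that would produce the query, keys, and values are omitted, so treating the dictionary entries directly as keys and values is a simplifying assumption.

```python
import math

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention of one query over a dictionary."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights
```

Query locations that produce similar attention weights read out similar dictionary content. This gives some intuition for how a shared dictionary could let distant but similar regions of a shape reuse the same learned structure.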

[90] Clinical-ComBAT: a diffusion-weighted MRI harmonization method for clinical applications

Gabriel Girard, Manon Edde, Félix Dumais, Yoan David, Matthieu Dumont, Guillaume Theaud, Jean-Christophe Houde, Arnaud Boré, Maxime Descoteaux, Pierre-Marc Jodoin

Main category: cs.CV

TL;DR: Clinical-ComBAT is a flexible diffusion MRI harmonization method that addresses limitations of traditional ComBAT by enabling independent site harmonization, non-linear modeling, and adaptation to small cohorts for real-world clinical use.

Details

Motivation: Current DW-MRI harmonization methods like ComBAT have limitations including linear covariate assumptions, fixed site requirements, and poor performance with small cohorts, which constrain clinical applicability.

Method: Clinical-ComBAT uses independent site harmonization with non-linear polynomial modeling, site-specific reference to normative data, variance priors for small cohorts, hyperparameter tuning, and goodness-of-fit assessment.

Result: The method shows improved alignment of diffusion metrics on both simulated and real data, with enhanced applicability for normative modeling compared to traditional approaches.

Conclusion: Clinical-ComBAT provides a more flexible and practical solution for diffusion MRI harmonization in real-world clinical settings, overcoming key limitations of existing methods.

Abstract: Diffusion-weighted magnetic resonance imaging (DW-MRI) derived scalar maps are effective for assessing neurodegenerative diseases and microstructural properties of white matter in a large number of brain conditions. However, DW-MRI inherently limits the combination of data from multiple acquisition sites without harmonization to mitigate scanner-specific biases. While the widely used ComBAT method reduces site effects in research, its reliance on linear covariate relationships, homogeneous populations, fixed site numbers, and well-populated sites constrains its clinical use. To overcome these limitations, we propose Clinical-ComBAT, a method designed for real-world clinical scenarios. Clinical-ComBAT harmonizes each site independently, enabling flexibility as new data and clinics are introduced. It incorporates a non-linear polynomial data model, site-specific harmonization referenced to a normative site, and variance priors adaptable to small cohorts. It further includes hyperparameter tuning and a goodness-of-fit metric for harmonization assessment. We demonstrate its effectiveness on simulated and real data, showing improved alignment of diffusion metrics and enhanced applicability for normative modeling.
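The core idea of harmonizing one site against a reference can be illustrated with a plain location-scale adjustment. This is a deliberately reduced sketch: it ignores covariates, the polynomial data model, and the variance priors that the abstract attributes to Clinical-ComBAT. The function name and arguments are hypothetical.

```python
from statistics import fmean, pstdev

def harmonize_to_reference(site_values, ref_mean, ref_std):
    """Shift and rescale one site's measurements to match a reference site's statistics."""
    site_mean = fmean(site_values)
    site_std = pstdev(site_values)
    if site_std == 0:
        raise ValueError("site has zero variance; scale cannot be estimated")
    return [(v - site_mean) / site_std * ref_std + ref_mean for v in site_values]
```

Each site is adjusted using only its own data and the reference statistics, so a new clinic can be added without refitting the others. This is the flexibility the abstract highlights. With very few subjects the site mean and standard deviation become unreliable, which is where priors on the variance would come in.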

[91] Validating Vision Transformers for Otoscopy: Performance and Data-Leakage Effects

James Ndubuisi, Fernando Auat, Marta Vallejo

Main category: cs.CV

TL;DR: This study evaluates Swin transformers for ear disease diagnosis, initially showing high accuracy (100% for Swin v1, 99.1% for Swin v2) but later discovering data leakage that reduced performance to 83% after correction, highlighting the importance of rigorous data preprocessing in medical ML.

Details

Motivation: To improve diagnostic accuracy for ear diseases given the 27% misdiagnosis rate among specialist otolaryngologists, by comparing vision transformer models with traditional CNNs.

Method: Used Swin v1 and Swin v2 transformer models on otoscopic videos from a clinical hospital, with frame selection based on Laplacian and Shannon entropy thresholds and removal of blank frames.

Result: Initial results showed excellent performance (100% Swin v1, 99.1% Swin v2, 99.5% ResNet), but after discovering and mitigating data leakage, corrected accuracies dropped to 83% for both Swin models and 82% for ResNet.

Conclusion: Vision transformers show promise but require optimal balance between advanced architectures and effective data preprocessing for reliable medical diagnosis models.

Abstract: This study evaluates the efficacy of vision transformer models, specifically Swin transformers, in enhancing the diagnostic accuracy of ear diseases compared to traditional convolutional neural networks. With a reported 27% misdiagnosis rate among specialist otolaryngologists, improving diagnostic accuracy is crucial. The research utilised a real-world dataset from the Department of Otolaryngology at the Clinical Hospital of the Universidad de Chile, comprising otoscopic videos of ear examinations depicting various middle and external ear conditions. Frames were selected based on the Laplacian and Shannon entropy thresholds, with blank frames removed. Initially, Swin v1 and Swin v2 transformer models achieved accuracies of 100% and 99.1%, respectively, marginally outperforming the ResNet model (99.5%). These results surpassed metrics reported in related studies. However, the evaluation uncovered a critical data leakage issue in the preprocessing step, affecting both this study and related research using the same raw dataset. After mitigating the data leakage, model performance decreased significantly. Corrected accuracies were 83% for both Swin v1 and Swin v2, and 82% for the ResNet model. This finding highlights the importance of rigorous data handling in machine learning studies, especially in medical applications. The findings indicate that while vision transformers show promise, it is essential to find an optimal balance between the benefits of advanced model architectures and those derived from effective data preprocessing. This balance is key to developing a reliable machine learning model for diagnosing ear diseases.
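The frame-selection step can be sketched with two simple per-frame scores: the variance of the Laplacian as a sharpness measure and the Shannon entropy of the intensity histogram as an information measure. The thresholds, the 4-neighbour Laplacian kernel, and the function names are assumptions for illustration; the abstract does not give the exact settings.

```python
import math

def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian over interior pixels (needs at least 3x3)."""
    h, w = len(img), len(img[0])
    vals = [
        img[r - 1][c] + img[r + 1][c] + img[r][c - 1] + img[r][c + 1] - 4 * img[r][c]
        for r in range(1, h - 1)
        for c in range(1, w - 1)
    ]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def shannon_entropy(img, levels=256):
    """Entropy in bits of the intensity histogram (integer grey levels assumed)."""
    counts = [0] * levels
    for row in img:
        for v in row:
            counts[v] += 1
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def keep_frame(img, min_sharpness, min_entropy):
    """Keep a frame only if it is both sharp enough and informative enough."""
    return laplacian_variance(img) >= min_sharpness and shannon_entropy(img) >= min_entropy
```

A blank frame scores zero on both measures and is rejected by any positive threshold.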

[92] Beta Distribution Learning for Reliable Roadway Crash Risk Assessment

Ahmad Elallaf, Nathan Jacobs, Xinyue Ye, Mei Chen, Gongbo Liang

Main category: cs.CV

TL;DR: A geospatial deep learning framework using satellite imagery to predict fatal crash risks with uncertainty-aware Beta probability distributions, achieving 17-23% recall improvement over baselines.

Details

Motivation: Traditional traffic safety studies examine risk factors in isolation and overlook the spatial complexity of the built environment. Conventional neural networks provide point estimates without uncertainty, limiting decision-making utility in safety-critical applications.

Method: Novel geospatial deep learning framework that leverages satellite imagery as comprehensive spatial input to capture nuanced spatial patterns and environmental risk factors. The model estimates full Beta probability distributions over fatal crash risk rather than deterministic outputs.

Result: Model outperforms baselines with 17-23% improvement in recall (key metric for flagging potential dangers) and delivers superior calibration. Provides reliable and interpretable risk assessments from satellite imagery alone.

Conclusion: The framework enables safer autonomous navigation and offers a highly scalable tool for urban planners and policymakers to enhance roadway safety equitably and cost-effectively through uncertainty-aware predictions critical for trustworthy AI in safety-critical applications.

Abstract: Roadway traffic accidents represent a global health crisis, responsible for over a million deaths annually and costing many countries up to 3% of their GDP. Traditional traffic safety studies often examine risk factors in isolation, overlooking the spatial complexity and contextual interactions inherent in the built environment. Furthermore, conventional neural network-based risk estimators typically generate point estimates without conveying model uncertainty, limiting their utility in critical decision-making. To address these shortcomings, we introduce a novel geospatial deep learning framework that leverages satellite imagery as a comprehensive spatial input. This approach enables the model to capture the nuanced spatial patterns and embedded environmental risk factors that contribute to fatal crash risks. Rather than producing a single deterministic output, our model estimates a full Beta probability distribution over fatal crash risk, yielding accurate and uncertainty-aware predictions, a critical feature for trustworthy AI in safety-critical applications. Our model outperforms baselines by achieving a 17-23% improvement in recall, a key metric for flagging potential dangers, while delivering superior calibration. By providing reliable and interpretable risk assessments from satellite imagery alone, our method enables safer autonomous navigation and offers a highly scalable tool for urban planners and policymakers to enhance roadway safety equitably and cost-effectively.
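A short sketch shows how a predicted Beta distribution carries both a risk estimate and its uncertainty. The mean and variance formulas are the standard ones for Beta(alpha, beta). Training such a model with a negative log-likelihood loss is an assumption, since the abstract does not name the loss. The function names are hypothetical.

```python
import math

def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, var

def beta_nll(y, alpha, beta):
    """Negative log-likelihood of a risk value y in (0, 1) under Beta(alpha, beta)."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return log_norm - (alpha - 1) * math.log(y) - (beta - 1) * math.log(1 - y)
```

Beta(2, 2) and Beta(20, 20) share the same mean risk of 0.5, but the second is far more concentrated. A single point estimate cannot express that difference, which is the limitation the abstract raises.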

[93] Learning to Restore Multi-Degraded Images via Ingredient Decoupling and Task-Aware Path Adaptation

Hu Gao, Xiaoning Lei, Ying Zhang, Xichen Xu, Guannan Jiang, Lizhuang Ma

Main category: cs.CV

TL;DR: Proposes IMDNet, an adaptive multi-degradation image restoration network that uses decoupled degradation representations to dynamically select optimal restoration paths for handling multiple coexisting degradations like rain, noise, and haze.

Details

Motivation: Most existing image restoration methods focus on single degradation types, but real-world images often suffer from multiple coexisting degradations, limiting practical effectiveness.

Method: Uses degradation ingredient decoupling block (DIDBlock) to separate degradation ingredients statistically by integrating spatial and frequency domain information, fusion block (FBlock) to integrate degradation information, and task adaptation block (TABlock) to dynamically activate/fuse functional branches based on multi-degradation representation.

Result: Extensive experiments show superior performance on multi-degradation restoration while maintaining strong competitiveness on single-degradation tasks.

Conclusion: IMDNet provides an effective solution for handling multiple coexisting degradations in real-world images through adaptive path selection guided by decoupled degradation representations.

Abstract: Image restoration (IR) aims to recover clean images from degraded observations. Despite remarkable progress, most existing methods focus on a single degradation type, whereas real-world images often suffer from multiple coexisting degradations, such as rain, noise, and haze in a single image, which limits the practical effectiveness of these methods. In this paper, we propose an adaptive multi-degradation image restoration network that reconstructs images by leveraging decoupled representations of degradation ingredients to guide path selection. Specifically, we design a degradation ingredient decoupling block (DIDBlock) in the encoder to separate degradation ingredients statistically by integrating spatial and frequency domain information, enhancing the recognition of multiple degradation types and making their feature representations independent. In addition, we present a fusion block (FBlock) to integrate degradation information across all levels using learnable matrices. In the decoder, we further introduce a task adaptation block (TABlock) that dynamically activates or fuses functional branches based on the multi-degradation representation, flexibly selecting optimal restoration paths under diverse degradation conditions. The resulting tightly integrated architecture, termed IMDNet, is extensively validated through experiments, showing superior performance on multi-degradation restoration while maintaining strong competitiveness on single-degradation tasks.

[94] A benchmark multimodal oro-dental dataset for large vision-language models

Haoxin Lv, Ijazul Haq, Jin Du, Jiaxin Ma, Binnian Zhu, Xiaobing Dang, Chaoan Liang, Ruxu Du, Yingjie Zhang, Muhammad Saqib

Main category: cs.CV

TL;DR: A comprehensive multimodal dental dataset with 8775 checkups from 4800 patients, including images and text records, used to fine-tune vision-language models for dental anomaly classification and diagnostic report generation.

Details

Motivation: To advance AI in oral healthcare by providing a large-scale multimodal dataset that captures clinical complexity, addressing the lack of comprehensive dental datasets for AI research.

Method: Collected 8775 dental checkups over 8 years with 50K intraoral images, 8056 radiographs, and detailed text records. Fine-tuned Qwen-VL 3B and 7B models on two tasks: classifying six oro-dental anomalies and generating diagnostic reports from multimodal inputs.

Result: Fine-tuned models achieved substantial gains over base models and GPT-4o, validating the dataset’s effectiveness for advancing AI-driven dental healthcare solutions.

Conclusion: The publicly available dataset provides an essential resource for future AI dentistry research, demonstrating significant improvements in dental AI applications through multimodal learning.

Abstract: The advancement of artificial intelligence in oral healthcare relies on the availability of large-scale multimodal datasets that capture the complexity of clinical practice. In this paper, we present a comprehensive multimodal dataset, comprising 8775 dental checkups from 4800 patients collected over eight years (2018-2025), with patients ranging from 10 to 90 years of age. The dataset includes 50000 intraoral images, 8056 radiographs, and detailed textual records, including diagnoses, treatment plans, and follow-up notes. The data were collected under standard ethical guidelines and annotated for benchmarking. To demonstrate its utility, we fine-tuned state-of-the-art large vision-language models, Qwen-VL 3B and 7B, and evaluated them on two tasks: classification of six oro-dental anomalies and generation of complete diagnostic reports from multimodal inputs. We compared the fine-tuned models with their base counterparts and GPT-4o. The fine-tuned models achieved substantial gains over these baselines, validating the dataset and underscoring its effectiveness in advancing AI-driven oro-dental healthcare solutions. The dataset is publicly available, providing an essential resource for future research in AI dentistry.

[95] DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning

Tharindu Fernando, Clinton Fookes, Sridha Sridharan

Main category: cs.CV

TL;DR: A novel deep learning framework using high-dimensional latent space representations and Multi-Agent Adversarial Reinforcement Learning (MAARL) to create robust and adaptive watermarks for proactive deepfake detection, achieving significant performance improvements over state-of-the-art methods.

Details

Motivation: Address the limitations of existing deepfake detectors that struggle with generalization to new deepfake types and the challenge of balancing robustness against benign distortions with sensitivity to malicious tampering in proactive watermarking approaches.

Method: Developed a learnable watermark embedder operating in latent space to capture high-level image semantics, combined with MAARL paradigm where a watermarking agent interacts with adversarial attacker agents simulating various image manipulations to optimize robustness-fragility balance.

Result: Achieved improvements of over 4.5% on CelebA and more than 5.3% on CelebA-HQ benchmarks under challenging manipulation scenarios, consistently outperforming state-of-the-art approaches.

Conclusion: The proposed framework successfully addresses the robustness-fragility trade-off in proactive deepfake detection through latent space watermarking and adversarial reinforcement learning, demonstrating superior performance in identifying synthetic media.

Abstract: Rapid advances in generative AI have led to increasingly realistic deepfakes, posing growing challenges for law enforcement and public trust. Existing passive deepfake detectors struggle to keep pace, largely due to their dependence on specific forgery artifacts, which limits their ability to generalize to new deepfake types. Proactive deepfake detection using watermarks has emerged to address the challenge of identifying high-quality synthetic media. However, these methods often struggle to balance robustness against benign distortions with sensitivity to malicious tampering. This paper introduces a novel deep learning framework that harnesses high-dimensional latent space representations and the Multi-Agent Adversarial Reinforcement Learning (MAARL) paradigm to develop a robust and adaptive watermarking approach. Specifically, we develop a learnable watermark embedder that operates in the latent space, capturing high-level image semantics, while offering precise control over message encoding and extraction. The MAARL paradigm empowers the learnable watermarking agent to pursue an optimal balance between robustness and fragility by interacting with a dynamic curriculum of benign and malicious image manipulations simulated by an adversarial attacker agent. Comprehensive evaluations on the CelebA and CelebA-HQ benchmarks reveal that our method consistently outperforms state-of-the-art approaches, achieving improvements of over 4.5% on CelebA and more than 5.3% on CelebA-HQ under challenging manipulation scenarios.

[96] CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting

Hexu Zhao, Xiwen Min, Xiaoteng Liu, Moonjun Gong, Yiming Li, Ang Li, Saining Xie, Jinyang Li, Aurojit Panda

Main category: cs.CV

TL;DR: CLM enables 3D Gaussian Splatting to render large scenes on consumer GPUs by offloading Gaussians to CPU memory and using pipelining to overlap communication and computation.

Details

Motivation: 3D Gaussian Splatting has high memory requirements that exceed GPU capacity for large scenes, limiting its practical use on consumer hardware.

Method: Offloading Gaussians to CPU memory with intelligent loading, pipelining GPU-CPU communication with computation, and reducing communication volume through access pattern analysis.

Result: Successfully renders scenes with 100 million Gaussians on a single RTX4090 GPU while maintaining state-of-the-art reconstruction quality.

Conclusion: CLM makes large-scale 3DGS rendering feasible on consumer-grade GPUs through efficient memory management and communication optimization.

Abstract: 3D Gaussian Splatting (3DGS) is an increasingly popular novel view synthesis approach due to its fast rendering time and high-quality output. However, scaling 3DGS to large (or intricate) scenes is challenging due to its large memory requirement, which exceeds most GPUs’ memory capacity. In this paper, we describe CLM, a system that allows 3DGS to render large scenes using a single consumer-grade GPU, e.g., RTX4090. It does so by offloading Gaussians to CPU memory, and loading them into GPU memory only when necessary. To reduce performance and communication overheads, CLM uses a novel offloading strategy that exploits observations about 3DGS’s memory access pattern for pipelining, and thus overlaps GPU-to-CPU communication, GPU computation and CPU computation. Furthermore, we also exploit observations about the access pattern to reduce communication volume. Our evaluation shows that the resulting implementation can render a large scene that requires 100 million Gaussians on a single RTX4090 and achieve state-of-the-art reconstruction quality.
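The benefit of overlapping transfers with computation can be estimated with a simple timing model. The sketch below assumes synchronized double buffering, where the load for batch i+1 runs while batch i is being computed and the next step waits for both to finish. It is a back-of-the-envelope model with hypothetical function names, not a description of CLM's actual scheduler.

```python
def serial_time(load, compute):
    """Total time when every transfer finishes before its computation starts."""
    return sum(load) + sum(compute)

def pipelined_time(load, compute):
    """Total time when the load of batch i+1 overlaps the compute of batch i."""
    n = len(load)
    if n == 0:
        return 0
    total = load[0]  # the first transfer cannot be hidden
    for i in range(n - 1):
        total += max(compute[i], load[i + 1])
    return total + compute[-1]  # the last compute has nothing left to overlap
```

With three batches that each take 2 time units to load and 3 to compute, the serial schedule takes 15 units and the pipelined one takes 11. Once the per-batch transfer is shorter than the compute, only the first load remains exposed.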

[97] Pattern-Aware Diffusion Synthesis of fMRI/dMRI with Tissue and Microstructural Refinement

Xiongri Shen, Jiaqi Wang, Yi Zhong, Zhenxi Song, Leilei Zhao, Yichen Wei, Lingyan Liang, Shuqiang Wang, Baiying Lei, Demao Deng, Zhiguo Zhang

Main category: cs.CV

TL;DR: PDS is a novel method for synthesizing missing fMRI and dMRI modalities using a pattern-aware dual-modal 3D diffusion framework with tissue refinement, achieving state-of-the-art performance in neurodegenerative disease diagnosis.

Details

Motivation: Missing MRI modalities (fMRI and dMRI) pose a major barrier to clinical use for neurodegenerative disease studies. Existing GAN and diffusion models struggle with fMRI-dMRI synthesis due to significant signal differences and inadequate integration of disease-related neuroanatomical patterns.

Method: Proposes PDS with two key innovations: (1) pattern-aware dual-modal 3D diffusion framework for cross-modality learning, and (2) tissue refinement network integrated with efficient microstructure refinement to maintain structural fidelity.

Result: Achieves state-of-the-art results on OASIS-3, ADNI, and in-house datasets: PSNR/SSIM of 29.83 dB/90.84% for fMRI synthesis and 30.00 dB/77.55% for dMRI synthesis. Clinical validation shows 67.92%/66.02%/64.15% accuracy for NC vs. MCI vs. AD classification.

Conclusion: PDS effectively addresses fMRI-dMRI synthesis challenges and demonstrates strong diagnostic performance in neurodegenerative disease classification, making synthesized data clinically valuable.

Abstract: Magnetic resonance imaging (MRI), especially functional MRI (fMRI) and diffusion MRI (dMRI), is essential for studying neurodegenerative diseases. However, missing modalities pose a major barrier to their clinical use. Although GAN- and diffusion model-based approaches have shown some promise in modality completion, they remain limited in fMRI-dMRI synthesis due to (1) significant BOLD vs. diffusion-weighted signal differences between fMRI and dMRI in time/gradient axis, and (2) inadequate integration of disease-related neuroanatomical patterns during generation. To address these challenges, we propose PDS, introducing two key innovations: (1) a pattern-aware dual-modal 3D diffusion framework for cross-modality learning, and (2) a tissue refinement network integrated with an efficient microstructure refinement to maintain structural fidelity and fine details. Evaluated on OASIS-3, ADNI, and in-house datasets, our method achieves state-of-the-art results, with PSNR/SSIM scores of 29.83 dB/90.84% for fMRI synthesis (+1.54 dB/+4.12% over baselines) and 30.00 dB/77.55% for dMRI synthesis (+1.02 dB/+2.2%). In clinical validation, the synthesized data show strong diagnostic performance, achieving 67.92%/66.02%/64.15% accuracy (NC vs. MCI vs. AD) in hybrid real-synthetic experiments. Code is available at https://github.com/SXR3015/PDS.
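For readers unfamiliar with the PSNR figures quoted above, the standard definition is PSNR = 10 log10(MAX^2 / MSE). The sketch below computes it on flat lists of intensities; the function name and the `max_value` default are assumptions for illustration, and the paper's own evaluation code may normalize volumes differently.

```python
import math


def psnr(reference, synthesized, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    n = len(reference)
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / n
    if mse == 0:
        return float("inf")  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```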

[98] Learning Fourier shapes to probe the geometric world of deep neural networks

Jian Wang, Yixing Yong, Haixia Bi, Lijun He, Fan Li

Main category: cs.CV

TL;DR: This paper introduces a framework to probe DNNs’ geometric understanding using optimized shapes as semantic carriers, interpretability tools, and adversarial examples.

Details

Motivation: Deep neural networks have focused heavily on texture while neglecting geometric understanding. The authors aim to investigate how DNNs process and interpret shape information.

Method: An end-to-end differentiable framework combining Fourier series for shape parameterization, winding number-based pixel mapping, and signal energy constraints for optimization efficiency and physical plausibility.

Result: Optimized shapes can generate high-confidence classifications, serve as precise interpretability tools to isolate salient regions, and create effective adversarial examples for downstream visual tasks.

Conclusion: The work provides a versatile framework for exploring geometric understanding in DNNs and opens new frontiers for challenging and understanding machine perception.

Abstract: While both shape and texture are fundamental to visual recognition, research on deep neural networks (DNNs) has predominantly focused on the latter, leaving their geometric understanding poorly probed. Here, we show: first, that optimized shapes can act as potent semantic carriers, generating high-confidence classifications from inputs defined purely by their geometry; second, that they are high-fidelity interpretability tools that precisely isolate a model’s salient regions; and third, that they constitute a new, generalizable adversarial paradigm capable of deceiving downstream visual tasks. This is achieved through an end-to-end differentiable framework that unifies a powerful Fourier series to parameterize arbitrary shapes, a winding number-based mapping to translate them into the pixel grid required by DNNs, and signal energy constraints that enhance optimization efficiency while ensuring physically plausible shapes. Our work provides a versatile framework for probing the geometric world of DNNs and opens new frontiers for challenging and understanding machine perception.
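The two geometric ingredients named in the abstract, a Fourier-series curve and a winding-number test that maps it onto a pixel grid, can be sketched as follows. This is a hard, non-differentiable illustration only: the coefficient layout, function names, grid size, and extent are assumptions, and the paper's framework is described as end-to-end differentiable, which this sketch is not.

```python
import math


def fourier_curve(coeffs, n_points=200):
    """Sample a closed curve from Fourier coefficients.

    coeffs: list of (ax, bx, ay, by) for harmonics k = 1, 2, ...
    (a hypothetical layout chosen for this sketch).
    """
    pts = []
    for i in range(n_points):
        t = 2.0 * math.pi * i / n_points
        x = y = 0.0
        for k, (ax, bx, ay, by) in enumerate(coeffs, start=1):
            x += ax * math.cos(k * t) + bx * math.sin(k * t)
            y += ay * math.cos(k * t) + by * math.sin(k * t)
        pts.append((x, y))
    return pts


def winding_number(px, py, pts):
    """Sum the signed angles subtended by each polygon edge at (px, py)."""
    total = 0.0
    n = len(pts)
    for i in range(n):
        x0, y0 = pts[i][0] - px, pts[i][1] - py
        x1, y1 = pts[(i + 1) % n][0] - px, pts[(i + 1) % n][1] - py
        total += math.atan2(x0 * y1 - x1 * y0, x0 * x1 + y0 * y1)
    return round(total / (2.0 * math.pi))


def rasterize(pts, size=16, extent=1.5):
    """Binary mask over [-extent, extent]^2: 1 where the winding number is nonzero."""
    mask = []
    for r in range(size):
        y = extent - 2.0 * extent * (r + 0.5) / size
        row = []
        for c in range(size):
            x = -extent + 2.0 * extent * (c + 0.5) / size
            row.append(1 if winding_number(x, y, pts) != 0 else 0)
        mask.append(row)
    return mask


# A single harmonic with x = cos t, y = sin t gives the unit circle.
circle = fourier_curve([(1.0, 0.0, 0.0, 1.0)])
mask = rasterize(circle)
```

Adding higher harmonics to `coeffs` deforms the circle into more complex outlines, while the rasterization step stays unchanged.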

[99] Challenges in 3D Data Synthesis for Training Neural Networks on Topological Features

Dylan Peek, Matthew P. Skerritt, Siddharth Pritam, Stephan Chalup

Main category: cs.CV

TL;DR: A novel approach for generating labeled 3D datasets using the Repulsive Surface algorithm to address the lack of labeled data for supervised learning in Topological Data Analysis, enabling training of neural network estimators for topological invariants like hole count.

Details

Motivation: Traditional TDA methods like persistent homology are computationally demanding, and there's a lack of labeled 3D data with appropriate class distributions and diversity for supervised learning in TDA tasks.

Method: Systematically generate labeled 3D datasets using the Repulsive Surface algorithm to control topological invariants (e.g., hole count), then train a genus estimator network using a 3D convolutional transformer architecture on the synthetic dataset.

Result: The dataset provides varied geometry with topological labeling suitable for training neural network estimators. The trained genus estimator shows decreased accuracy as deformations increase, highlighting the importance of both topological and geometric complexity.

Conclusion: The generated dataset fills a gap in labeled 3D datasets for TDA, enabling better training and evaluation of models and techniques, while revealing that geometric complexity alongside topological complexity affects estimator generalization.

Abstract: Topological Data Analysis (TDA) involves techniques of analyzing the underlying structure and connectivity of data. However, traditional methods like persistent homology can be computationally demanding, motivating the development of neural network-based estimators capable of reducing computational overhead and inference time. A key barrier to advancing these methods is the lack of labeled 3D data with class distributions and diversity tailored specifically for supervised learning in TDA tasks. To address this, we introduce a novel approach for systematically generating labeled 3D datasets using the Repulsive Surface algorithm, allowing control over topological invariants, such as hole count. The resulting dataset offers varied geometry with topological labeling, making it suitable for training and benchmarking neural network estimators. This paper uses a synthetic 3D dataset to train a genus estimator network, created using a 3D convolutional transformer architecture. An observed decrease in accuracy as deformations increase highlights the role of not just topological complexity, but also geometric complexity, when training generalized estimators. This dataset fills a gap in labeled 3D datasets and generation for training and evaluating models and techniques for TDA.

[100] Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Vijay Kamarshi, Andrea Fanelli, Furong Huang

Main category: cs.CV

TL;DR: The paper identifies a language bias in LVLM architectures and proposes refining text embeddings with average-pooled visual features to improve visual grounding and reduce hallucinations.

Details

Motivation: To address the inherent bias in current LVLM architectures toward language modality caused by simply appending visual embeddings to text sequences, which leads to poor visual grounding and hallucinations.

Method: Propose a simple method that refines textual embeddings by integrating average-pooled visual features, providing a straightforward and efficient way to incorporate visual information.

Result: The approach demonstrably improves visual grounding and significantly reduces hallucinations on established benchmarks.

Conclusion: Refining textual embeddings with visual information effectively mitigates modality imbalance issues, though more sophisticated fusion methods could further enhance performance.

Abstract: In this work, we identify an inherent bias in prevailing LVLM architectures toward the language modality, largely resulting from the common practice of simply appending visual embeddings to the input text sequence. To address this, we propose a simple yet effective method that refines textual embeddings by integrating average-pooled visual features. Our approach demonstrably improves visual grounding and significantly reduces hallucinations on established benchmarks. While average pooling offers a straightforward, robust, and efficient means of incorporating visual information, we believe that more sophisticated fusion methods could further enhance visual grounding and cross-modal alignment. Given that the primary focus of this work is to highlight the modality imbalance and its impact on hallucinations – and to show that refining textual embeddings with visual information mitigates this issue – we leave exploration of advanced fusion strategies for future work.
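A minimal sketch of the idea, assuming the simplest possible fusion: the abstract does not specify how the pooled visual vector is integrated into the text embeddings, so the additive form and the `alpha` weight below are assumptions, as are the function names.

```python
def average_pool(visual_tokens):
    """Mean over all visual tokens, giving one vector of the same width."""
    n = len(visual_tokens)
    dim = len(visual_tokens[0])
    return [sum(tok[d] for tok in visual_tokens) / n for d in range(dim)]


def refine_text_embeddings(text_tokens, visual_tokens, alpha=0.5):
    """Add the pooled visual vector to every text embedding.

    Additive fusion with weight `alpha` is an assumption of this sketch,
    not the paper's stated formulation.
    """
    pooled = average_pool(visual_tokens)
    return [[t[d] + alpha * pooled[d] for d in range(len(t))] for t in text_tokens]
```

The point of the sketch is only that every text position receives the same global visual summary, rather than relying on attention over appended visual tokens alone.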

[101] Dynamic Residual Encoding with Slide-Level Contrastive Learning for End-to-End Whole Slide Image Representation

Jing Jin, Xu Liu, Te Gao, Zhihong Shi, Yixiong Liang, Ruiqing Zheng, Hulin Kuang, Min Zeng, Shichao Kan

Main category: cs.CV

TL;DR: Proposes DRE-SLCL method for end-to-end WSI representation using dynamic residual encoding with slide-level contrastive learning to handle gigapixel slides with thousands of tiles.

Details

Motivation: Training end-to-end WSI representation models is challenging due to GPU limitations when processing thousands of image tiles from gigapixel slides in a single mini-batch.

Method: Uses memory bank to store tile features across all WSIs, samples tiles during training, combines sampled features with memory bank features using residual encoding, and applies slide-level contrastive learning with histopathology reports.

Result: Experiments on cancer subtyping, cancer recognition, and mutation prediction tasks demonstrated the effectiveness of the proposed DRE-SLCL method.

Conclusion: The DRE-SLCL method successfully addresses the computational challenges of end-to-end WSI representation learning and proves effective for various cancer-related tasks.

Abstract: Whole Slide Image (WSI) representation is critical for cancer subtyping, cancer recognition and mutation prediction. Training an end-to-end WSI representation model poses significant challenges, as a standard gigapixel slide can contain tens of thousands of image tiles, making it difficult to compute gradients of all tiles in a single mini-batch due to current GPU limitations. To address this challenge, we propose a method of dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end WSI representation. Our approach utilizes a memory bank to store the features of tiles across all WSIs in the dataset. During training, a mini-batch usually contains multiple WSIs. For each WSI in the batch, a subset of tiles is randomly sampled and their features are computed using a tile encoder. Then, additional tile features from the same WSI are selected from the memory bank. The representation of each individual WSI is generated using a residual encoding technique that incorporates both the sampled features and those retrieved from the memory bank. Finally, the slide-level contrastive loss is computed based on the representations and histopathology reports of the WSIs within the mini-batch. Experiments conducted on cancer subtyping, cancer recognition, and mutation prediction tasks demonstrated the effectiveness of the proposed DRE-SLCL method.
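The sample-then-retrieve bookkeeping described above can be sketched as follows. The class and function names are hypothetical, and mean pooling is used purely as a stand-in for the paper's residual encoding, whose details are not given in the abstract.

```python
import random


class TileMemoryBank:
    """Keeps the most recent feature vector for each (slide_id, tile_id)."""

    def __init__(self):
        self.bank = {}

    def update(self, slide_id, tile_id, feature):
        self.bank.setdefault(slide_id, {})[tile_id] = list(feature)

    def fetch_others(self, slide_id, exclude):
        tiles = self.bank.get(slide_id, {})
        return [f for t, f in tiles.items() if t not in exclude]


def slide_representation(slide_id, tile_ids, encode, bank, n_sample, rng):
    """Encode a random subset of tiles and combine with cached features.

    Only the freshly encoded tiles would carry gradients in real training;
    cached features are reused as constants.
    """
    sampled = rng.sample(tile_ids, n_sample)
    fresh = [encode(t) for t in sampled]
    for t, f in zip(sampled, fresh):
        bank.update(slide_id, t, f)
    cached = bank.fetch_others(slide_id, set(sampled))
    feats = fresh + cached
    dim = len(feats[0])
    # Mean pooling: an assumed placeholder for residual encoding.
    return [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
```

Passing a seeded `random.Random` instance as `rng` keeps the tile sampling reproducible.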

[102] Pressure2Motion: Hierarchical Motion Synthesis from Ground Pressure with Text Guidance

Zhengxuan Li, Qinhui Yang, Yiyu Zhuang, Chuan Guo, Xinxin Zuo, Xiaoxiao Long, Yao Yao, Xun Cao, Qiu Shen, Hao Zhu

Main category: cs.CV

TL;DR: Pressure2Motion is a novel motion capture system that generates human motion from ground pressure sequences and text prompts, eliminating the need for cameras or wearable devices.

Details

Motivation: To enable privacy-preserving, low-light, and low-cost motion capture without specialized lighting, cameras, or wearable devices, addressing the ill-posed nature of mapping pressure signals to full-body motion.

Method: Uses a dual-level feature extractor to interpret pressure data, followed by a hierarchical diffusion model that discerns movement trajectories and posture adjustments, leveraging both physical pressure cues and semantic text guidance.

Result: Generates high-fidelity, physically plausible motions and establishes a new state-of-the-art for this task, with the MPL benchmark being the first benchmark for pressure-to-motion generation.

Conclusion: Pressure2Motion is a pioneering work that successfully combines pressure data and linguistic priors for motion generation, offering a practical solution for privacy-preserving and cost-effective motion capture.

Abstract: We present Pressure2Motion, a novel motion capture algorithm that synthesizes human motion from a ground pressure sequence and text prompt. It eliminates the need for specialized lighting setups, cameras, or wearable devices, making it suitable for privacy-preserving, low-light, and low-cost motion capture scenarios. Such a task is severely ill-posed due to the indeterminate nature of the pressure signals to full-body motion. To address this issue, we introduce Pressure2Motion, a generative model that leverages pressure features as input and utilizes a text prompt as a high-level guiding constraint. Specifically, our model utilizes a dual-level feature extractor that accurately interprets pressure data, followed by a hierarchical diffusion model that discerns broad-scale movement trajectories and subtle posture adjustments. Both the physical cues gained from the pressure sequence and the semantic guidance derived from descriptive texts are leveraged to guide the motion generation with precision. To the best of our knowledge, Pressure2Motion is a pioneering work in leveraging both pressure data and linguistic priors for motion generation, and the established MPL benchmark is the first benchmark for this task. Experiments show our method generates high-fidelity, physically plausible motions, establishing a new state-of-the-art for this task. The codes and benchmarks will be publicly released upon publication.

[103] Medical Referring Image Segmentation via Next-Token Mask Prediction

Xinyu Chen, Yiran Wang, Gaoyang Pang, Jiafu Hao, Chentao Yue, Luping Zhou, Yonghui Li

Main category: cs.CV

TL;DR: NTP-MRISeg reformulates medical referring image segmentation as an autoregressive next-token prediction task using unified multimodal sequences, achieving state-of-the-art performance with simplified architecture.

Details

Motivation: To simplify complex multimodal fusion designs and multi-stage decoders in existing MRIS approaches by creating a unified end-to-end framework.

Method: Formulates MRIS as autoregressive next-token prediction over tokenized image, text, and mask sequences. Introduces Next-k Token Prediction, Token-level Contrastive Learning, and memory-based Hard Error Token optimization.

Result: Achieves new state-of-the-art performance on QaTa-COV19 and MosMedData+ datasets, demonstrating superior segmentation accuracy.

Conclusion: NTP-MRISeg provides a streamlined and effective alternative to traditional MRIS pipelines with simplified architecture and improved performance.

Abstract: Medical Referring Image Segmentation (MRIS) involves segmenting target regions in medical images based on natural language descriptions. While achieving promising results, recent approaches usually involve complex design of multimodal fusion or multi-stage decoders. In this work, we propose NTP-MRISeg, a novel framework that reformulates MRIS as an autoregressive next-token prediction task over a unified multimodal sequence of tokenized image, text, and mask representations. This formulation streamlines model design by eliminating the need for modality-specific fusion and external segmentation models, and supports a unified architecture for end-to-end training. It also enables the use of pretrained tokenizers from emerging large-scale multimodal models, enhancing generalization and adaptability. More importantly, to address challenges under this formulation, such as exposure bias, long-tail token distributions, and fine-grained lesion edges, we propose three novel strategies: (1) a Next-k Token Prediction (NkTP) scheme to reduce cumulative prediction errors, (2) Token-level Contrastive Learning (TCL) to enhance boundary sensitivity and mitigate long-tail distribution effects, and (3) a memory-based Hard Error Token (HET) optimization strategy that emphasizes difficult tokens during training. Extensive experiments on the QaTa-COV19 and MosMedData+ datasets demonstrate that NTP-MRISeg achieves new state-of-the-art performance, offering a streamlined and effective alternative to traditional MRIS pipelines.

[104] No Pose Estimation? No Problem: Pose-Agnostic and Instance-Aware Test-Time Adaptation for Monocular Depth Estimation

Mingyu Sung, Hyeonmin Choe, Il-Min Kim, Sangseok Yun, Jae Mo Kang

Main category: cs.CV

TL;DR: PITTA is a novel test-time adaptation framework for monocular depth estimation that works without camera pose information and uses instance-aware masking for dynamic objects.

Details

Motivation: Existing test-time adaptation methods for monocular depth estimation are ineffective in diverse and dynamic environments, requiring camera pose information which limits practical deployment.

Method: Pose-agnostic TTA paradigm without camera pose information, instance-aware masking using panoptic segmentation to remove dynamic objects, and edge extraction from input images and depth maps.

Result: Extensive experiments on DrivingStereo and Waymo datasets show PITTA surpasses state-of-the-art methods with remarkable performance improvements in varying environmental conditions.

Conclusion: PITTA provides an effective pose-agnostic solution for test-time adaptation in monocular depth estimation, enabling robust performance in diverse real-world scenarios without requiring camera pose information.

Abstract: Monocular depth estimation (MDE), inferring pixel-level depths in single RGB images from a monocular camera, plays a crucial role in a variety of AI applications demanding a three-dimensional (3D) topographical scene. In real-world scenarios, MDE models often need to be deployed in environments whose conditions differ from those seen during training. Test-time (domain) adaptation (TTA) is one of the compelling and practical approaches to address the issue. Although there have been notable advancements in TTA for MDE, particularly in a self-supervised manner, existing methods are still ineffective and problematic when applied to diverse and dynamic environments. To break through this challenge, we propose a novel and high-performing TTA framework for MDE, named PITTA. Our approach incorporates two key innovative strategies: (i) a pose-agnostic TTA paradigm for MDE and (ii) instance-aware image masking. Specifically, PITTA enables highly effective TTA on a pretrained MDE network in a pose-agnostic manner without resorting to any camera pose information. In addition, our instance-aware masking strategy extracts instance-wise masks for dynamic objects (e.g., vehicles and pedestrians) from a segmentation mask produced by a pretrained panoptic segmentation network, by removing static objects including background components. To further boost performance, we also present a simple yet effective edge extraction methodology for the input image (i.e., a single monocular image) and depth map. Extensive experimental evaluations on DrivingStereo and Waymo datasets with varying environmental conditions demonstrate that our proposed framework, PITTA, surpasses the existing state-of-the-art techniques with remarkable performance improvements in MDE during TTA.

[105] Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach

Yuanxiang Huangfu, Chaochao Wang, Weilei Wang

Main category: cs.CV

TL;DR: Role-SynthCLIP is a data synthesis framework that uses multi-perspective role-playing prompts with MLLMs to generate semantically diverse captions, improving CLIP model performance with fewer training pairs.

Details

Motivation: Existing synthetic data generation methods focus on increasing data volume but result in limited semantic diversity and redundant/shallow captions, which limits CLIP model effectiveness.

Method: Proposes Role-SynthCLIP framework that uses multi-perspective role-playing prompts (e.g., compositional analyst, interpreter of image context) to guide MLLMs in generating semantically diverse captions from distinct viewpoints.

Result: CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves 64.1% Recall@1 on MS COCO, surpassing the best existing synthetic baseline (trained on 5M pairs) by 2.8 percentage points.

Conclusion: The method effectively enhances semantic diversity and fine-grained image-text alignment of synthetic pairs, improving caption expressiveness and accuracy while maintaining the same number of image-text pairs.

Abstract: The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on increasing data volume, such emphasis often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, thereby improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models are released at https://github.com/huangfu170/Role-SynthCLIP.
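As a reminder of what the Recall@1 figure measures, the sketch below computes Recall@k for image-to-text retrieval from a similarity matrix. It assumes exactly one correct caption per image, located at the same index, which is a simplification (MS COCO provides several captions per image); the function name is illustrative.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching item ranks in the top k.

    similarity[i][j] is the score between image i and caption j;
    caption i is assumed to be the single correct match for image i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)
```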

[106] SurgiATM: A Physics-Guided Plug-and-Play Model for Deep Learning-Based Smoke Removal in Laparoscopic Surgery

Mingyu Sheng, Jianan Fan, Dongnan Liu, Guoyan Zheng, Ron Kikinis, Weidong Cai

Main category: cs.CV

TL;DR: SurgiATM is a lightweight plug-and-play module for surgical smoke removal that bridges physics-based atmospheric models with deep learning, enhancing existing desmoking methods without adding trainable parameters.

Details

Motivation: Surgical smoke during laparoscopic procedures degrades visual quality, increasing surgical error risks and hindering clinical decision-making and computer-assisted analysis.

Method: Proposes Surgical Atmospheric Model (SurgiATM) that statistically combines physics-based atmospheric modeling with data-driven deep learning, requiring only two hyperparameters and no additional trainable weights.

Result: Extensive experiments on three surgical datasets with ten desmoking methods show SurgiATM reduces restoration errors and enhances generalizability across diverse procedures without adding computational overhead.

Conclusion: SurgiATM provides a convenient, low-cost, effective, and generalizable solution for surgical smoke removal that can be seamlessly integrated into existing surgical desmoking architectures.

Abstract: During laparoscopic surgery, smoke generated by tissue cauterization can significantly degrade the visual quality of endoscopic frames, increasing the risk of surgical errors and hindering both clinical decision-making and computer-assisted visual analysis. Consequently, removing surgical smoke is critical to ensuring patient safety and maintaining operative efficiency. In this study, we propose the Surgical Atmospheric Model (SurgiATM) for surgical smoke removal. SurgiATM statistically bridges a physics-based atmospheric model and data-driven deep learning models, combining the superior generalizability of the former with the high accuracy of the latter. Furthermore, SurgiATM is designed as a lightweight, plug-and-play module that can be seamlessly integrated into diverse surgical desmoking architectures to enhance their accuracy and stability, better meeting clinical requirements. It introduces only two hyperparameters and no additional trainable weights, preserving the original network architecture with minimal computational and modification overhead. We conduct extensive experiments on three public surgical datasets with ten desmoking methods, involving multiple network architectures and covering diverse procedures, including cholecystectomy, partial nephrectomy, and diaphragm dissection. The results demonstrate that incorporating SurgiATM commonly reduces the restoration errors of existing models and relatively enhances their generalizability, without adding any trainable layers or weights. This highlights the convenience, low cost, effectiveness, and generalizability of the proposed method. The code for SurgiATM is released at https://github.com/MingyuShengSMY/SurgiATM.

[107] Deep learning models are vulnerable, but adversarial examples are even more vulnerable

Jun Li, Yanwei Xu, Keran Li, Xiaoli Zhang

Main category: cs.CV

TL;DR: Adversarial examples are more sensitive to occlusion than clean samples, enabling detection via Sliding Window Mask-based Adversarial Example Detection (SWM-AED) with up to 96.5% accuracy.

Details

Motivation: To understand intrinsic differences between adversarial and clean samples for improving DNN robustness and detection against adversarial attacks.

Method: Used Sliding Mask Confidence Entropy (SMCE) to quantify model confidence fluctuation under occlusion on 1800+ CIFAR-10 images with 9 canonical attacks, then proposed SWM-AED detection method.

Result: Adversarial examples show significantly higher confidence volatility under occlusion; SWM-AED achieves over 62% accuracy in most cases and up to 96.5% across classifiers and attacks.

Conclusion: Occlusion sensitivity is a key characteristic of adversarial examples that can be leveraged for effective detection while avoiding catastrophic overfitting from adversarial training.

Abstract: Understanding intrinsic differences between adversarial examples and clean samples is key to enhancing DNN robustness and detection against adversarial attacks. This study first empirically finds that image-based adversarial examples are notably sensitive to occlusion. Controlled experiments on CIFAR-10 used nine canonical attacks (e.g., FGSM, PGD) to generate adversarial examples, paired with original samples for evaluation. We introduce Sliding Mask Confidence Entropy (SMCE) to quantify model confidence fluctuation under occlusion. Using 1800+ test images, SMCE calculations supported by Mask Entropy Field Maps and statistical distributions show adversarial examples have significantly higher confidence volatility under occlusion than originals. Based on this, we propose Sliding Window Mask-based Adversarial Example Detection (SWM-AED), which avoids catastrophic overfitting of conventional adversarial training. Evaluations across classifiers and attacks on CIFAR-10 demonstrate robust performance, with accuracy over 62% in most cases and up to 96.5%.
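The abstract does not give the formula for SMCE, so the following is only one plausible reading: slide an occluding window over the image, classify each occluded copy, and average the Shannon entropy of the softmax outputs. The window size, stride, fill value, and the toy classifier are all assumptions made for this sketch.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sliding_mask_entropy(image, classify, window=2, stride=2, fill=0.0):
    """Mean prediction entropy over all occluded copies of a 2D image.

    An assumed stand-in for SMCE, not the paper's exact definition.
    """
    h, w = len(image), len(image[0])
    scores = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            occluded = [row[:] for row in image]
            for r in range(top, top + window):
                for c in range(left, left + window):
                    occluded[r][c] = fill
            scores.append(entropy(softmax(classify(occluded))))
    return sum(scores) / len(scores)


def toy_classifier(image):
    """Hypothetical two-class model: logits are the left and right half sums."""
    half = len(image[0]) // 2
    left = sum(sum(row[:half]) for row in image)
    right = sum(sum(row[half:]) for row in image)
    return [left, right]
```

Under this reading, an input whose prediction stays confident under every mask position scores near zero, while one whose prediction becomes uncertain under occlusion scores higher.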

[108] A Dual-stage Prompt-driven Privacy-preserving Paradigm for Person Re-Identification

Ruolin Li, Min Liu, Yuan Bian, Zhaoyang Li, Yuzhen Li, Xueping Wang, Yaonan Wang

Main category: cs.CV

TL;DR: DPPP is a dual-stage framework that generates privacy-preserving virtual data for person re-identification using diffusion models and learns domain-invariant features through prompt-driven disentanglement.

Details

Motivation: Address privacy concerns in person Re-ID by creating virtual datasets that overcome limitations of existing game-engine datasets like complex construction and poor domain generalization.

Method: Stage 1: Generate diverse virtual data (GenePerson dataset) using multi-dimensional prompts. Stage 2: Use Prompt-driven Disentanglement Mechanism with contrastive learning to separate style and content features for domain-invariant learning.

Result: Models trained on GenePerson with PDM achieve state-of-the-art generalization performance, outperforming models trained on both real and virtual Re-ID datasets.

Conclusion: The proposed DPPP paradigm effectively addresses privacy concerns while achieving superior domain generalization in person re-identification through virtual data generation and prompt-driven feature disentanglement.

Abstract: With growing concerns over data privacy, researchers have started using virtual data as an alternative to sensitive real-world images for training person re-identification (Re-ID) models. However, existing virtual datasets produced by game engines still face challenges such as complex construction and poor domain generalization, making them difficult to apply in real scenarios. To address these challenges, we propose a Dual-stage Prompt-driven Privacy-preserving Paradigm (DPPP). In the first stage, we generate rich prompts incorporating multi-dimensional attributes such as pedestrian appearance, illumination, and viewpoint that drive the diffusion model to synthesize diverse data end-to-end, building a large-scale virtual dataset named GenePerson with 130,519 images of 6,641 identities. In the second stage, we propose a Prompt-driven Disentanglement Mechanism (PDM) to learn domain-invariant generalization features. With the aid of contrastive learning, we employ two textual inversion networks to map images into pseudo-words representing style and content, respectively, thereby constructing style-disentangled content prompts to guide the model in learning domain-invariant content features at the image level. Experiments demonstrate that models trained on GenePerson with PDM achieve state-of-the-art generalization performance, surpassing those on popular real and virtual Re-ID datasets.

[109] Real-World Adverse Weather Image Restoration via Dual-Level Reinforcement Learning with High-Quality Cold Start

Fuyang Liu, Jiaqi Xu, Xiaowei Hu

Main category: cs.CV

TL;DR: A dual-level reinforcement learning framework for adverse weather visual perception that uses a physics-driven dataset for cold-start training and dynamically adapts to real-world conditions through local restoration optimization and global meta-controlling.

Details

Motivation: Existing vision models trained on synthetic data struggle to generalize to complex real-world weather degradations, requiring more adaptive approaches.

Method: Constructed HFLS-Weather dataset for cold-start training, then implemented dual-level RL: local level refines weather-specific restoration models via perturbation-driven quality optimization, global level uses meta-controller for dynamic model selection and execution order.

Result: Achieves state-of-the-art performance across diverse adverse weather scenarios and enables continuous adaptation to real-world conditions.

Conclusion: The proposed framework effectively addresses adverse weather perception challenges through physics-driven data and adaptive reinforcement learning, outperforming existing methods.

Abstract: Adverse weather severely impairs real-world visual perception, while existing vision models trained on synthetic data with fixed parameters struggle to generalize to complex degradations. To address this, we first construct HFLS-Weather, a physics-driven, high-fidelity dataset that simulates diverse weather phenomena, and then design a dual-level reinforcement learning framework initialized with HFLS-Weather for cold-start training. Within this framework, at the local level, weather-specific restoration models are refined through perturbation-driven image quality optimization, enabling reward-based learning without paired supervision; at the global level, a meta-controller dynamically orchestrates model selection and execution order according to scene degradation. This framework enables continuous adaptation to real-world conditions and achieves state-of-the-art performance across a wide range of adverse weather scenarios. Code is available at https://github.com/xxclfy/AgentRL-Real-Weather

[110] Early Alzheimer’s Disease Detection from Retinal OCT Images: A UK Biobank Study

Yasemin Turkan, F. Boray Tek, M. Serdar Nazlı, Öykü Eren

Main category: cs.CV

TL;DR: This study uses deep learning on raw OCT B-scan images for early Alzheimer’s disease detection, achieving modest AUC of 0.62 with ResNet-34, providing baseline for OCT-based AD prediction.

Details

Motivation: To explore direct classification of OCT B-scan images for early AD detection, moving beyond traditional segmented layer thickness measurements, as this represents the first application of deep learning to raw OCT B-scans for AD prediction.

Method: Fine-tuned multiple pretrained models (ImageNet-based networks and OCT-specific RETFound transformer) using UK Biobank data with subject-level cross-validation, age/sex/imaging instance matching, OCT-specific augmentation, and year-weighted loss function prioritizing cases diagnosed within 4 years.

Result: ResNet-34 achieved most stable results with AUC of 0.62 in the 4-year cohort, below clinical application threshold but explainability analyses confirmed localized structural differences in central macular subfield between AD and control groups.

Conclusion: Provides baseline for OCT-based AD prediction, highlights challenges of detecting subtle retinal biomarkers years before diagnosis, and points to need for larger datasets and multimodal approaches.

Abstract: Alterations in retinal layer thickness, measurable using Optical Coherence Tomography (OCT), have been associated with neurodegenerative diseases such as Alzheimer’s disease (AD). While previous studies have mainly focused on segmented layer thickness measurements, this study explored the direct classification of OCT B-scan images for the early detection of AD. To our knowledge, this is the first application of deep learning to raw OCT B-scans for AD prediction in the literature. Unlike conventional medical image classification tasks, early detection is more challenging than diagnosis because imaging precedes clinical diagnosis by several years. We fine-tuned and evaluated multiple pretrained models, including ImageNet-based networks and the OCT-specific RETFound transformer, using subject-level cross-validation datasets matched for age, sex, and imaging instances from the UK Biobank cohort. To reduce overfitting in this small, high-dimensional dataset, both standard and OCT-specific augmentation techniques were applied, along with a year-weighted loss function that prioritized cases diagnosed within four years of imaging. ResNet-34 produced the most stable results, achieving an AUC of 0.62 in the 4-year cohort. Although below the threshold for clinical application, our explainability analyses confirmed localized structural differences in the central macular subfield between the AD and control groups. These findings provide a baseline for OCT-based AD prediction, highlight the challenges of detecting subtle retinal biomarkers years before AD diagnosis, and point to the need for larger datasets and multimodal approaches.
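
As a rough illustration of the year-weighted loss mentioned in the abstract, the sketch below scales each sample's binary cross-entropy by a weight that depends on the time between imaging and diagnosis. The function name, the weight values (1.0 within the horizon, 0.5 beyond it), and the rule that controls receive full weight are assumptions made for illustration; the abstract does not specify the actual weighting scheme.

```python
import math

def year_weighted_bce(probs, labels, years_to_dx,
                      near_weight=1.0, far_weight=0.5, horizon=4.0):
    """Weighted binary cross-entropy (hypothetical weighting, not the paper's).

    probs: predicted AD probabilities in (0, 1)
    labels: 1 for AD, 0 for control
    years_to_dx: years from imaging to diagnosis (None for controls)
    """
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for p, y, yrs in zip(probs, labels, years_to_dx):
        # Assumed rule: controls and cases diagnosed within `horizon` years
        # get full weight; later diagnoses are down-weighted.
        w = near_weight if (yrs is None or yrs <= horizon) else far_weight
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += w * loss
        weight_sum += w
    return total / weight_sum
```

Normalizing by the sum of weights keeps the loss on the same scale as an unweighted mean, so changing the weights shifts emphasis between samples rather than rescaling the whole objective.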

[111] From Linear Probing to Joint-Weighted Token Hierarchy: A Foundation Model Bridging Global and Cellular Representations in Biomarker Detection

Jingsong Liu, Han Li, Nassir Navab, Peter J. Schüffler

Main category: cs.CV

TL;DR: JWTH is a pathology foundation model that integrates cell-level morphology with global patch embeddings using attention pooling, achieving superior performance in AI-based biomarker detection from H&E slides.

Details

Motivation: Most pathology foundation models rely on global patch-level embeddings and overlook important cell-level morphological information, limiting their biomarker detection capabilities.

Method: JWTH combines large-scale self-supervised pretraining with cell-centric post-tuning and attention pooling to fuse local and global tokens, creating a joint-weighted token hierarchy.

Result: Across four tasks involving four biomarkers and eight cohorts, JWTH achieves up to 8.3% higher balanced accuracy and 1.2% average improvement over prior pathology foundation models.

Conclusion: JWTH advances interpretable and robust AI-based biomarker detection in digital pathology by effectively integrating cell-level morphological information with global context.

Abstract: AI-based biomarkers can infer molecular features directly from hematoxylin & eosin (H&E) slides, yet most pathology foundation models (PFMs) rely on global patch-level embeddings and overlook cell-level morphology. We present a PFM model, JWTH (Joint-Weighted Token Hierarchy), which integrates large-scale self-supervised pretraining with cell-centric post-tuning and attention pooling to fuse local and global tokens. Across four tasks involving four biomarkers and eight cohorts, JWTH achieves up to 8.3% higher balanced accuracy and 1.2% average improvement over prior PFMs, advancing interpretable and robust AI-based biomarker detection in digital pathology.
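
To make the idea of attention pooling over local and global tokens concrete, here is a minimal generic sketch: each token is scored against a query vector, the scores are normalized with a softmax, and the tokens are averaged with those weights. This is not JWTH's implementation. The query would normally be learned, and the token values below are hypothetical placeholders.

```python
import math

def attention_pool(tokens, query):
    """Softmax-weighted average of token vectors (generic attention pooling)."""
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(tokens[0])
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens))
              for d in range(dim)]
    return pooled, weights

# Hypothetical inputs: one global patch token followed by two cell-level tokens.
global_token = [1.0, 0.0]
cell_tokens = [[0.0, 1.0], [0.0, 1.0]]
pooled, weights = attention_pool([global_token] + cell_tokens, query=[0.0, 0.0])
```

With an all-zero query every token receives the same weight, so the result is a plain average; a learned query would instead shift weight toward whichever tokens are most informative for the task.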

[112] SnowyLane: Robust Lane Detection on Snow-covered Rural Roads Using Infrastructural Elements

Jörg Gamerdinger, Benedict Wetzel, Patrick Schulz, Sven Teufel, Oliver Bringmann

Main category: cs.CV

TL;DR: Novel lane detection method for snow-covered roads that detects roadside delineators instead of lane markings, using Bezier curves to fit lane trajectories, with a new synthetic dataset called SnowyLane.

Details

Motivation: Lane detection in snow-covered environments is challenging due to frequent absence or occlusion of traditional lane markings, requiring alternative approaches for autonomous driving.

Method: Detects vertical roadside posts (delineators) as indirect lane indicators, then fits smooth lane trajectories using parameterized Bezier curve model while leveraging spatial consistency and road geometry.

Result: Significantly improved robustness in adverse weather compared to state-of-the-art lane detection systems, particularly in heavy snow occlusion scenarios.

Conclusion: Establishes foundation for reliable lane detection in winter scenarios and contributes valuable resource for all-weather autonomous driving research with publicly available SnowyLane dataset.

Abstract: Lane detection for autonomous driving in snow-covered environments remains a major challenge due to the frequent absence or occlusion of lane markings. In this paper, we present a novel, robust, and real-time-capable approach that bypasses the reliance on traditional lane markings by detecting roadside features, specifically vertical roadside posts called delineators, as indirect lane indicators. Our method first perceives these posts, then fits a smooth lane trajectory using a parameterized Bezier curve model, leveraging spatial consistency and road geometry. To support training and evaluation in these challenging scenarios, we introduce SnowyLane, a new synthetic dataset containing 80,000 annotated frames that capture winter driving conditions with varying snow coverage and lighting conditions. Compared to state-of-the-art lane detection systems, our approach demonstrates significantly improved robustness in adverse weather, particularly in cases with heavy snow occlusion. This work establishes a strong foundation for reliable lane detection in winter scenarios and contributes a valuable resource for future research in all-weather autonomous driving. The dataset is available at https://ekut-es.github.io/snowy-lane
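
Fitting a Bezier curve through detected post positions can be sketched as a small least-squares problem. The example below assumes a quadratic curve whose endpoints are pinned to the first and last detected post, and solves for the middle control point in closed form. The curve degree, the chord-length parameterization, and the function names are assumptions; the abstract only says that a parameterized Bezier model is used.

```python
import math

def bezier_point(p0, p1, p2, t):
    """Evaluate a quadratic Bezier curve at parameter t in [0, 1]."""
    a, b, c = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
    return (a * p0[0] + b * p1[0] + c * p2[0],
            a * p0[1] + b * p1[1] + c * p2[1])

def chord_length_params(points):
    """Assign each point a parameter proportional to cumulative distance."""
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    return [d / dists[-1] for d in dists]

def fit_quadratic_bezier(points, ts=None):
    """Least-squares middle control point; endpoints are pinned (assumption)."""
    if len(points) < 3:
        raise ValueError("need at least three points to fit the control point")
    if ts is None:
        ts = chord_length_params(points)
    p0, p2 = points[0], points[-1]
    num_x = num_y = den = 0.0
    for (x, y), t in zip(points, ts):
        b = 2 * t * (1 - t)
        # Residual after removing the fixed endpoint contributions.
        rx = x - (1 - t) ** 2 * p0[0] - t ** 2 * p2[0]
        ry = y - (1 - t) ** 2 * p0[1] - t ** 2 * p2[1]
        num_x += b * rx
        num_y += b * ry
        den += b * b
    return p0, (num_x / den, num_y / den), p2
```

Calling `fit_quadratic_bezier(posts)` on an ordered list of `(x, y)` post positions returns the three control points, which `bezier_point` can then sample to draw the estimated lane boundary.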

[113] Another BRIXEL in the Wall: Towards Cheaper Dense Features

Alexander Lappe, Martin A. Giese

Main category: cs.CV

TL;DR: BRIXEL is a knowledge distillation method that enables DINOv3 models to produce high-resolution feature maps more efficiently by having students learn to generate their own feature maps at higher resolutions.

Details

Motivation: Vision foundation models like DINOv3 require high-resolution inputs and significant computational resources due to transformer's quadratic complexity, limiting their practical deployment.

Method: Proposes BRIXEL, a knowledge distillation approach where the student model learns to reproduce its own feature maps at higher resolution through a simple distillation process.

Result: Outperforms baseline DINOv3 models by large margins on downstream tasks at fixed resolution, and produces similar feature maps to the teacher with significantly reduced computational cost.

Conclusion: BRIXEL provides an effective solution to reduce computational requirements while maintaining high-quality dense feature maps, making high-resolution vision models more practical for deployment.

Abstract: Vision foundation models achieve strong performance on both global and locally dense downstream tasks. Pretrained on large images, the recent DINOv3 model family is able to produce very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the squared complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. Moreover, it is able to produce feature maps that are very similar to those of the teacher at a fraction of the computational cost. Code and model weights are available at https://github.com/alexanderlappe/BRIXEL.

[114] MUSE: Multi-Scale Dense Self-Distillation for Nucleus Detection and Classification

Zijiang Yang, Hanqing Chao, Bokai Zhao, Yelin Yang, Yunshuo Zhang, Dongmei Fu, Junping Zhang, Le Lu, Ke Yan, Dakai Jin, Minfeng Xu, Yun Bian, Hui Jiang

Main category: cs.CV

TL;DR: MUSE is a self-supervised learning method for nucleus detection and classification that uses multi-scale dense self-distillation with nucleus-based local alignment, eliminating the need for labor-intensive nucleus-level annotations.

Details

Motivation: Existing methods for nucleus detection and classification rely heavily on labor-intensive nucleus-level annotations and fail to effectively leverage large-scale unlabeled data for learning discriminative nucleus representations.

Method: Proposes MUSE with NuLo (Nucleus-based Local self-distillation) - a coordinate-guided mechanism for flexible local self-distillation using predicted nucleus positions, enabling cross-scale alignment without strict spatial alignment requirements. Also includes an encoder-decoder architecture and large field-of-view semi-supervised fine-tuning.

Result: Extensive experiments on three benchmarks show MUSE surpasses state-of-the-art supervised baselines and outperforms generic pathology foundation models, effectively addressing core challenges of histopathological nucleus detection and classification.

Conclusion: MUSE provides an effective self-supervised solution for nucleus detection and classification that reduces annotation dependency while achieving superior performance through multi-scale representation learning.

Abstract: Nucleus detection and classification (NDC) in histopathology analysis is a fundamental task that underpins a wide range of high-level pathology applications. However, existing methods heavily rely on labor-intensive nucleus-level annotations and struggle to fully exploit large-scale unlabeled data for learning discriminative nucleus representations. In this work, we propose MUSE (MUlti-scale denSE self-distillation), a novel self-supervised learning method tailored for NDC. At its core is NuLo (Nucleus-based Local self-distillation), a coordinate-guided mechanism that enables flexible local self-distillation based on predicted nucleus positions. By removing the need for strict spatial alignment between augmented views, NuLo allows critical cross-scale alignment, thus unlocking the capacity of models for fine-grained nucleus-level representation. To support MUSE, we design a simple yet effective encoder-decoder architecture and a large field-of-view semi-supervised fine-tuning strategy that together maximize the value of unlabeled pathology images. Extensive experiments on three widely used benchmarks demonstrate that MUSE effectively addresses the core challenges of histopathological NDC. The resulting models not only surpass state-of-the-art supervised baselines but also outperform generic pathology foundation models.

[115] Walk the Lines 2: Contour Tracking for Detailed Segmentation

André Peter Kelm, Max Braeschke, Emre Gülsoylu, Simone Frintrop

Main category: cs.CV

TL;DR: WtL2 is an enhanced contour tracking algorithm for detailed segmentation of IR ships and RGB objects, extending the original WtL method to handle infrared imagery and diverse object types.

Details

Motivation: To extend the original Walk the Lines algorithm beyond color ship segmentation to handle infrared imagery and diverse RGB objects, providing detailed segmentation for specialized applications.

Method: Uses contour tracking to refine object contours until achieving 1-pixel-wide closed shapes that can be binarized, replacing standard NMS. Adapts input object contour detector for IR ships and enhances algorithm for diverse RGB objects.

Result: Outperforms latest contour-based methods in achieving closed object contours, offers high peak IoU with impressive details, and broadens application range to IR and diverse RGB objects.

Conclusion: WtL2 is a compelling method for specialized applications requiring detailed segmentation or high-quality samples, potentially accelerating progress in niche areas of image segmentation.

Abstract: This paper presents Walk the Lines 2 (WtL2), a unique contour tracking algorithm specifically adapted for detailed segmentation of infrared (IR) ships and various objects in RGB. This extends the original Walk the Lines (WtL) [12], which focused solely on detailed ship segmentation in color. These innovative WtLs can replace the standard non-maximum suppression (NMS) by using contour tracking to refine the object contour until a 1-pixel-wide closed shape can be binarized, forming a segmentable area in foreground-background scenarios. WtL2 broadens the application range of WtL beyond its original scope, adapting to IR and expanding to diverse objects within the RGB context. To achieve IR segmentation, we adapt its input, the object contour detector, to IR ships. In addition, the algorithm is enhanced to process a wide range of RGB objects, outperforming the latest generation of contour-based methods when achieving a closed object contour, offering high peak Intersection over Union (IoU) with impressive details. This positions WtL2 as a compelling method for specialized applications that require detailed segmentation or high-quality samples, potentially accelerating progress in several niche areas of image segmentation.
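
For readers unfamiliar with the Intersection over Union (IoU) metric cited in the abstract, the sketch below computes it for two binary masks stored as nested lists. Returning 1.0 when both masks are empty is a convention chosen here, not something the paper states.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention (assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0
```

A filled closed contour yields exactly this kind of binary mask, which is why a 1-pixel-wide closed shape matters: an open contour cannot be filled into a foreground region to score.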

[116] 4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos

Mengqi Guo, Bo Xu, Yanyan Li, Gim Hee Lee

Main category: cs.CV

TL;DR: 4D3R is a pose-free dynamic neural rendering framework that decouples static and dynamic components using motion-aware bundle adjustment and efficient Gaussian splatting, achieving superior performance with reduced computational cost.

Details

Motivation: Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains challenging. Existing methods like NeRF and 3DGS work well for static scenes but struggle with dynamic content and typically require pre-computed camera poses.

Method: Two-stage approach: 1) Uses 3D foundational models for initial pose/geometry estimation, 2) Motion-aware refinement with MA-BA module (transformer-based priors + SAM2 for dynamic segmentation) and MA-GS representation (control points with deformation field MLP and linear blend skinning).

Result: Achieves up to 1.8dB PSNR improvement over state-of-the-art methods, especially in challenging scenarios with large dynamic objects, while reducing computational requirements by 5x compared to previous dynamic scene representations.

Conclusion: 4D3R effectively addresses the challenge of pose-free dynamic neural rendering by decoupling static/dynamic components and introducing efficient motion modeling, demonstrating superior performance and computational efficiency.

Abstract: Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre-computed camera poses. We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our method first leverages 3D foundational models for initial pose and geometry estimation, followed by motion-aware refinement. 4D3R introduces two key technical innovations: (1) a motion-aware bundle adjustment (MA-BA) module that combines transformer-based learned priors with SAM2 for robust dynamic object segmentation, enabling more accurate camera pose refinement; and (2) an efficient Motion-Aware Gaussian Splatting (MA-GS) representation that uses control points with a deformation field MLP and linear blend skinning to model dynamic motion, significantly reducing computational cost while maintaining high-quality reconstruction. Extensive experiments on real-world dynamic datasets demonstrate that our approach achieves up to 1.8dB PSNR improvement over state-of-the-art methods, particularly in challenging scenarios with large dynamic objects, while reducing computational requirements by 5x compared to previous dynamic scene representations.
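
Linear blend skinning, which MA-GS uses to move points according to a sparse set of control points, deforms a point as a weighted sum of rigid transforms. The sketch below shows only that blending step. In the paper the per-control-point transforms come from a deformation field MLP, whereas here they are passed in directly, and the function name and argument layout are assumptions.

```python
def linear_blend_skinning(point, weights, rotations, translations):
    """Deform a 3D point as a weighted sum of per-control-point rigid transforms.

    point: [x, y, z]
    weights: one blend weight per control point (assumed to sum to 1)
    rotations: one 3x3 rotation matrix (nested lists) per control point
    translations: one [tx, ty, tz] vector per control point
    """
    out = [0.0, 0.0, 0.0]
    for w, rot, trans in zip(weights, rotations, translations):
        for i in range(3):
            transformed_i = sum(rot[i][j] * point[j] for j in range(3)) + trans[i]
            out[i] += w * transformed_i
    return out
```

Because each point depends only on a handful of control points, the motion model needs far fewer parameters than a per-point deformation, which is consistent with the reduced computational cost the abstract reports.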

[117] FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction

Jiang Lin, Xinyu Chen, Song Wu, Zhiqiu Zhang, Jizhi Zhang, Ye Wang, Qiang Tang, Qian Wang, Jian Yang, Zili Yi

Main category: cs.CV

TL;DR: FreeControl is a training-free framework for semantic structural control in diffusion models that uses one-step attention extraction from a single key timestep and Latent-Condition Decoupling to enable efficient structural guidance without inversion or retraining.

Details

Motivation: Existing methods like ControlNet require handcrafted condition maps and retraining, limiting flexibility, while inversion-based approaches have high inference costs due to dual-path denoising.

Method: Performs one-step attention extraction from a single optimally chosen key timestep and reuses it throughout denoising, with Latent-Condition Decoupling to separate the key timestep from the noised latent used in attention extraction.

Result: Enables structurally and semantically aligned, visually coherent generation directly from raw images with approximately 5% additional cost, supporting compositional control via reference images from multiple sources.

Conclusion: FreeControl introduces a new paradigm for test-time control that provides efficient structural guidance without inversion or retraining, offering flexibility for intuitive compositional design and compatibility with modern diffusion models.

Abstract: Controlling the spatial and semantic structure of diffusion-generated images remains a challenge. Existing methods like ControlNet rely on handcrafted condition maps and retraining, limiting flexibility and generalization. Inversion-based approaches offer stronger alignment but incur high inference cost due to dual-path denoising. We present FreeControl, a training-free framework for semantic structural control in diffusion models. Unlike prior methods that extract attention across multiple timesteps, FreeControl performs one-step attention extraction from a single, optimally chosen key timestep and reuses it throughout denoising. This enables efficient structural guidance without inversion or retraining. To further improve quality and stability, we introduce Latent-Condition Decoupling (LCD): a principled separation of the key timestep and the noised latent used in attention extraction. LCD provides finer control over attention quality and eliminates structural artifacts. FreeControl also supports compositional control via reference images assembled from multiple sources - enabling intuitive scene layout design and stronger prompt alignment. FreeControl introduces a new paradigm for test-time control, enabling structurally and semantically aligned, visually coherent generation directly from raw images, with the flexibility for intuitive compositional design and compatibility with modern diffusion models at approximately 5 percent additional cost.

[118] Accurate online action and gesture recognition system using detectors and Deep SPD Siamese Networks

Mohamed Sanim Akremi, Rim Slama, Hedi Tabia

Main category: cs.CV

TL;DR: Proposes an online skeleton-based motion recognition system using SPD matrix representation and Siamese network for continuous detection and classification of motions in streaming data.

Details

Motivation: Existing skeleton-based approaches focus on segment-based recognition and are unsuitable for online scenarios where continuous motion detection is needed from streaming data.

Method: Two-component system: detector and classifier using Semi-Positive Definite (SPD) matrix representation and Siamese network to learn semantic similarity for predicting motion intervals in unsegmented sequences.

Result: System achieves state-of-the-art performance on hand gesture and body action recognition benchmarks, outperforming existing methods in most cases.

Conclusion: The proposed online recognition system effectively handles continuous motion detection and classification in streaming skeleton data using SPD matrices and Siamese networks.

Abstract: Online continuous motion recognition is an active research topic because it is more practical in real-life applications. Recently, skeleton-based approaches have become increasingly popular, demonstrating the power of using such 3D temporal data. However, most of these works have focused on segment-based recognition and are not suitable for online scenarios. In this paper, we propose an online recognition system for skeleton sequence streaming composed of two main components: a detector and a classifier, which use a Semi-Positive Definite (SPD) matrix representation and a Siamese network. The powerful statistical representations for the skeletal data given by the SPD matrices and the learning of their semantic similarity by the Siamese network enable the detector to predict time intervals of the motions throughout an unsegmented sequence. In addition, they ensure the classifier's capability to recognize the motion in each predicted interval. The proposed detector is flexible and able to identify the kinetic state continuously. We conduct extensive experiments on both hand gesture and body action recognition benchmarks to demonstrate the accuracy of our online recognition system, which in most cases outperforms state-of-the-art methods.
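
One common way to obtain a matrix representation of this kind from a window of skeleton frames is to take the covariance of the per-frame joint coordinates and add a small ridge to the diagonal. The sketch below follows that recipe as an assumption; the abstract does not describe how the paper actually constructs its matrices, and the function name is hypothetical.

```python
def covariance_spd(frames, eps=1e-6):
    """Covariance of per-frame feature vectors plus a small ridge on the diagonal.

    frames: list of at least two equal-length feature vectors,
            e.g. flattened joint coordinates for each frame in a window.
    """
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in frames:
        centered = [f[k] - mean[k] for k in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += centered[i] * centered[j] / (n - 1)
    for i in range(d):
        cov[i][i] += eps  # ridge keeps the matrix strictly positive definite
    return cov
```

Sliding such a window over the incoming stream gives one matrix per time step, which is the kind of input a Siamese network could compare against reference motions to decide where an action starts and ends.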

[119] ADPretrain: Advancing Industrial Anomaly Detection via Anomaly Representation Pretraining

Xincheng Yao, Yan Luo, Zefeng Qian, Chongyang Zhang

Main category: cs.CV

TL;DR: Proposes a novel anomaly detection representation learning framework that creates specialized pretrained features for industrial anomaly detection, addressing limitations of ImageNet-pretrained features through angle- and norm-oriented contrastive losses.

Details

Motivation: ImageNet-pretrained features are suboptimal for anomaly detection due to distribution shift between natural and industrial images, and pretraining doesn't focus on distinguishing normal vs abnormal patterns.

Method: Uses angle- and norm-oriented contrastive losses to maximize angle size and norm difference between normal and abnormal features, pretrained on large-scale AD dataset RealIAD with class-generalizable residual features.

Result: Extensive experiments on five AD datasets and five backbones show superior performance when replacing original features with the proposed pretrained representations.

Conclusion: The proposed AD-specific pretraining framework effectively addresses distribution shift and representation mismatch issues, providing robust and discriminative features for industrial anomaly detection tasks.

Abstract: The current mainstream and state-of-the-art anomaly detection (AD) methods are largely built on pretrained feature networks obtained through ImageNet pretraining. However, whether supervised or self-supervised, the pretraining process on ImageNet does not match the goal of anomaly detection (i.e., pretraining on natural images doesn’t aim to distinguish between normal and abnormal). Moreover, natural images and industrial image data in AD scenarios typically exhibit a distribution shift. These two issues can cause ImageNet-pretrained features to be suboptimal for AD tasks. To further promote the development of the AD field, pretrained representations designed specifically for AD tasks are much needed and very valuable. To this end, we propose a novel AD representation learning framework specially designed for learning robust and discriminative pretrained representations for industrial anomaly detection. Specifically, staying close to the goal of anomaly detection (i.e., focusing on discrepancies between normals and anomalies), we propose angle- and norm-oriented contrastive losses to maximize the angle size and norm difference between normal and abnormal features simultaneously. To avoid the distribution shift from natural images to AD images, our pretraining is performed on a large-scale AD dataset, RealIAD. To further alleviate the potential shift between pretraining data and downstream AD datasets, we learn the pretrained AD representations based on a class-generalizable representation, residual features. For evaluation, based on five embedding-based AD methods, we simply replace their original features with our pretrained representations. Extensive experiments on five AD datasets and five backbones consistently show the superiority of our pretrained features. The code is available at https://github.com/xcyao00/ADPretrain.
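
The two quantities that the angle- and norm-oriented losses are said to push apart can be computed directly for any pair of feature vectors. The helper below is not the paper's loss function; it is a hypothetical utility that only measures the angle and the norm gap between one normal and one abnormal feature, and it assumes both vectors are nonzero.

```python
import math

def angle_and_norm_gap(normal_feat, abnormal_feat):
    """Return the angle (radians) and absolute norm difference of two vectors."""
    dot = sum(a * b for a, b in zip(normal_feat, abnormal_feat))
    n1 = math.sqrt(sum(a * a for a in normal_feat))
    n2 = math.sqrt(sum(b * b for b in abnormal_feat))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.acos(cos), abs(n1 - n2)
```

A training objective in this spirit would reward larger values of both outputs for normal/abnormal pairs, so that anomalies become separable by direction and by magnitude at the same time.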

[120] OregairuChar: A Benchmark Dataset for Character Appearance Frequency Analysis in My Teen Romantic Comedy SNAFU

Qi Sun, Dingju Zhou, Lina Zhang

Main category: cs.CV

TL;DR: OregairuChar is a benchmark dataset for character appearance frequency analysis in anime, featuring 1600 annotated frames from My Teen Romantic Comedy SNAFU with 2860 bounding boxes across 11 characters.

Details

Motivation: To enable quantitative analysis of narrative structure, character prominence, and story progression in anime through appearance frequency tracking.

Method: Created a manually annotated dataset with diverse visual challenges, benchmarked object detection models, and performed fine-grained episode-level analysis of character presence over time.

Result: The dataset captures realistic visual challenges and enables revealing patterns of character prominence evolution within the narrative.

Conclusion: OregairuChar serves as a valuable resource for exploring computational narrative dynamics and character-centric storytelling in stylized media.

Abstract: The analysis of character appearance frequency is essential for understanding narrative structure, character prominence, and story progression in anime. In this work, we introduce OregairuChar, a benchmark dataset designed for appearance frequency analysis in the anime series My Teen Romantic Comedy SNAFU. The dataset comprises 1600 manually selected frames from the third season, annotated with 2860 bounding boxes across 11 main characters. OregairuChar captures diverse visual challenges, including occlusion, pose variation, and inter-character similarity, providing a realistic basis for appearance-based studies. To enable quantitative research, we benchmark several object detection models on the dataset and leverage their predictions for fine-grained, episode-level analysis of character presence over time. This approach reveals patterns of character prominence and their evolution within the narrative. By emphasizing appearance frequency, OregairuChar serves as a valuable resource for exploring computational narrative dynamics and character-centric storytelling in stylized media.
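
Turning detector predictions into episode-level appearance frequencies amounts to counting, for each episode, the frames in which each character is detected. The sketch below assumes detections arrive as `(episode, frame_id, character_label)` tuples; that layout and the placeholder labels in the usage note are hypothetical and not taken from the dataset.

```python
from collections import Counter, defaultdict

def appearance_frequency(detections):
    """Count, per episode, the frames in which each character is detected.

    detections: iterable of (episode, frame_id, character_label) tuples.
    """
    # Deduplicate so a character counts once per frame even with repeated boxes.
    seen = set(detections)
    per_episode = defaultdict(Counter)
    for episode, _frame_id, label in seen:
        per_episode[episode][label] += 1
    return {ep: dict(counts) for ep, counts in per_episode.items()}
```

For example, passing `[(1, 0, "char_a"), (1, 1, "char_a"), (2, 0, "char_b")]` yields per-episode counts that can be plotted over the season to show how a character's prominence changes.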

[121] DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu

Main category: cs.CV

TL;DR: DeepEyesV2 is an agentic multimodal model that integrates tool use (code execution, web search) into reasoning through a two-stage training pipeline: cold-start for tool-use patterns and reinforcement learning for refinement.

Details

Motivation: Agentic multimodal models need to actively invoke external tools and integrate these operations into reasoning, but direct reinforcement learning alone fails to induce robust tool-use behavior.

Method: Two-stage training pipeline: cold-start stage to establish tool-use patterns, followed by reinforcement learning stage to refine tool invocation. Uses diverse training dataset with examples where tool use is beneficial.

Result: DeepEyesV2 demonstrates effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Exhibits task-adaptive tool invocation and complex tool combinations through reinforcement learning.

Conclusion: The study provides guidance for developing agentic multimodal models, showing that reinforcement learning enables selective tool invocation based on context and complex tool combinations.

Abstract: Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.

[122] What’s on Your Plate? Inferring Chinese Cuisine Intake from Wearable IMUs

Jiaxi Yin, Pengcheng Wang, Han Ding, Fei Wang

Main category: cs.CV

TL;DR: CuisineSense is a wearable system that classifies Chinese food types using smartwatch hand motion and smart glasses head dynamics, achieving high accuracy in eating detection and food classification.

Details

Motivation: Traditional food intake monitoring methods have limitations: self-reporting suffers from recall bias, camera-based approaches raise privacy concerns, and existing wearable methods only cover limited food types, failing to address the diversity of Chinese cuisine.

Method: Uses a two-stage detection pipeline: first identifies eating states by distinguishing temporal patterns from non-eating behaviors, then conducts fine-grained food type recognition using hand motion from smartwatch and head dynamics from smart glasses.

Result: Evaluated on a dataset of 27.5 hours of IMU recordings across 11 food categories and 10 participants, achieving high accuracy in both eating state detection and food classification.

Conclusion: CuisineSense offers a practical, unobtrusive solution for wearable-based dietary monitoring that addresses the diversity of Chinese cuisine while maintaining privacy.

Abstract: Accurate food intake detection is vital for dietary monitoring and chronic disease prevention. Traditional self-report methods are prone to recall bias, while camera-based approaches raise concerns about privacy. Furthermore, existing wearable-based methods primarily focus on a limited number of food types, such as hamburgers and pizza, failing to address the vast diversity of Chinese cuisine. To bridge this gap, we propose CuisineSense, a system that classifies Chinese food types by integrating hand motion cues from a smartwatch with head dynamics from smart glasses. To filter out irrelevant daily activities, we design a two-stage detection pipeline. The first stage identifies eating states by distinguishing characteristic temporal patterns from non-eating behaviors. The second stage then conducts fine-grained food type recognition based on the motions captured during food intake. To evaluate CuisineSense, we construct a dataset comprising 27.5 hours of IMU recordings across 11 food categories and 10 participants. Experiments demonstrate that CuisineSense achieves high accuracy in both eating state detection and food classification, offering a practical solution for unobtrusive, wearable-based dietary monitoring. The system code is publicly available at https://github.com/joeeeeyin/CuisineSense.git.

[123] LiveStar: Live Streaming Assistant for Real-World Online Video Understanding

Zhenyu Yang, Kairui Zhang, Yuhang Hu, Bing Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Weiming Dong, Changsheng Xu

Main category: cs.CV

TL;DR: LiveStar is a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding, addressing limitations in existing online Video-LLMs.

Details

Motivation: Existing online Video-LLMs struggle with simultaneous processing of continuous frame-by-frame inputs and determining optimal response timing, compromising real-time responsiveness and narrative coherence.

Method: LiveStar incorporates: (1) incremental video-language alignment training for variable-length video streams; (2) response-silence decoding framework for optimal proactive response timing; (3) memory-aware acceleration via peak-end memory compression and streaming key-value cache.

Result: LiveStar achieves state-of-the-art performance with 19.5% improvement in semantic correctness, 18.1% reduced timing difference, and 12.0% FPS improvement across all five OmniStar tasks compared to existing online Video-LLMs.

Conclusion: LiveStar demonstrates superior performance in online video understanding with enhanced real-time responsiveness and narrative coherence, supported by the comprehensive OmniStar dataset for training and benchmarking.

Abstract: Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with streaming key-value cache to achieve 1.53x faster inference. We also construct an OmniStar dataset, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar’s state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.

[124] Cross-domain EEG-based Emotion Recognition with Contrastive Learning

Rui Yan, Yibo Li, Han Ding, Fei Wang

Main category: cs.CV

TL;DR: EmotionCLIP reformulates EEG emotion recognition as an EEG-text matching task using CLIP framework, achieving state-of-the-art cross-subject and cross-time performance on SEED datasets.

Details

Motivation: EEG-based emotion recognition faces challenges in feature utilization and cross-domain generalization, requiring more robust methods.

Method: Proposes EmotionCLIP with SST-LegoViT backbone that captures spatial, spectral, and temporal EEG features using multi-scale convolution and Transformer modules, framed as EEG-text matching within CLIP framework.

Result: Achieves superior cross-subject accuracies of 88.69% (SEED) and 73.50% (SEED-IV), and cross-time accuracies of 88.46% (SEED) and 77.54% (SEED-IV), outperforming existing models.

Conclusion: Multimodal contrastive learning is effective for robust EEG emotion recognition, demonstrating the success of the EEG-text matching approach.

Abstract: Electroencephalogram (EEG)-based emotion recognition is vital for affective computing but faces challenges in feature utilization and cross-domain generalization. This work introduces EmotionCLIP, which reformulates recognition as an EEG-text matching task within the CLIP framework. A tailored backbone, SST-LegoViT, captures spatial, spectral, and temporal features using multi-scale convolution and Transformer modules. Experiments on SEED and SEED-IV datasets show superior cross-subject accuracies of 88.69% and 73.50%, and cross-time accuracies of 88.46% and 77.54%, outperforming existing models. Results demonstrate the effectiveness of multimodal contrastive learning for robust EEG emotion recognition.
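The EEG-text matching idea can be illustrated with a generic CLIP-style objective. The following is a minimal sketch, not EmotionCLIP's implementation: it assumes each EEG embedding in a batch has exactly one matched text embedding at the same index, uses plain Python lists in place of the SST-LegoViT and text encoder outputs, and takes a temperature of 0.07 as an illustrative default. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target_index]


def clip_style_loss(eeg_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss; pair i is (eeg_embeddings[i], text_embeddings[i])."""
    n = len(eeg_embeddings)
    logits = [
        [cosine_similarity(e, t) / temperature for t in text_embeddings]
        for e in eeg_embeddings
    ]
    # EEG -> text direction: each row should peak on its own column.
    eeg_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text -> EEG direction: each column should peak on its own row.
    text_to_eeg = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (eeg_to_text + text_to_eeg)


def predict_label(eeg_embedding, prompt_embeddings):
    """Classify by picking the class prompt with the highest similarity."""
    scores = [cosine_similarity(eeg_embedding, p) for p in prompt_embeddings]
    return scores.index(max(scores))
```

At inference time, matching reduces to `predict_label`: the EEG embedding is compared with one text embedding per emotion class and the best match is returned. How the paper handles batches containing several samples of the same class is not stated in the summary above.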

[125] Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation

Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, Etienne Decencière

Main category: cs.CV

TL;DR: The paper exposes issues with current point cloud evaluation metrics, proposes improved metrics including Surface Normal Concordance (SNC), and introduces Diffusion Point Transformer for state-of-the-art point cloud generation.

Details

Motivation: Current metrics for evaluating generated point clouds lack robustness and fail to capture geometric fidelity and local shape consistency, necessitating better evaluation methods and generation models.

Method: Proposed improved metrics with sample alignment and Density-Aware Chamfer Distance (DCD), introduced Surface Normal Concordance (SNC) for surface similarity, and developed Diffusion Point Transformer using transformer-based architecture for point cloud generation.

Result: The proposed metrics provide more comprehensive evaluation, and Diffusion Point Transformer achieves state-of-the-art performance on ShapeNet dataset, outperforming previous solutions in quality of generated point clouds.

Conclusion: Combining improved traditional metrics with novel surface-based metrics enables more robust evaluation, while transformer-based diffusion models can generate high-fidelity 3D point clouds.

Abstract: As 3D point clouds become a cornerstone of modern technology, the need for sophisticated generative models and reliable evaluation metrics has grown exponentially. In this work, we first expose that some commonly used metrics for evaluating generated point clouds, particularly those based on Chamfer Distance (CD), lack robustness against defects and fail to capture geometric fidelity and local shape consistency when used as quality indicators. We further show that introducing sample alignment prior to distance calculation and replacing CD with Density-Aware Chamfer Distance (DCD) are simple yet essential steps to ensure the consistency and robustness of point cloud generative model evaluation metrics. While existing metrics primarily focus on directly comparing 3D Euclidean coordinates, we present a novel metric, named Surface Normal Concordance (SNC), which approximates surface similarity by comparing estimated point normals. This new metric, when combined with traditional ones, provides a more comprehensive evaluation of the quality of generated samples. Finally, leveraging recent advancements in transformer-based models for point cloud analysis, such as serialized patch attention, we propose a new architecture for generating high-fidelity 3D structures, the Diffusion Point Transformer. We perform extensive experiments and comparisons on the ShapeNet dataset, showing that our model outperforms previous solutions, particularly in terms of quality of generated point clouds, achieving a new state of the art. Code available at https://github.com/matteo-bastico/DiffusionPointTransformer.
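To make the alignment point concrete, here is a small sketch of the plain Chamfer Distance and of why a misaligned sample can be penalized even when its shape is identical. It assumes one common convention (mean of squared nearest-neighbor distances, summed over both directions); other conventions exist. Centroid centering is used only as a simple stand-in for alignment, since the paper's actual alignment procedure, DCD, and SNC are not specified in the summary above.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance (brute force, O(len(a) * len(b)))."""
    a_to_b = sum(
        min(squared_distance(p, q) for q in points_b) for p in points_a
    ) / len(points_a)
    b_to_a = sum(
        min(squared_distance(q, p) for p in points_a) for q in points_b
    ) / len(points_b)
    return a_to_b + b_to_a


def center(points):
    """Translate a cloud so that its centroid sits at the origin."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return [tuple(p[d] - centroid[d] for d in range(dims)) for p in points]


square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
shifted = [(x + 0.5, y, z) for x, y, z in square]  # same shape, translated

raw_cd = chamfer_distance(square, shifted)
aligned_cd = chamfer_distance(center(square), center(shifted))
```

Here `raw_cd` is non-zero purely because of the translation, while `aligned_cd` vanishes, which is the kind of inconsistency that aligning samples before computing distances is meant to remove.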

[126] $\mathbf{S^2LM}$: Towards Semantic Steganography via Large Language Models

Huanqi Wu, Huangbiao Xu, Runfeng Xie, Jiaxin Cai, Kaixin Zhang, Xiao Ke

Main category: cs.CV

TL;DR: Sentence-to-Image Steganography using LLMs to embed semantically rich sentence-level messages into images through a novel pipeline.

Details

Motivation: Traditional steganography struggles with embedding semantically rich, sentence-level information, but AIGC era demands higher capacity steganography.

Method: S²LM (Semantic Steganographic Language Model) uses LLMs to embed high-level textual information through a newly designed pipeline where LLM is involved throughout the entire process.

Result: Both quantitative and qualitative experiments show the method effectively unlocks new semantic steganographic capabilities for LLMs.

Conclusion: The approach enables hiding arbitrary sentence-level messages within cover images, establishing a new benchmark (IVT) for semantic steganography.

Abstract: Although steganography has made significant advancements in recent years, it still struggles to embed semantically rich, sentence-level information into carriers. However, in the era of AIGC, the capacity of steganography is more critical than ever. In this work, we present Sentence-to-Image Steganography, an instance of Semantic Steganography, a novel task that enables the hiding of arbitrary sentence-level messages within a cover image. Furthermore, we establish a benchmark named Invisible Text (IVT), comprising a diverse set of sentence-level texts as secret messages for evaluation. Finally, we present $\mathbf{S^2LM}$: Semantic Steganographic Language Model, which utilizes large language models (LLMs) to embed high-level textual information, such as sentences or even paragraphs, into images. Unlike traditional bit-level counterparts, $\mathrm{S^2LM}$ enables the integration of semantically rich content through a newly designed pipeline in which the LLM is involved throughout the entire process. Both quantitative and qualitative experiments demonstrate that our method effectively unlocks new semantic steganographic capabilities for LLMs. The source code will be released soon.

[127] Canonical Space Representation for 4D Panoptic Segmentation of Articulated Objects

Manuel Gomes, Bogdan Raducanu, Miguel Oliveira

Main category: cs.CV

TL;DR: CanonSeg4D is a novel 4D panoptic segmentation framework that uses temporal data and canonical space alignment to improve articulated object perception, outperforming state-of-the-art methods on the new Artic4D dataset.

Details

Motivation: Existing methods ignore temporal dynamics in articulated object perception, and there's a lack of 4D temporal data exploration and benchmark datasets for panoptic segmentation of dynamic objects.

Method: Proposes CanonSeg4D framework that estimates per-frame offsets to map observed object parts to a learned canonical space, enabling consistent alignment across sequential frames for enhanced part-level segmentation.

Result: Comprehensive experiments on Artic4D dataset show CanonSeg4D outperforms state-of-the-art approaches in panoptic segmentation accuracy, especially in complex scenarios.

Conclusion: Temporal modeling and canonical alignment are effective for dynamic object understanding, paving the way for advances in 4D articulated object perception.

Abstract: Articulated object perception presents significant challenges in computer vision, particularly because most existing methods ignore temporal dynamics despite the inherently dynamic nature of such objects. The use of 4D temporal data has not been thoroughly explored in articulated object perception and remains unexamined for panoptic segmentation. The lack of a benchmark dataset further hurts this field. To this end, we introduce Artic4D as a new dataset derived from PartNet Mobility and augmented with synthetic sensor data, featuring 4D panoptic annotations and articulation parameters. Building on this dataset, we propose CanonSeg4D, a novel 4D panoptic segmentation framework. This approach explicitly estimates per-frame offsets mapping observed object parts to a learned canonical space, thereby enhancing part-level segmentation. The framework employs this canonical representation to achieve consistent alignment of object parts across sequential frames. Comprehensive experiments on Artic4D demonstrate that the proposed CanonSeg4D outperforms state-of-the-art approaches in panoptic segmentation accuracy in more complex scenarios. These findings highlight the effectiveness of temporal modeling and canonical alignment in dynamic object understanding, and pave the way for future advances in 4D articulated object perception.

[128] Dense Motion Captioning

Shiyao Xu, Benedetta Liberatori, Gül Varol, Paolo Rota

Main category: cs.CV

TL;DR: Introduces Dense Motion Captioning task and CompMo dataset with 60K complex motion sequences, plus DEMO model that outperforms existing methods.

Details

Motivation: Current 3D human motion research focuses mainly on text-to-motion generation, leaving motion understanding underexplored. Existing datasets lack detailed temporal annotations and complex sequences.

Method: Created CompMo dataset with 60,000 motion sequences containing 2-10 actions each, with precise temporal boundaries. Developed DEMO model integrating large language model with motion adapter for dense temporal captioning.

Result: DEMO substantially outperforms existing methods on CompMo and adapted benchmarks, establishing strong baseline for motion understanding.

Conclusion: The work addresses the gap in motion understanding research and provides comprehensive dataset and model for dense motion captioning, enabling future research in 3D motion analysis.

Abstract: Recent advances in 3D human motion and language integration have primarily focused on text-to-motion generation, leaving the task of motion understanding relatively unexplored. We introduce Dense Motion Captioning, a novel task that aims to temporally localize and caption actions within 3D human motion sequences. Current datasets fall short in providing detailed temporal annotations and predominantly consist of short sequences featuring few actions. To overcome these limitations, we present the Complex Motion Dataset (CompMo), the first large-scale dataset featuring richly annotated, complex motion sequences with precise temporal boundaries. Built through a carefully designed data generation pipeline, CompMo includes 60,000 motion sequences, each composed of two to ten actions, accurately annotated with their temporal extents. We further present DEMO, a model that integrates a large language model with a simple motion adapter, trained to generate dense, temporally grounded captions. Our experiments show that DEMO substantially outperforms existing methods on CompMo as well as on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.

[129] AI Assisted AR Assembly: Object Recognition and Computer Vision for Augmented Reality Assisted Assembly

Alexander Htet Kyaw, Haotian Ma, Sasa Zivkovic, Jenny Sabin

Main category: cs.CV

TL;DR: AI-assisted AR assembly system using deep learning for object recognition to display step-by-step instructions and component placement guidance in real-time.

Details

Motivation: To eliminate manual searching, sorting, and labeling of assembly components by connecting instructions with real-time component locations.

Method: Deep learning-based object recognition identifies assembly components and displays bounding boxes with placement instructions in augmented reality.

Result: Successfully demonstrated through a case study involving LEGO sculpture assembly, showing feasibility of the approach.

Conclusion: The system effectively bridges digital instructions with physical assembly processes, reducing manual effort and improving assembly efficiency.

Abstract: We present an AI-assisted Augmented Reality assembly workflow that uses deep learning-based object recognition to identify different assembly components and display step-by-step instructions. For each assembly step, the system displays a bounding box around the corresponding components in the physical space, and where the component should be placed. By connecting assembly instructions with the real-time location of relevant components, the system eliminates the need for manual searching, sorting, or labeling of different components before each assembly. To demonstrate the feasibility of using object recognition for AR-assisted assembly, we highlight a case study involving the assembly of LEGO sculptures.

[130] PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization

Zehui Feng, Tian Qiu, Tong Wu, Junxuan Li, Huayuan Xu, Ting Han

Main category: cs.CV

TL;DR: PreResQ-R1 is a reinforcement learning framework for visual quality assessment that unifies score regression and ranking consistency through preference-response disentangled optimization, achieving state-of-the-art results on multiple benchmarks.

Details

Motivation: Existing multimodal large language models for quality assessment rely on supervised fine-tuning or rank-only objectives, leading to shallow reasoning, poor score calibration, and limited cross-domain generalization.

Method: A Preference-Response Disentangled RL framework with dual-branch reward formulation that separates intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization. For video quality assessment, it uses global-temporal and local-spatial data flow strategy.

Result: Achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under SRCC and PLCC metrics, with improvements of 5.30% and 2.15% respectively in IQA task, using only 6K images and 28K videos for fine-tuning.

Conclusion: The framework enables fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality, producing human-aligned reasoning traces that reveal perceptual cues underlying quality judgments.

Abstract: Visual Quality Assessment (QA) seeks to predict human perceptual judgments of visual fidelity. While recent multimodal large language models (MLLMs) show promise in reasoning about image and video quality, existing approaches mainly rely on supervised fine-tuning or rank-only objectives, resulting in shallow reasoning, poor score calibration, and limited cross-domain generalization. We propose PreResQ-R1, a Preference-Response Disentangled Reinforcement Learning framework that unifies absolute score regression and relative ranking consistency within a single reasoning-driven optimization scheme. Unlike prior QA methods, PreResQ-R1 introduces a dual-branch reward formulation that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality. To extend beyond static imagery, we further design a global-temporal and local-spatial data flow strategy for Video Quality Assessment. Remarkably, with reinforcement fine-tuning on only 6K images and 28K videos, PreResQ-R1 achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under both SRCC and PLCC metrics, surpassing by margins of 5.30% and 2.15% in IQA task, respectively. Beyond quantitative gains, it produces human-aligned reasoning traces that reveal the perceptual cues underlying quality judgments. Code and model are available.
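The two reported metrics measure different things: SRCC (Spearman rank correlation) checks whether predicted scores preserve the ranking of human ratings, while PLCC (Pearson linear correlation) checks how linearly the predicted values track them. A minimal standard-library sketch follows; the score lists are made-up illustrative values, not data from the paper, and benchmark protocols may apply additional steps before computing PLCC.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores: the ranking is perfect, the relationship is not linear.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
human = [1.0, 4.0, 9.0, 16.0, 25.0]
srcc = spearman(predicted, human)
plcc = pearson(predicted, human)
```

In this toy case `srcc` equals 1 because the ordering is preserved exactly, while `plcc` falls below 1 because the values are not linearly related. This is why a rank-only objective can leave score calibration unaddressed.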

[131] PALM: A Dataset and Baseline for Learning Multi-subject Hand Prior

Zicong Fan, Edoardo Remelli, David Dimond, Fadime Sener, Liuhao Ge, Bugra Tekin, Cem Keskin, Shreyas Hampali

Main category: cs.CV

TL;DR: PALM is a large-scale hand dataset with 13k scans from 263 subjects and 90k multi-view images, addressing the lack of diverse hand data. PALM-Net uses this data to create realistic, relightable hand avatars from single images.

Details

Motivation: Creating personalized hand avatars from images is challenging due to complex geometry, appearance, and articulation. Limited datasets with accurate 3D geometry, high-resolution imagery, and diverse subjects have hindered progress.

Method: Created PALM dataset with 13k hand scans from 263 subjects and 90k multi-view images. Developed PALM-Net, a multi-subject prior learned via physically based inverse rendering for hand geometry and material properties.

Result: PALM provides rich variation in skin tone, age, and geometry. PALM-Net enables realistic, relightable single-image hand avatar personalization.

Conclusion: PALM’s scale and diversity make it a valuable real-world resource for hand modeling and related research, addressing previous limitations in hand avatar creation.

Abstract: The ability to grasp objects, signal with gestures, and share emotion through touch all stem from the unique capabilities of human hands. Yet creating high-quality personalized hand avatars from images remains challenging due to complex geometry, appearance, and articulation, particularly under unconstrained lighting and limited views. Progress has also been limited by the lack of datasets that jointly provide accurate 3D geometry, high-resolution multiview imagery, and a diverse population of subjects. To address this, we present PALM, a large-scale dataset comprising 13k high-quality hand scans from 263 subjects and 90k multi-view images, capturing rich variation in skin tone, age, and geometry. To show its utility, we present a baseline PALM-Net, a multi-subject prior over hand geometry and material properties learned via physically based inverse rendering, enabling realistic, relightable single-image hand avatar personalization. PALM’s scale and diversity make it a valuable real-world resource for hand modeling and related research.

Laura Alejandra Encinar Gonzalez, John Folkesson, Rudolph Triebel, Riccardo Giubilato

Main category: cs.CV

TL;DR: MPRF is a multimodal pipeline using transformer-based foundation models for robust loop closure detection in GNSS-denied environments, combining visual retrieval with explicit 6-DoF pose estimation.

Details

Motivation: Existing visual place recognition fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity in unstructured environments like planetary exploration.

Method: Two-stage visual retrieval using DINOv2 features with SALAD aggregation for candidate screening, combined with SONATA-based LiDAR descriptors for geometric verification and explicit 6-DoF pose estimation.

Result: Outperforms state-of-the-art retrieval methods in precision on S3LI datasets, enhances pose estimation robustness in low-texture regions, and provides interpretable correspondences for SLAM back-ends.

Conclusion: MPRF achieves favorable trade-off between accuracy, efficiency, and reliability, demonstrating foundation models’ potential to unify place recognition and pose estimation.

Abstract: Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as in the context of planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences suitable for SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at github.com/DLR-RM/MPRF.

Aupendu Kar, Krishnendu Ghosh, Prabir Kumar Biswas

Main category: cs.CV

TL;DR: A novel continual learning approach for image restoration that modifies convolution layers to adapt knowledge from previous tasks without architectural changes, maintaining performance on existing tasks while improving new ones.

Details

Motivation: Current continual learning methods require heavy architectural modifications for new tasks in image restoration, causing computational overhead. Regularization methods are unsuitable due to different restoration tasks needing different feature processing.

Method: Simple modification of convolution layers to adapt knowledge from previous restoration tasks without changing the main backbone architecture, allowing seamless application to any deep architecture.

Result: The model can increase trainable parameters without significant computational overhead or inference time degradation. New restoration tasks can be introduced without compromising existing task performance, and new task performance improves by leveraging previous task knowledge.

Conclusion: The proposed convolution layer modification provides an effective continual learning solution for image restoration that is computationally efficient, architecture-agnostic, and maintains knowledge transfer between tasks.

Abstract: Continual learning is an emerging topic in the field of deep learning, where a model is expected to learn continuously for new upcoming tasks without forgetting previous experiences. This field has witnessed numerous advancements, but few works have been attempted in the direction of image restoration. Handling large image sizes and the divergent nature of various degradations poses a unique challenge in the restoration domain. However, existing works require heavily engineered architectural modifications for new task adaptation, resulting in significant computational overhead. Regularization-based methods are unsuitable for restoration, as different restoration challenges require different kinds of feature processing. In this direction, we propose a simple modification of the convolution layer to adapt the knowledge from previous restoration tasks without touching the main backbone architecture. Therefore, it can be seamlessly applied to any deep architecture without any structural modifications. Unlike other approaches, we demonstrate that our model can increase the number of trainable parameters without significantly increasing computational overhead or inference time. Experimental validation demonstrates that new restoration tasks can be introduced without compromising the performance of existing tasks. We also show that performance on new restoration tasks improves by adapting the knowledge from the knowledge base created by previous restoration tasks. The code is available at https://github.com/aupendu/continual-restore.

[134] Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis

Dogucan Yaman, Seymanur Akti, Fevziye Irem Eyiokur, Alexander Waibel

Main category: cs.CV

TL;DR: A text-to-talking-face synthesis framework that uses latent speech representations from HierSpeech++ to generate synchronized facial animations from text input without requiring ground-truth audio.

Details

Motivation: To create a unified framework that generates both speech and facial animations from text while maintaining tight audio-visual alignment and speaker identity preservation, overcoming distribution shifts between clean and TTS-predicted features.

Method: Two-stage training approach: pretraining on Wav2Vec2 embeddings from a Text-to-Vec module, then finetuning on TTS outputs. Uses latent speech representations to jointly condition speech and face generation.

Result: The framework produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync accuracy and visual realism.

Conclusion: The proposed method successfully enables text-to-talking-face synthesis with tight audio-visual alignment and speaker identity preservation, demonstrating superior performance over traditional cascaded approaches.

Abstract: We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.

[135] How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?

Tuan Anh Tran, Duy M. H. Nguyen, Hoai-Chau Tran, Michael Barz, Khoa D. Doan, Roger Wattenhofer, Ngo Anh Vien, Mathias Niepert, Daniel Sonntag, Paul Swoboda

Main category: cs.CV

TL;DR: gitmerge3D is a globally informed graph token merging method that reduces token count by up to 90-95% in 3D point cloud transformers while maintaining competitive performance, challenging the assumption that more tokens yield better results.

Details

Motivation: Current 3D point cloud transformers rely on dense token representations that incur high computational and memory costs, with tokens being remarkably redundant and leading to substantial inefficiency.

Method: Introduces gitmerge3D, a globally informed graph token merging method that reduces token count by up to 90-95% while maintaining performance.

Result: The method maintains competitive performance while achieving substantial computational efficiency improvements across multiple 3D vision tasks.

Conclusion: This work challenges the prevailing assumption about token quantity and provides insights for developing more efficient 3D foundation architectures, being the first to assess redundancy in large-scale 3D transformer models.

Abstract: Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io

[136] The Potential of Copernicus Satellites for Disaster Response: Retrieving Building Damage from Sentinel-1 and Sentinel-2

Olivier Dietrich, Merlin Alfredsson, Emilia Arens, Nando Metzger, Torben Peters, Linus Scheibenreif, Jan Dirk Wegner, Konrad Schindler

Main category: cs.CV

TL;DR: Medium-resolution Copernicus satellite imagery can effectively detect building damage for rapid disaster assessment, with simpler models performing better than complex architectures.

Details

Motivation: Natural disasters require rapid damage assessment, but very-high resolution imagery has limited availability. This research explores whether more widely available medium-resolution Earth observation data can support building damage assessment.

Method: Created xBD-S12 dataset with 10,315 pre- and post-disaster image pairs from Sentinel-1 and Sentinel-2 satellites, aligned with xBD benchmark. Tested various model architectures for damage detection.

Result: Building damage can be detected and mapped effectively despite 10m resolution. Simpler models generalize better to unseen disasters than complex architectures, and geospatial foundation models provide little practical benefit.

Conclusion: Copernicus medium-resolution imagery is a viable data source for rapid, wide-area damage assessment and can complement VHR imagery. Dataset, code, and models are released for further research.

Abstract: Natural disasters demand rapid damage assessment to guide humanitarian response. Here, we investigate whether medium-resolution Earth observation images from the Copernicus program can support building damage assessment, complementing very-high resolution imagery with often limited availability. We introduce xBD-S12, a dataset of 10,315 pre- and post-disaster image pairs from both Sentinel-1 and Sentinel-2, spatially and temporally aligned with the established xBD benchmark. In a series of experiments, we demonstrate that building damage can be detected and mapped rather well in many disaster scenarios, despite the moderate 10 m ground sampling distance. We also find that, for damage mapping at that resolution, architectural sophistication does not seem to bring much advantage: more complex model architectures tend to struggle with generalization to unseen disasters, and geospatial foundation models bring little practical benefit. Our results suggest that Copernicus images are a viable data source for rapid, wide-area damage assessment and could play an important role alongside VHR imagery. We release the xBD-S12 dataset, code, and trained models to support further research.

[137] Photo Dating by Facial Age Aggregation

Jakub Paplham, Vojtech Franc

Main category: cs.CV

TL;DR: A novel photo dating method that estimates when a photo was taken using facial information from multiple people in the image, leveraging a new dataset of 1.6M annotated faces with identity and birth year data.

Details

Motivation: To develop a more accurate photo dating system by utilizing facial information from multiple individuals in an image, which provides stronger temporal constraints than scene-based methods alone.

Method: Proposes a probabilistic framework that combines modern face recognition and age estimation models with career-based temporal priors, and introduces CSFD-1.6M dataset with 1.6M annotated faces from movie stills.

Result: The approach significantly outperforms scene-based baselines, with multi-face aggregation consistently improving performance, especially for images containing several identifiable individuals.

Conclusion: Leveraging multiple faces for photo dating provides more reliable temporal evidence than scene-based methods, and the publicly released dataset enables further research in multi-face information aggregation.

Abstract: We introduce a novel method for Photo Dating which estimates the year a photograph was taken by leveraging information from the faces of people present in the image. To facilitate this research, we publicly release CSFD-1.6M, a new dataset containing over 1.6 million annotated faces, primarily from movie stills, with identity and birth year annotations. Uniquely, our dataset provides annotations for multiple individuals within a single image, enabling the study of multi-face information aggregation. We propose a probabilistic framework that formally combines visual evidence from modern face recognition and age estimation models, and career-based temporal priors to infer the photo capture year. Our experiments demonstrate that aggregating evidence from multiple faces consistently improves the performance and the approach significantly outperforms strong, scene-based baselines, particularly for images containing several identifiable individuals.
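
The multi-face aggregation idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that faces contribute independent evidence, that each face comes with a known birth year and an age distribution from some age estimator, and that an optional career prior is a simple per-year weight. The function name `capture_year_posterior` and the dictionary keys are hypothetical.

```python
def capture_year_posterior(faces, candidate_years, eps=1e-9):
    """Combine per-face evidence into a normalized distribution over capture years.

    Each face is a dict with (all keys are illustrative assumptions):
      "birth_year":   int, taken from the identity annotation
      "age_probs":    dict mapping apparent age -> probability from an age estimator
      "career_prior": optional dict mapping year -> prior weight
    """
    scores = {}
    for year in candidate_years:
        score = 1.0
        for face in faces:
            # A capture year implies an age for a person with a known birth year.
            implied_age = year - face["birth_year"]
            likelihood = face["age_probs"].get(implied_age, 0.0) + eps
            if "career_prior" in face:
                prior = face["career_prior"].get(year, 0.0)
            else:
                prior = 1.0  # no prior available: treat all years equally
            # Independence assumption: multiply the evidence across faces.
            score *= likelihood * prior
        scores[year] = score
    total = sum(scores.values())
    if total <= 0.0:
        raise ValueError("no candidate year is compatible with the priors")
    return {year: score / total for year, score in scores.items()}


# Toy data: two people whose age estimates disagree slightly on the year.
faces = [
    {"birth_year": 1950, "age_probs": {30: 0.5, 31: 0.3, 32: 0.2}},
    {"birth_year": 1960, "age_probs": {20: 0.2, 21: 0.6, 22: 0.2}},
]
years = range(1980, 1983)
single = capture_year_posterior(faces[:1], years)
joint = capture_year_posterior(faces, years)
print(max(single, key=single.get), max(joint, key=joint.get))
```

In this toy example the first face alone favors 1980, while the joint estimate shifts to 1981 once the second face is included, which is the kind of effect multi-face aggregation is meant to exploit.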

[138] EventFlow: Real-Time Neuromorphic Event-Driven Classification of Two-Phase Boiling Flow Regimes

Sanghyeon Chang, Srikar Arani, Nishant Sai Nuthalapati, Youngjoon Suh, Nicholas Choi, Siavash Khodakarami, Md Rakibul Hasan Roni, Nenad Miljkovic, Aparna Chandramowlishwaran, Yoonjin Won

Main category: cs.CV

TL;DR: A real-time flow regime classification framework using neuromorphic sensors that achieves 97.6% accuracy with 0.28ms processing time, outperforming conventional optical imaging methods.

Details

Motivation: Flow boiling is efficient for heat transfer but sudden flow regime shifts disrupt thermal performance. Conventional optical imaging has high computational demands and insufficient temporal resolution for capturing transient flow behavior.

Method: Proposed a real-time framework using neuromorphic sensors that detect brightness changes at individual pixels (event-based data). Developed five classification models using both traditional image data and event-based data, with an asynchronous processing pipeline and majority voting mechanism.

Result: Event-based models outperform frame-based approaches. The event-based LSTM model achieved 97.6% classification accuracy with 0.28ms processing time, providing the best balance between accuracy and speed.

Conclusion: The framework enables reliable real-time feedback for experimental control and intelligent thermal management through continuous, low-latency predictions with stable output via majority voting.

Abstract: Flow boiling is an efficient heat transfer mechanism capable of dissipating high heat loads with minimal temperature variation, making it an ideal thermal management method. However, sudden shifts between flow regimes can disrupt thermal performance and system reliability, highlighting the need for accurate and low-latency real-time monitoring. Conventional optical imaging methods are limited by high computational demands and insufficient temporal resolution, making them inadequate for capturing transient flow behavior. To address this, we propose a real-time framework based on signals from neuromorphic sensors for flow regime classification. Neuromorphic sensors detect changes in brightness at individual pixels, which typically correspond to motion at edges, enabling fast and efficient detection without full-frame reconstruction, providing event-based information. We develop five classification models using both traditional image data and event-based data, demonstrating that models leveraging event data outperform frame-based approaches due to their sensitivity to dynamic flow features. Among these models, the event-based long short-term memory model provides the best balance between accuracy and speed, achieving 97.6% classification accuracy with a processing time of 0.28 ms. Our asynchronous processing pipeline supports continuous, low-latency predictions and delivers stable output through a majority voting mechanism, enabling reliable real-time feedback for experimental control and intelligent thermal management.
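
The majority voting step can be sketched as a sliding-window vote over the classifier's recent outputs. This is only an illustration under assumptions: the window length, the tie-breaking rule, and the regime labels `"bubbly"` and `"slug"` are hypothetical and not taken from the paper.

```python
from collections import Counter, deque


class MajorityVoter:
    """Smooth a stream of per-window class predictions with a sliding majority vote."""

    def __init__(self, window: int = 5):
        if window < 1:
            raise ValueError("window must be >= 1")
        # Only the most recent `window` predictions are kept.
        self.history = deque(maxlen=window)

    def update(self, label: str) -> str:
        """Record the newest prediction and return the most frequent recent label."""
        self.history.append(label)
        counts = Counter(self.history)
        best = max(counts.values())
        # Break ties in favor of the most recently seen label among the tied ones.
        for recent in reversed(self.history):
            if counts[recent] == best:
                return recent
        raise RuntimeError("unreachable: history is never empty here")


def smooth(predictions, window=5):
    """Apply the voter to a whole sequence and return the stabilized labels."""
    voter = MajorityVoter(window)
    return [voter.update(p) for p in predictions]


raw = ["bubbly", "bubbly", "slug", "bubbly", "slug", "slug", "slug"]
print(smooth(raw, window=3))
```

With a window of three, the isolated early `"slug"` prediction is suppressed and the output switches only once the new regime dominates the window, trading a small delay for stability.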

Xian-Hong Huang, Hui-Kai Su, Chi-Chia Sun, Jun-Wei Hsieh

Main category: cs.CV

TL;DR: A novel cross-modal object detection method combining BERT language model with CNN-based PRB-FPN-Net using ELAN, MSP, and CSP backbones, achieving 52.6% AP on COCO2017 while using half the parameters of Transformer models.

Details

Motivation: To enhance tiny object detection by integrating semantic-guided natural language processing with visual recognition, improving detection precision for small and complex objects through cross-modal alignment.

Method: Integrates BERT language model with CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net), using ELAN, MSP, and CSP backbones with lemmatization and fine-tuning techniques to align textual semantic cues with visual features.

Result: Achieves 52.6% average precision on COCO2017 validation set, outperforming YOLO-World significantly while maintaining half the parameter consumption of Transformer-based models like GLIP. Efficiently handles multi-scale objects in resource-constrained environments.

Conclusion: Demonstrates the potential of integrating natural language understanding with advanced backbone architectures, setting new benchmarks in object detection accuracy, efficiency, and real-world adaptability.

Abstract: This paper introduces a cutting-edge approach to cross-modal interaction for tiny object detection by combining semantic-guided natural language processing with advanced visual recognition backbones. The proposed method integrates the BERT language model with the CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net), incorporating innovative backbone architectures such as ELAN, MSP, and CSP to optimize feature extraction and fusion. By employing lemmatization and fine-tuning techniques, the system aligns semantic cues from textual inputs with visual features, enhancing detection precision for small and complex objects. Experimental validation using the COCO and Objects365 datasets demonstrates that the model achieves superior performance. On the COCO2017 validation set, it attains a 52.6% average precision (AP), outperforming YOLO-World significantly while maintaining half the parameter consumption of Transformer-based models like GLIP. Several tests on different backbones such as ELAN, MSP, and CSP further show efficient handling of multi-scale objects, ensuring scalability and robustness in resource-constrained environments. This study underscores the potential of integrating natural language understanding with advanced backbone architectures, setting new benchmarks in object detection accuracy, efficiency, and adaptability to real-world challenges.

[140] GroupKAN: Rethinking Nonlinearity with Grouped Spline-based KAN Modeling for Efficient Medical Image Segmentation

Guojie Li, Anwar P. P. Abdul Majeed, Muhammad Ateeq, Anh Nguyen, Fan Zhang

Main category: cs.CV

TL;DR: GroupKAN is a lightweight medical image segmentation network that improves upon U-KAN by using grouped channel transformations to reduce complexity from O(C²) to O(C²/G), achieving higher accuracy with fewer parameters.

Details

Motivation: Medical image segmentation needs accurate, lightweight, and interpretable models. Convolutional networks lack adaptive nonlinearity, Transformers have quadratic complexity and opaque attention, and U-KAN has scalability limitations due to O(C²) complexity.

Method: Introduces two novel modules: (1) Grouped KAN Transform - partitions channels into G groups for multivariate spline mappings, (2) Grouped KAN Activation - applies shared spline-based mappings within channel groups for efficient token-wise nonlinearity.

Result: Achieves 79.80% average IoU on BUSI, GlaS, and CVC benchmarks, surpassing U-KAN by +1.11% while using only 47.6% of parameters (3.02M vs 6.35M), with improved interpretability.

Conclusion: GroupKAN successfully addresses scalability limitations of U-KAN through grouped channel transformations, achieving superior performance with significantly reduced parameters while maintaining interpretability for medical image segmentation.

Abstract: Medical image segmentation requires models that are accurate, lightweight, and interpretable. Convolutional architectures lack adaptive nonlinearity and transparent decision-making, whereas Transformer architectures are hindered by quadratic complexity and opaque attention mechanisms. U-KAN addresses these challenges using Kolmogorov-Arnold Networks, achieving higher accuracy than both convolutional and attention-based methods, fewer parameters than Transformer variants, and improved interpretability compared to conventional approaches. However, its O(C^2) complexity due to full-channel transformations limits its scalability as the number of channels increases. To overcome this, we introduce GroupKAN, a lightweight segmentation network that incorporates two novel, structured functional modules: (1) Grouped KAN Transform, which partitions channels into G groups for multivariate spline mappings, reducing complexity to O(C^2/G), and (2) Grouped KAN Activation, which applies shared spline-based mappings within each channel group for efficient, token-wise nonlinearity. Evaluated on three medical benchmarks (BUSI, GlaS, and CVC), GroupKAN achieves an average IoU of 79.80 percent, surpassing U-KAN by +1.11 percent while requiring only 47.6 percent of the parameters (3.02M vs 6.35M), and shows improved interpretability.
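
The complexity reduction from O(C^2) to O(C^2/G) follows from simple counting, sketched below. The sketch counts only channel-to-channel mappings and ignores the per-mapping spline coefficients (a constant factor), so it is not a model of the actual GroupKAN parameter count; the function name `pairwise_channel_maps` and the example sizes are hypothetical.

```python
def pairwise_channel_maps(channels: int, groups: int = 1) -> int:
    """Count input-output channel mappings when channels are split into equal groups.

    With one group, every channel maps to every channel: C * C mappings.
    With G groups, each group mixes only its own C / G channels:
    G * (C / G) ** 2 = C ** 2 / G mappings.
    """
    if groups < 1 or channels % groups != 0:
        raise ValueError("channels must be divisible by a positive number of groups")
    per_group = channels // groups
    return groups * per_group * per_group


full = pairwise_channel_maps(256)               # full-channel transform
grouped = pairwise_channel_maps(256, groups=8)  # grouped transform
print(full, grouped, full // grouped)

# Sanity check on the reported parameter ratio (3.02M vs 6.35M).
print(round(3.02 / 6.35 * 100, 1))
```

For 256 channels and 8 groups the count drops by exactly a factor of 8, and the reported parameter figures are consistent with the stated 47.6 percent.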

[141] TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She

Main category: cs.CV

TL;DR: TimeSearch-R reformulates temporal search as interleaved text-video thinking using reinforcement learning, introducing GRPO-CSV to improve video reasoning completeness and achieving SOTA results on multiple benchmarks.

Details

Motivation: Existing temporal search methods rely on hand-crafted search processes without end-to-end optimization, lacking optimal search strategy learning.

Method: Proposes TimeSearch-R with GRPO-CSV (Group Relative Policy Optimization with Completeness Self-Verification), which integrates searching video clips into reasoning via RL and verifies search completeness using the same policy model.

Result: Achieves significant improvements on temporal search benchmarks (Haystack-LVBench, Haystack-Ego4D) and long-form video understanding benchmarks (VideoMME, MLVU), and sets a new state of the art on LongVideoBench with a 4.1% improvement over the base model Qwen2.5-VL and 2.0% over Video-R1.

Conclusion: TimeSearch-R establishes new state-of-the-art performance by effectively integrating temporal search with video reasoning through reinforcement learning and completeness verification.

Abstract: Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench with 4.1% improvement over the base model Qwen2.5-VL and 2.0% over the advanced video reasoning model Video-R1. Our code is available at https://github.com/Time-Search/TimeSearch-R.
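
For readers unfamiliar with GRPO, the group-relative part can be sketched as follows: several responses are sampled for the same query, and each reward is normalized against the group's mean and standard deviation. This shows only the standard advantage computation; the completeness self-verification step of GRPO-CSV is not modeled, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same query."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one query: two rewarded, two not.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that beat the group average receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.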

[142] Visual Spatial Tuning

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao

Main category: cs.CV

TL;DR: Visual Spatial Tuning (VST) enhances VLMs’ spatial abilities without extra encoders, using progressive training on large datasets (4.1M perception + 135K reasoning samples) to achieve SOTA results on spatial benchmarks.

Details

Motivation: To enhance spatial awareness in VLMs without adding extra expert encoders that cause overhead and harm general capabilities, aiming for human-like visuospatial abilities.

Method: Progressive training pipeline: supervised fine-tuning on VST-P dataset (4.1M samples across 19 spatial skills) for foundational knowledge, then reinforcement learning on VST-R dataset (135K samples) for spatial reasoning.

Result: Achieves state-of-the-art results on spatial benchmarks: 34.8% on MMSI-Bench and 61.2% on VSIBench, without compromising general capabilities.

Conclusion: VST enhances VLMs’ spatial abilities without harming general capabilities, and the same tuning paradigm significantly improves Vision-Language-Action models, paving the way for more physically grounded AI.

Abstract: Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.

[143] Thera: Aliasing-Free Arbitrary-Scale Super-Resolution with Neural Heat Fields

Alexander Becker, Rodrigo Caye Daudt, Dominik Narnhofer, Torben Peters, Nando Metzger, Jan Dirk Wegner, Konrad Schindler

Main category: cs.CV

TL;DR: Thera is an arbitrary-scale super-resolution method built on neural heat fields, a neural field formulation that models a physically exact point spread function to eliminate aliasing at no additional computational cost, outperforming existing approaches.

Details

Motivation: Existing neural field-based super-resolution methods suffer from aliasing due to point-wise queries that don't match pixel point spread functions, compromising fidelity and generalization.

Method: Proposes neural heat fields that inherently model a physically exact point spread function, enabling analytically correct anti-aliasing at any output resolution without additional computational cost.

Result: Thera substantially outperforms existing arbitrary-scale super-resolution approaches while being more parameter-efficient and offering strong theoretical guarantees.

Conclusion: Neural heat fields provide a theoretically sound and computationally efficient solution for anti-aliased arbitrary-scale super-resolution, representing a significant advancement over existing methods.

Abstract: Recent approaches to arbitrary-scale single image super-resolution (ASR) use neural fields to represent continuous signals that can be sampled at arbitrary resolutions. However, point-wise queries of neural fields do not naturally match the point spread function (PSF) of pixels, which may cause aliasing in the super-resolved image. Existing methods attempt to mitigate this by approximating an integral version of the field at each scaling factor, compromising both fidelity and generalization. In this work, we introduce neural heat fields, a novel neural field formulation that inherently models a physically exact PSF. Our formulation enables analytically correct anti-aliasing at any desired output resolution, and – unlike supersampling – at no additional cost. Building on this foundation, we propose Thera, an end-to-end ASR method that substantially outperforms existing approaches, while being more parameter-efficient and offering strong theoretical guarantees. The project page is at https://therasr.github.io.
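
One way to see why a heat-equation view can give anti-aliasing in closed form: blurring a sinusoid with a Gaussian (the heat kernel) only rescales its amplitude by exp(-omega^2 sigma^2 / 2), so no numerical integration or supersampling is needed. The sketch below checks this identity numerically. It illustrates the general principle only; treating the PSF as Gaussian and the field as a sum of sinusoids are assumptions here, not details confirmed by the abstract, and the function names are hypothetical.

```python
import math


def gaussian_blurred_sine(x, omega, sigma, half_width=8.0, steps=4001):
    """Numerically convolve sin(omega * x) with a unit-mass Gaussian of std sigma."""
    lo = -half_width * sigma
    dt = (2.0 * half_width * sigma) / (steps - 1)
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(steps):
        t = lo + i * dt
        kernel = math.exp(-0.5 * (t / sigma) ** 2) / norm
        total += kernel * math.sin(omega * (x - t)) * dt
    return total


def attenuated_sine(x, omega, sigma):
    """Closed form of the same blur: the sinusoid is only scaled in amplitude."""
    return math.exp(-0.5 * (omega * sigma) ** 2) * math.sin(omega * x)


x, omega, sigma = 0.3, 5.0, 0.2
print(gaussian_blurred_sine(x, omega, sigma), attenuated_sine(x, omega, sigma))
```

Higher frequencies are attenuated more strongly as the blur width grows, which is the low-pass behavior one wants when rendering at a coarser output resolution.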

[144] FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

Barbara Toniella Corradini, Mustafa Shukor, Paul Couairon, Guillaume Couairon, Franco Scarselli, Matthieu Cord

Main category: cs.CV

TL;DR: FreeSeg-Diff is a zero-shot, training-free open-vocabulary segmentation method that leverages foundation models (BLIP, Stable Diffusion, CLIP) to generate segmentation masks without pixel-level annotations or model training.

Details

Motivation: To explore the spatial representations in image generative models beyond image generation, specifically for dense visual prediction tasks like segmentation, while avoiding the high annotation costs and training requirements of traditional methods.

Method: Pipeline uses BLIP for text description, Stable Diffusion for visual representation, clustering and binarization for class-agnostic masks, CLIP for open-vocabulary mapping, and refinement for precise segmentation.

Result: Outperforms many training-based approaches on Pascal VOC and COCO datasets, shows competitive results compared to weakly-supervised methods, and demonstrates superiority of diffusion model features over other pretrained models.

Conclusion: Foundation models can be effectively leveraged for zero-shot open-vocabulary segmentation without training, with diffusion models providing particularly powerful spatial representations for this task.

Abstract: Foundation models have exhibited unprecedented capabilities in tackling many domains and tasks. Models such as CLIP are currently widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leading models in terms of realistic image generation. Image generative models are trained on massive datasets that provide them with powerful internal spatial representations. In this work, we explore the potential benefits of such representations, beyond image generation, in particular, for dense visual prediction tasks. We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets, with pixel-level annotations. To avoid the annotation cost or training large diffusion models, we constrain our setup to be zero-shot and training-free. In a nutshell, our pipeline leverages different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation. The pipeline is as follows: the image is passed to both a captioner model (i.e., BLIP) and a diffusion model (i.e., Stable Diffusion Model) to generate a text description and visual representation, respectively. The features are clustered and binarized to obtain class-agnostic masks for each object. These masks are then mapped to a textual class, using the CLIP model to support open-vocabulary. Finally, we add a refinement step that allows us to obtain a more precise segmentation mask. Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets. In addition, we show very competitive results compared to the recent weakly-supervised segmentation approaches. We provide comprehensive experiments showing the superiority of diffusion model features compared to other pretrained models. Project page: https://bcorrad.github.io/freesegdiff/

[145] On Scaling Up 3D Gaussian Splatting Training

Hexu Zhao, Haoyang Weng, Daohan Lu, Ang Li, Jinyang Li, Aurojit Panda, Saining Xie

Main category: cs.CV

TL;DR: Grendel is a distributed system that enables 3D Gaussian Splatting (3DGS) training across multiple GPUs, overcoming memory limitations of single-GPU training and allowing for larger-scale 3D reconstruction with improved quality.

Details

Motivation: Current 3DGS training is limited to single GPUs, which restricts its ability to handle high-resolution and large-scale 3D reconstruction tasks due to memory constraints.

Method: Grendel partitions 3DGS parameters across multiple GPUs, uses sparse all-to-all communication to transfer necessary Gaussians to pixel partitions, performs dynamic load balancing, and supports batched training with multiple views. It employs a sqrt(batch size) scaling rule for optimization hyperparameters.

Result: On the Rubble dataset, Grendel achieved a test PSNR of 27.28 by distributing 40.4 million Gaussians across 16 GPUs, compared to a PSNR of 26.28 using 11.2 million Gaussians on a single GPU.

Conclusion: Grendel successfully scales 3DGS training across multiple GPUs, enabling larger-scale 3D reconstruction with improved rendering quality while maintaining the efficiency of the 3DGS approach.

Abstract: 3D Gaussian Splatting (3DGS) is increasingly popular for 3D reconstruction due to its superior visual quality and rendering speed. However, 3DGS training currently occurs on a single GPU, limiting its ability to handle high-resolution and large-scale 3D reconstruction tasks due to memory constraints. We introduce Grendel, a distributed system designed to partition 3DGS parameters and parallelize computation across multiple GPUs. As each Gaussian affects a small, dynamic subset of rendered pixels, Grendel employs sparse all-to-all communication to transfer the necessary Gaussians to pixel partitions and performs dynamic load balancing. Unlike existing 3DGS systems that train using one camera view image at a time, Grendel supports batched training with multiple views. We explore various optimization hyperparameter scaling strategies and find that a simple sqrt(batch size) scaling rule is highly effective. Evaluations using large-scale, high-resolution scenes show that Grendel enhances rendering quality by scaling up 3DGS parameters across multiple GPUs. On the Rubble dataset, we achieve a test PSNR of 27.28 by distributing 40.4 million Gaussians across 16 GPUs, compared to a PSNR of 26.28 using 11.2 million Gaussians on a single GPU. Grendel is an open-source project available at: https://github.com/nyu-systems/Grendel-GS
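
The sqrt(batch size) rule can be sketched in a few lines. The abstract does not say which hyperparameters are scaled or how, so applying the rule to the learning rate is an assumption of this sketch, and the function name and base values are hypothetical.

```python
import math


def scale_learning_rate(base_lr: float, batch_size: int, base_batch_size: int = 1) -> float:
    """Scale a learning rate tuned for base_batch_size by sqrt of the batch-size ratio."""
    if batch_size < 1 or base_batch_size < 1:
        raise ValueError("batch sizes must be positive")
    return base_lr * math.sqrt(batch_size / base_batch_size)


# A rate tuned for single-view training, reused with a batch of 16 views.
print(scale_learning_rate(1e-3, batch_size=16))
```

Under this rule a 16x larger batch raises the step size by 4x rather than 16x, a more conservative choice than linear scaling.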

[146] FunOTTA: On-the-Fly Adaptation on Cross-Domain Fundus Image via Stable Test-time Training

Qian Zeng, Le Zhang, Yipeng Liu, Ce Zhu, Fan Zhang

Main category: cs.CV

TL;DR: FunOTTA is a test-time adaptation framework for fundus image diagnosis that handles domain shifts from different imaging devices and locations through dynamic disambiguation and consistency regularization.

Details

Motivation: Domain shifts in fundus images from different devices and locations challenge the deployment of pre-trained models in real-world applications, requiring effective adaptation to unseen environments.

Method: Proposes FunOTTA with dynamic disambiguation in memory bank, minimization of harmful prior knowledge bias, and a new training objective with reliable class conditional estimation and consistency regularization.

Result: Superior performance compared to state-of-the-art TTA methods on cross-domain fundus image benchmarks across two diseases with different backbone networks.

Conclusion: FunOTTA effectively generalizes fundus image diagnosis models to unseen environments under strong domain shifts, demonstrating stable adaptation and reliable performance.

Abstract: Fundus images are essential for the early screening and detection of eye diseases. While deep learning models using fundus images have significantly advanced the diagnosis of multiple eye diseases, variations in images from different imaging devices and locations (known as domain shifts) pose challenges for deploying pre-trained models in real-world applications. To address this, we propose a novel Fundus On-the-fly Test-Time Adaptation (FunOTTA) framework that effectively generalizes a fundus image diagnosis model to unseen environments, even under strong domain shifts. FunOTTA stands out for its stable adaptation process by performing dynamic disambiguation in the memory bank while minimizing harmful prior knowledge bias. We also introduce a new training objective during adaptation that enables the classifier to incrementally adapt to target patterns with reliable class conditional estimation and consistency regularization. We compare our method with several state-of-the-art test-time adaptation (TTA) pipelines. Experiments on cross-domain fundus image benchmarks across two diseases demonstrate the superiority of the overall framework and individual components under different backbone networks. Code is available at https://github.com/Casperqian/FunOTTA.

[147] Dark Transformer: A Video Transformer for Action Recognition in the Dark

Anwaar Ulhaq

Main category: cs.CV

TL;DR: Dark Transformer is a video transformer-based method for action recognition in low-light conditions that uses spatiotemporal self-attention in cross-domain settings, achieving state-of-the-art performance on benchmark datasets.

Details

Motivation: Existing methods handle action recognition and dark enhancement separately, limiting end-to-end learning of spatiotemporal representations for video action classification in adverse lighting conditions.

Method: Leverages spatiotemporal self-attention mechanisms in cross-domain settings and extends video transformers to learn cross-domain knowledge for action recognition in low-light environments.

Result: Achieves state-of-the-art performance on benchmark action recognition datasets including InFAR, XD145, and ARID.

Conclusion: The approach demonstrates significant promise in addressing action recognition challenges in adverse lighting conditions and offers practical implications for real-world applications like visual surveillance and nighttime driving.

Abstract: Recognizing human actions in adverse lighting conditions presents significant challenges in computer vision, with wide-ranging applications in visual surveillance and nighttime driving. Existing methods tackle action recognition and dark enhancement separately, limiting the potential for end-to-end learning of spatiotemporal representations for video action classification. This paper introduces Dark Transformer, a novel video transformer-based approach for action recognition in low-light environments. Dark Transformer leverages spatiotemporal self-attention mechanisms in cross-domain settings to enhance cross-domain action recognition. By extending video transformers to learn cross-domain knowledge, Dark Transformer achieves state-of-the-art performance on benchmark action recognition datasets, including InFAR, XD145, and ARID. The proposed approach demonstrates significant promise in addressing the challenges of action recognition in adverse lighting conditions, offering practical implications for real-world applications.

[148] SelaVPR++: Towards Seamless Adaptation of Foundation Models for Efficient Place Recognition

Feng Lu, Tong Jin, Xiangyuan Lan, Lijun Zhang, Yunpeng Liu, Yaowei Wang, Chun Yuan

Main category: cs.CV

TL;DR: SelaVPR++ extends SelaVPR with more efficient adaptation using lightweight multi-scale convolution adapters and a novel re-ranking paradigm using binary features for initial retrieval and floating-point features for re-ranking.

Details

Motivation: To improve upon SelaVPR's inefficiencies in training time, GPU memory usage, retrieval latency, and storage usage while achieving better performance.

Method: Uses MultiConv adapters for parameter-efficient adaptation without back-propagating through backbone, similarity-constrained deep hashing for binary features, and unified training protocol across datasets.

Result: Achieves higher efficiency and better performance compared to previous methods.

Conclusion: SelaVPR++ provides a more efficient and effective solution for visual place recognition with improved adaptation and re-ranking strategies.

Abstract: Recent studies show that visual place recognition (VPR) methods using pre-trained visual foundation models can achieve promising performance. In our previous work, we proposed a novel method to realize seamless adaptation of foundation models to VPR (SelaVPR). This method can produce both global and local features that focus on discriminative landmarks to recognize places for two-stage VPR by a parameter-efficient adaptation approach. Although SelaVPR has achieved competitive results, we argue that the previous adaptation is inefficient in training time and GPU memory usage, and the re-ranking paradigm is also costly in retrieval latency and storage usage. In pursuit of higher efficiency and better performance, we propose an extension of SelaVPR, called SelaVPR++. Concretely, we first design a parameter-, time-, and memory-efficient adaptation method that uses lightweight multi-scale convolution (MultiConv) adapters to refine intermediate features from the frozen foundation backbone. This adaptation method does not back-propagate gradients through the backbone during training, and the MultiConv adapter facilitates feature interactions along the spatial axes and introduces proper local priors, thus achieving higher efficiency and better performance. Moreover, we propose an innovative re-ranking paradigm for more efficient VPR. Instead of relying on local features for re-ranking, which incurs huge overhead in latency and storage, we employ compact binary features for initial retrieval and robust floating-point (global) features for re-ranking. To obtain such binary features, we propose a similarity-constrained deep hashing method, which can be easily integrated into the VPR pipeline. Finally, we improve our training strategy and unify the training protocol of several common training datasets to merge them for better training of VPR models. Extensive experiments show that ……
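The re-ranking paradigm described above (binary features for the initial retrieval, floating-point global features for re-ranking) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the 4-bit codes, the 2-D features, and the use of Hamming distance and cosine similarity are not taken from the SelaVPR++ code.

```python
# Illustrative sketch only: a binary shortlist followed by float re-ranking.
# All names and data are hypothetical, not from the SelaVPR++ implementation.

def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes packed into integers."""
    return bin(a ^ b).count("1")


def cosine(u, v) -> float:
    """Cosine similarity between two float feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return dot / (norm_u * norm_v)


def two_stage_retrieve(query_code, query_feat, db_codes, db_feats, shortlist_size):
    """Return database indices, best match first."""
    # Stage 1: cheap shortlist using compact binary codes.
    by_hamming = sorted(range(len(db_codes)),
                        key=lambda i: hamming(query_code, db_codes[i]))
    shortlist = by_hamming[:shortlist_size]
    # Stage 2: re-rank only the shortlist with floating-point features.
    return sorted(shortlist,
                  key=lambda i: cosine(query_feat, db_feats[i]),
                  reverse=True)


# Toy database of four places.
db_codes = [0b1100, 0b1101, 0b0011, 0b1000]
db_feats = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.9, 0.1]]
ranked = two_stage_retrieve(0b1101, [0.5, 0.85], db_codes, db_feats, shortlist_size=3)
print(ranked)
```

In this toy setup the binary stage discards the database entry whose code is farthest from the query, so the more expensive float comparison runs on three candidates instead of four.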

[149] MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Zhixuan Liu, Haokun Zhu, Rui Chen, Jonathan Francis, Soonmin Hwang, Ji Zhang, Jean Oh

Main category: cs.CV

TL;DR: A diffusion-based method called MOSAIC generates privacy-preserving digital twins of multi-room indoor environments using only depth images, with improved quality through multi-view optimization.

Details

Motivation: To create privacy-preserving digital twins of indoor environments while addressing limitations of existing approaches that suffer from error accumulation in sequential or single-room constraints.

Method: Uses a Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model with multi-channel inference-time optimization that explicitly considers cross-view dependencies probabilistically.

Result: MOSAIC outperforms state-of-the-art baselines on image fidelity metrics for reconstructing complex multi-room environments, scales to complex scenes without extra training, and reduces variance during denoising.

Conclusion: MOSAIC provides an effective diffusion-based approach for generating high-quality digital twins of multi-room indoor environments from depth images while preserving privacy.

Abstract: We introduce a diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a multi-channel inference-time optimization that avoids error accumulation common in sequential or single-room constraints in panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during the denoising process when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments. Resources and code are at https://mosaic-cmubig.github.io

[150] Consistency Trajectory Matching for One-Step Generative Super-Resolution

Weiyi You, Mingyang Zhang, Leheng Zhang, Xingyu Zhou, Kexuan Shi, Shuhang Gu

Main category: cs.CV

TL;DR: CTMSR is a distillation-free super-resolution method that achieves photo-realistic results in one step using consistency training and distribution trajectory matching, eliminating the need for pre-trained diffusion models.

Details

Motivation: Current diffusion-based SR methods have high inference overhead, and distillation techniques increase training costs while limiting student model performance by teacher model constraints.

Method: Formulates Probability Flow ODE trajectory from LR to HR images, applies Consistency Training for one-step mapping, and uses Distribution Trajectory Matching loss to align SR results with natural image distributions.

Result: Achieves comparable or superior performance on synthetic and real datasets while maintaining minimal inference latency.

Conclusion: CTMSR provides an effective distillation-free alternative for fast, high-quality super-resolution without dependency on pre-trained diffusion models.

Abstract: Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into a one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model by the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the necessity of a pre-trained diffusion model. To further enhance the performance and better leverage the ground truth during the training process, we aim to align the distribution of SR results more closely with that of natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution by our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed methods can attain comparable or even superior capabilities on both synthetic and real datasets while maintaining minimal inference latency.

Yikun Ji, Yan Hong, Jiahui Zhan, Haoxing Chen, jun lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang

Main category: cs.CV

TL;DR: This paper evaluates Multi-modal Large Language Models (MLLMs) for AI-generated image detection, comparing them with traditional methods and human evaluators, and proposes a framework with six distinct prompts for more robust and explainable detection.

Details

Motivation: Address public security concerns about fake images by developing transparent detection methods that ensure strong generalization and explainability, moving beyond black-box approaches.

Method: Evaluate MLLMs against traditional detection methods and human evaluators, design six distinct prompts, and propose an integrated framework for reasoning-driven detection.

Result: Highlights the strengths and limitations of MLLMs in AI-generated image detection compared to traditional methods and human evaluation.

Conclusion: MLLMs offer promising opportunities for developing more robust, explainable, and reasoning-driven fake image detection systems that address transparency concerns in AI-generated content identification.

Abstract: Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a “black box”. Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at https://github.com/Gennadiyev/mllm-defake.

[152] TRACE: Textual Relevance Augmentation and Contextual Encoding for Multimodal Hate Detection

Girish A. Koushik, Helen Treharne, Aditya Joshi, Diptesh Kanojia

Main category: cs.CV

TL;DR: TRACE is a hierarchical multimodal framework for hateful meme detection that achieves state-of-the-art performance through visually grounded context augmentation, caption-scoring networks, and parameter-efficient fine-tuning of CLIP’s text encoder.

Details

Motivation: Social media memes present unique challenges for hate detection as they combine visual and textual elements into culturally nuanced messages that require sophisticated multimodal analysis.

Method: Hierarchical multimodal framework with visually grounded context augmentation, novel caption-scoring network to emphasize hate-relevant content, and parameter-efficient fine-tuning of CLIP’s text encoder with selective layer optimization.

Result: Achieves state-of-the-art accuracy (0.807) and F1-score (0.806) on Hateful Memes dataset, matches larger models’ performance while maintaining efficiency, and shows superior generalization on MultiOFF dataset (F1-score 0.673).

Conclusion: Robust visual grounding and nuanced text representations significantly reduce errors from benign confounders, demonstrating effective hate detection in culturally complex meme content.

Abstract: Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages. To tackle these challenges, we introduce TRACE, a hierarchical multimodal framework that leverages visually grounded context augmentation, along with a novel caption-scoring network to emphasize hate-relevant content, and parameter-efficient fine-tuning of CLIP’s text encoder. Our experiments demonstrate that selectively fine-tuning deeper text encoder layers significantly enhances performance compared to simpler projection-layer fine-tuning methods. Specifically, our framework achieves state-of-the-art accuracy (0.807) and F1-score (0.806) on the widely-used Hateful Memes dataset, matching the performance of considerably larger models while maintaining efficiency. Moreover, it achieves superior generalization on the MultiOFF offensive meme dataset (F1-score 0.673), highlighting robustness across meme categories. Additional analyses confirm that robust visual grounding and nuanced text representations significantly reduce errors caused by benign confounders. We publicly release our code to facilitate future research.

[153] ControlGS: Consistent Structural Compression Control for Deployment-Aware Gaussian Splatting

Fengdi Zhang, Yibao Sun, Hongkun Cao, Ruqi Huang

Main category: cs.CV

TL;DR: ControlGS is a framework that provides continuous control over the trade-off between model size (Gaussian count) and rendering quality in 3D Gaussian Splatting, enabling automated deployment across different devices without scene-specific tuning.

Details

Motivation: 3DGS needs a universal control mechanism to adjust quality-compression trade-off without scene-specific tuning for automated deployment across varying device capabilities and bandwidth constraints.

Method: A control-oriented optimization framework that maps the Gaussian count vs. rendering quality trade-off to a continuous, scene-agnostic control axis using a globally unified hyperparameter.

Result: ControlGS flexibly generates models biased toward either compactness or high fidelity across diverse scene scales and types, achieving higher rendering quality with same or fewer Gaussians than competing methods.

Conclusion: ControlGS provides an effective solution for automated deployment of 3DGS models by offering continuous control over the quality-compression trade-off in a scene-agnostic manner.

Abstract: 3D Gaussian Splatting (3DGS) is a highly deployable real-time method for novel view synthesis. In practice, it requires a universal, consistent control mechanism that adjusts the trade-off between rendering quality and model compression without scene-specific tuning, enabling automated deployment across different device performances and communication bandwidths. In this work, we present ControlGS, a control-oriented optimization framework that maps the trade-off between Gaussian count and rendering quality to a continuous, scene-agnostic, and highly responsive control axis. Extensive experiments across a wide range of scene scales and types (from small objects to large outdoor scenes) demonstrate that, by adjusting a globally unified control hyperparameter, ControlGS can flexibly generate models biased toward either structural compactness or high fidelity, regardless of the specific scene scale or complexity, while achieving markedly higher rendering quality with the same or fewer Gaussians compared to potential competing methods. Project page: https://zhang-fengdi.github.io/ControlGS/

[154] Dual Teacher-Student Learning for Semi-supervised Medical Image Segmentation

Pengchen Zhang, Alan J. X. Guo, Sipin Luo, Zhe Han, Lin Guo

Main category: cs.CV

TL;DR: Proposes Dual Teacher-Student Learning (DTSL) for semi-supervised medical image segmentation, using two teacher signals to create a self-paced learning curriculum that outperforms state-of-the-art methods.

Details

Motivation: To reduce costly manual annotation in medical image segmentation by improving semi-supervised learning through better self-paced curriculum design.

Method: Dual Teacher-Student Learning (DTSL) with consensus label generator that combines temporal averaging from in-group teacher and cross-architectural signals from second model group to create pseudo-labels.

Result: Consistently outperforms state-of-the-art approaches on four benchmark datasets, with semi-supervised method surpassing fully supervised counterparts on three datasets.

Conclusion: The self-paced learning design with dual teacher signals effectively creates a learning curriculum that enables semi-supervised learning to outperform fully supervised methods with limited labeled data.

Abstract: Semi-supervised learning reduces the costly manual annotation burden in medical image segmentation. A popular approach is the mean teacher (MT) strategy, which applies consistency regularization using a temporally averaged teacher model. In this work, the MT strategy is reinterpreted as a form of self-paced learning in the context of supervised learning, where agreement between the teacher’s predictions and the ground truth implicitly guides the model from easy to hard. Extending this insight to semi-supervised learning, we propose dual teacher-student learning (DTSL). It regulates the learning pace on unlabeled data using two signals: a temporally averaged signal from an in-group teacher and a cross-architectural signal from a student in a second, distinct model group. Specifically, a novel consensus label generator (CLG) creates the pseudo-labels from the agreement between these two signals, establishing an effective learning curriculum. Extensive experiments on four benchmark datasets demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches. Remarkably, on three of the four datasets, our semi-supervised method with limited labeled data surpasses its fully supervised counterparts, validating the effectiveness of our self-paced learning design.
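The two signals named in the abstract can be sketched in a few lines: a temporally averaged (mean teacher) weight update, and pseudo-labels kept only where two predictions agree. This is a minimal sketch under stated assumptions. The function names, the decay value, and the plain "agree or ignore" rule are hypothetical; the paper's consensus label generator (CLG) may combine the signals differently.

```python
# Illustrative sketch only; not the DTSL implementation.

def ema_update(teacher, student, decay=0.99):
    """Mean-teacher style temporal averaging of weights:
    teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def consensus_labels(teacher_pred, peer_pred, ignore=-1):
    """Keep a per-pixel pseudo-label only where both signals agree.
    Disagreeing pixels get `ignore` so a loss function could skip them
    (assumed rule, chosen for simplicity)."""
    return [t if t == p else ignore for t, p in zip(teacher_pred, peer_pred)]


# Toy weights and toy per-pixel class predictions.
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
pseudo = consensus_labels([0, 1, 1, 2], [0, 1, 2, 2])
print(new_teacher, pseudo)
```

Under this assumed rule, pixels where the two signals disagree contribute nothing early on and can re-enter training once the models converge on them, which is one simple way to read the "easy to hard" curriculum described above.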

[155] Towards Understanding the Mechanisms of Classifier-Free Guidance

Xiang Li, Rongrong Wang, Qing Qu

Main category: cs.CV

TL;DR: CFG improves image generation quality through three components: mean-shift towards class means, amplification of class-specific features, and suppression of generic features.

Details

Motivation: Classifier-free guidance (CFG) is widely used in state-of-the-art image generation but its underlying mechanisms are poorly understood.

Method: Analyzed CFG in a simplified linear diffusion model and verified insights in real-world nonlinear diffusion models across various noise levels.

Result: Linear CFG behavior closely resembles nonlinear CFG, revealing three key components: mean-shift, positive CPC for feature amplification, and negative CPC for feature suppression.

Conclusion: Linear analysis provides valuable insights into CFG’s mechanisms in nonlinear diffusion models, despite divergence at low noise levels.

Abstract: Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we begin by analyzing CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on CFG’s mechanism in the nonlinear regime.
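For readers unfamiliar with CFG itself, the commonly used guidance rule extrapolates from the unconditional prediction toward the conditional one. The sketch below assumes the convention in which a scale of 1 recovers the conditional prediction; other texts parameterize the scale differently. The function name and toy vectors are hypothetical, and the sketch does not reproduce the paper's mean-shift and CPC decomposition.

```python
# Illustrative sketch of the standard CFG combination rule.
# Convention assumed here: guided = uncond + scale * (cond - uncond).

def cfg_combine(uncond, cond, scale):
    """Combine unconditional and conditional model outputs element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]   # toy unconditional prediction
cond = [1.0, 3.0]     # toy class-conditional prediction

no_guidance = cfg_combine(uncond, cond, 0.0)   # equals the unconditional output
plain_cond = cfg_combine(uncond, cond, 1.0)    # equals the conditional output
amplified = cfg_combine(uncond, cond, 2.0)     # pushes past the conditional output
print(no_guidance, plain_cond, amplified)
```

Scales above 1 amplify whatever distinguishes the conditional prediction from the unconditional one, which is the behavior the abstract analyzes.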

[156] Diffusion Denoised Hyperspectral Gaussian Splatting

Sunil Kumar Narayanan, Lingjun Zhao, Lu Gan, Yongsheng Chen

Main category: cs.CV

TL;DR: DD-HGS enhances 3D Gaussian Splatting with wavelength-aware spherical harmonics, spectral loss, and diffusion denoising for efficient hyperspectral scene reconstruction, achieving state-of-the-art performance.

Details

Motivation: NeRF-based hyperspectral imaging methods face limitations in training time and rendering speed, while existing approaches struggle with full-spectral reconstruction quality.

Method: Proposes Diffusion-Denoised Hyperspectral Gaussian Splatting (DD-HGS) that enhances 3DGS with wavelength-aware spherical harmonics, KL-divergence spectral loss, and diffusion-based denoising.

Result: DD-HGS achieves state-of-the-art performance on real-world hyperspectral scenes from Hyper-NeRF dataset, with improved reconstruction quality across full spectral range.

Conclusion: DD-HGS provides an efficient and high-quality solution for 3D explicit reconstruction of hyperspectral scenes, overcoming limitations of previous implicit representation methods.

Abstract: Hyperspectral imaging (HSI) has been widely used in agricultural applications for non-destructive estimation of plant nutrient composition and precise determination of nutritional elements of samples. Recently, 3D reconstruction methods have been used to create implicit neural representations of HSI scenes, which can help localize the target object’s nutrient composition spatially and spectrally. Neural Radiance Field (NeRF) is a cutting-edge implicit representation that can be used to render hyperspectral channel compositions of each spatial location from any viewing direction. However, it faces limitations in training time and rendering speed. In this paper, we propose Diffusion-Denoised Hyperspectral Gaussian Splatting (DD-HGS), which enhances the state-of-the-art 3D Gaussian Splatting (3DGS) method with wavelength-aware spherical harmonics, a Kullback-Leibler divergence-based spectral loss, and a diffusion-based denoiser to enable 3D explicit reconstruction of hyperspectral scenes across the full spectral range. We present extensive evaluations on diverse real-world hyperspectral scenes from the Hyper-NeRF dataset to show the effectiveness of DD-HGS. The results demonstrate that DD-HGS achieves new state-of-the-art performance among previously published methods. Project page: https://dragonpg2000.github.io/DDHGS-website/
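A Kullback-Leibler divergence-based spectral loss can be sketched by treating each pixel's spectrum as a distribution over wavelength bands. This is a sketch under assumptions: the normalization, the `eps` smoothing, the KL direction, and the function names are hypothetical, and the exact loss used in DD-HGS may differ.

```python
import math

# Illustrative sketch only; not the DD-HGS loss implementation.

def normalize(spectrum, eps=1e-8):
    """Turn non-negative band intensities into a probability distribution.
    `eps` avoids division by zero and log(0) (assumed smoothing)."""
    total = sum(spectrum) + eps * len(spectrum)
    return [(v + eps) / total for v in spectrum]


def kl_spectral_loss(target, predicted):
    """KL(target || predicted) between two normalized spectra."""
    p = normalize(target)
    q = normalize(predicted)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


same = kl_spectral_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = kl_spectral_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(same, different)
```

Because both spectra are normalized first, this form compares spectral shape rather than overall brightness, so it would typically be paired with a per-band intensity loss.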

Sangbum Choi, Kyeongryeol Go, Taewoong Jang

Main category: cs.CV

TL;DR: ZERO is an industry-ready vision foundation model that uses multi-modal prompting for zero-shot deployment in industrial settings, achieving competitive performance on academic benchmarks and superior results across 37 industrial datasets.

Details

Motivation: Foundation models struggle with zero-shot deployment in industrial settings due to lack of domain-specific datasets, creating a need for models that can generalize without retraining.

Method: Trained on 0.9 million annotated samples from a proprietary billion-scale industrial dataset using multi-modal prompting (textual and visual) for generalization without retraining.

Result: Competitive performance on LVIS-Val, significantly outperforms existing models across 37 industrial datasets, and achieved 2nd place in CVPR 2025 Object Instance Detection Challenge and 4th place in Foundational Few-shot Object Detection Challenge.

Conclusion: ZERO is the first vision foundation model explicitly built for domain-specific, zero-shot industrial applications, demonstrating practical deployability and generalizability with minimal adaptation.

Abstract: Foundation models have revolutionized AI, yet they struggle with zero-shot deployment in real-world industrial settings due to a lack of high-quality, domain-specific datasets. To bridge this gap, Superb AI introduces ZERO, an industry-ready vision foundation model that leverages multi-modal prompting (textual and visual) for generalization without retraining. Trained on a compact yet representative 0.9 million annotated samples from a proprietary billion-scale industrial dataset, ZERO demonstrates competitive performance on academic benchmarks like LVIS-Val and significantly outperforms existing models across 37 diverse industrial datasets. Furthermore, ZERO achieved 2nd place in the CVPR 2025 Object Instance Detection Challenge and 4th place in the Foundational Few-shot Object Detection Challenge, highlighting its practical deployability and generalizability with minimal adaptation and limited data. To the best of our knowledge, ZERO is the first vision foundation model explicitly built for domain-specific, zero-shot industrial applications.

[158] USIGAN: Unbalanced Self-Information Feature Transport for Weakly Paired Image IHC Virtual Staining

Yue Peng, Bing Xiong, Fuqiang Chen, De Eybo, RanRan Zhang, Wanming Hu, Jing Cai, Wenjian Qin

Main category: cs.CV

TL;DR: USIGAN is a novel method for IHC virtual staining that addresses spatial heterogeneity challenges in weakly paired conditions using unbalanced self-information feature transport, improving content and pathological semantic consistency.

Details

Motivation: To overcome the challenges of spatial heterogeneity between adjacent slices in weakly paired conditions for IHC virtual staining, which can lead to inaccurate mappings and inconsistent pathological semantics.

Method: Proposes USIGAN with unbalanced self-information feature transport, Unbalanced Optimal Transport Consistency (UOT-CTM) mechanism, and Pathology Self-Correspondence (PC-SCM) mechanism to extract global morphological semantics without positional correspondence.

Result: Superior performance on two publicly available datasets across multiple clinically significant metrics including IoD and Pearson-R correlation, demonstrating better clinical relevance.

Conclusion: USIGAN effectively mitigates the impact of weak pairing on joint distributions and significantly improves content consistency and pathological semantic consistency in IHC virtual staining.

Abstract: Immunohistochemical (IHC) virtual staining is a task that generates virtual IHC images from H&E images while maintaining pathological semantic consistency with adjacent slices. This task aims to achieve cross-domain mapping between morphological structures and staining patterns through generative models, providing an efficient and cost-effective solution for pathological analysis. However, under weakly paired conditions, spatial heterogeneity between adjacent slices presents significant challenges. This can lead to inaccurate one-to-many mappings and generate results that are inconsistent with the pathological semantics of adjacent slices. To address this issue, we propose a novel unbalanced self-information feature transport for IHC virtual staining, named USIGAN, which extracts global morphological semantics without relying on positional correspondence. By removing weakly paired terms in the joint marginal distribution, we effectively mitigate the impact of weak pairing on joint distributions, thereby significantly improving the content consistency and pathological semantic consistency of the generated results. Moreover, we design the Unbalanced Optimal Transport Consistency (UOT-CTM) mechanism and the Pathology Self-Correspondence (PC-SCM) mechanism to construct correlation matrices between H&E and generated IHC at the image level, and between real IHC and generated IHC image sets at the intra-group level. Experiments conducted on two publicly available datasets demonstrate that our method achieves superior performance across multiple clinically significant metrics, such as IoD and Pearson-R correlation, demonstrating better clinical relevance.

[159] GAITEX: Human motion dataset of impaired gait and rehabilitation exercises using inertial and optical sensors

Andreas Spilz, Heiko Oppel, Jochen Werner, Kathrin Stucke-Straub, Felix Capanni, Michael Munz

Main category: cs.CV

TL;DR: A multimodal dataset of physiotherapeutic and gait exercises recorded from 19 healthy subjects using synchronized IMUs and optical MoCap, with annotations, processed data, and tools for ML tasks.

Details

Motivation: Developing robust classification models for human movement assessment requires large, diverse datasets that are costly and time-consuming to collect.

Method: Recorded data from 19 healthy subjects using synchronized IMUs (9 units) and optical marker-based MoCap (68 markers), with four markers per IMU for direct comparison. Provided processed IMU orientations, subject-specific OpenSim models, and inverse kinematics outputs.

Result: Created a comprehensive dataset containing physiotherapeutic and gait-related exercises, including correct and clinically relevant variants, with movement quality ratings and timestamped segmentations.

Conclusion: The dataset supports various machine learning tasks and comes with code for postprocessing, alignment, and validation to promote reproducibility in human movement analysis research.

Abstract: Wearable inertial measurement units (IMUs) provide a cost-effective approach to assessing human movement in clinical and everyday environments. However, developing the associated classification models for robust assessment of physiotherapeutic exercise and gait analysis requires large, diverse datasets that are costly and time-consuming to collect. We present a multimodal dataset of physiotherapeutic and gait-related exercises, including correct and clinically relevant variants, recorded from 19 healthy subjects using synchronized IMUs and optical marker-based motion capture (MoCap). It contains data from nine IMUs and 68 markers tracking full-body kinematics. Four markers per IMU allow direct comparison between IMU- and MoCap-derived orientations. We additionally provide processed IMU orientations aligned to common segment coordinate systems, subject-specific OpenSim models, inverse kinematics outputs, and visualization tools for IMU-derived orientations. The dataset is fully annotated with movement quality ratings and timestamped segmentations. It supports various machine learning tasks such as exercise evaluation, gait classification, temporal segmentation, and biomechanical parameter estimation. Code for postprocessing, alignment, inverse kinematics, and technical validation is provided to promote reproducibility.

[160] KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning

Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, Steven Sloan, Kendall N. Niles, Ken Pathak

Main category: cs.CV

TL;DR: KARMA is an efficient semantic segmentation framework for structural defect detection that uses Kolmogorov-Arnold representation instead of convolutions, achieving competitive performance with 97% fewer parameters than state-of-the-art methods.

Details

Motivation: Current deep learning methods for semantic segmentation of structural defects require millions of parameters, making them impractical for real-time inspection systems due to computational constraints.

Method: KARMA uses three technical innovations: 1) TiKAN module with low-rank factorization for KAN-based feature transformation, 2) optimized feature pyramid with separable convolutions for multi-scale analysis, and 3) static-dynamic prototype mechanism for handling class imbalance.

Result: KARMA achieves competitive or superior mean IoU performance while using only 0.959M parameters (vs 31.04M in SOTA, 97% reduction) and operating at 0.264 GFLOPS, enabling real-time deployment.

Conclusion: KARMA enables practical automated infrastructure inspection systems by providing efficient semantic segmentation without compromising accuracy, making real-time defect detection feasible.

Abstract: Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: https://github.com/faeyelab/karma.
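
The parameter figures above can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the quoted reduction and, separately, shows why a low-rank factorization shrinks a layer. The layer sizes and rank are hypothetical values chosen for illustration; they are not taken from the paper, and this is not the TiKAN module itself.

```python
# Parameter counts quoted in the abstract, in millions.
KARMA_PARAMS_M = 0.959
BASELINE_PARAMS_M = 31.04


def relative_reduction(small, large):
    """Fraction of parameters removed when going from `large` to `small`."""
    return 1.0 - small / large


def dense_params(d_in, d_out):
    """Weights in a full d_in x d_out linear map (bias ignored)."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Weights when the map is factored as (d_in x rank) @ (rank x d_out)."""
    return rank * (d_in + d_out)


reduction = relative_reduction(KARMA_PARAMS_M, BASELINE_PARAMS_M)
print(f"reported reduction: {reduction:.1%}")  # about 96.9%, which the text rounds to 97%

# Hypothetical layer sizes, used only to illustrate the effect of factorization.
d_in, d_out, rank = 256, 256, 8
print("dense:", dense_params(d_in, d_out))
print("low-rank:", low_rank_params(d_in, d_out, rank))
```

With these illustrative sizes the factored layer stores 4,096 weights instead of 65,536, which is the general mechanism behind parameter savings from low-rank factorization.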

[161] GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction

Jiahe Li, Jiawei Zhang, Youmin Zhang, Xiao Bai, Jin Zheng, Xiaohan Yu, Lin Gu

Main category: cs.CV

TL;DR: GeoSVR is a voxel-based framework that addresses representational bottlenecks in radiance field surface reconstruction by using sparse voxels with uncertainty-aware depth constraints and surface regularization for accurate, detailed surface reconstruction.

Details

Motivation: Current Gaussian Splatting approaches face representational bottlenecks, while sparse voxels offer potential for complete surface reconstruction but suffer from absent scene constraints and locality issues in surface refinement.

Method: Proposes Voxel-Uncertainty Depth Constraint to maximize monocular depth cues with voxel-oriented uncertainty, and Sparse Voxel Surface Regularization to enhance geometric consistency for tiny voxels and form sharp surfaces.

Result: Superior performance compared to existing methods across diverse challenging scenarios, excelling in geometric accuracy, detail preservation, and reconstruction completeness while maintaining high efficiency.

Conclusion: GeoSVR demonstrates that sparse voxels with proper constraints and regularization can overcome representational bottlenecks and achieve state-of-the-art surface reconstruction quality.

Abstract: Reconstructing accurate surfaces with radiance fields has achieved remarkable progress in recent years. However, prevailing approaches, primarily based on Gaussian Splatting, are increasingly constrained by representational bottlenecks. In this paper, we introduce GeoSVR, an explicit voxel-based framework that explores and extends the under-investigated potential of sparse voxels for achieving accurate, detailed, and complete surface reconstruction. Sparse voxels have the strength of preserving coverage completeness and geometric clarity, but they also bring challenges arising from absent scene constraints and locality in surface refinement. To ensure correct scene convergence, we first propose a Voxel-Uncertainty Depth Constraint that maximizes the effect of monocular depth cues while presenting a voxel-oriented uncertainty to avoid quality degradation, enabling effective and robust scene constraints while preserving highly accurate geometries. Subsequently, Sparse Voxel Surface Regularization is designed to enhance geometric consistency for tiny voxels and facilitate the voxel-based formation of sharp and accurate surfaces. Extensive experiments demonstrate our superior performance compared to existing methods across diverse challenging scenarios, excelling in geometric accuracy, detail preservation, and reconstruction completeness while maintaining high efficiency. Code is available at https://github.com/Fictionarry/GeoSVR.

[162] Self-supervised Deep Unrolled Model with Implicit Neural Representation Regularization for Accelerating MRI Reconstruction

Jingran Xu, Yuanyuan Liu, Yuanbiao Yang, Zhuo-Xu Cui, Jing Cheng, Qingyong Zhu, Nannan Zhang, Yihang Zhou, Dong Liang, Yanjie Zhu

Main category: cs.CV

TL;DR: UnrollINR is a zero-shot self-supervised MRI reconstruction method that combines physics-guided unrolled architecture with implicit neural representation as regularization, achieving superior performance at high acceleration rates without requiring external training data.

Details

Motivation: MRI scan times are prolonged, limiting clinical application. While deep learning methods show promise, most require large fully-sampled training datasets that are difficult to obtain. There's a need for scan-specific reconstruction without external training data.

Method: Proposes UnrollINR which uses physics-guided unrolled reconstruction architecture and introduces implicit neural representation (INR) as a regularization prior to constrain solution space. This overcomes CNN limitations and INR instability in ill-posed scenarios.

Result: UnrollINR significantly improves MRI reconstruction performance under high acceleration rates. At 10x acceleration, it achieves superior reconstruction compared to supervised and self-supervised learning methods.

Conclusion: UnrollINR is an effective zero-shot self-supervised method that enables high-quality MRI reconstruction without external training data, demonstrating superiority over existing approaches at high acceleration rates.

Abstract: Magnetic resonance imaging (MRI) is a vital clinical diagnostic tool, yet its application is limited by prolonged scan times. Accelerating MRI reconstruction addresses this issue by reconstructing high-fidelity MR images from undersampled k-space measurements. In recent years, deep learning-based methods have demonstrated remarkable progress. However, most methods rely on supervised learning, which requires large amounts of fully-sampled training data that are difficult to obtain. This paper proposes a novel zero-shot self-supervised reconstruction method named UnrollINR, which enables scan-specific MRI reconstruction without external training data. UnrollINR adopts a physics-guided unrolled reconstruction architecture and introduces implicit neural representation (INR) as a regularization prior to effectively constrain the solution space. This method overcomes the local bias limitation of CNNs in traditional deep unrolled methods and avoids the instability associated with relying solely on INR’s implicit regularization in highly ill-posed scenarios. Consequently, UnrollINR significantly improves MRI reconstruction performance under high acceleration rates. Experimental results show that even at a high acceleration rate of 10, UnrollINR achieves superior reconstruction performance compared to supervised and self-supervised learning methods, validating its effectiveness and superiority.

[163] EditInfinity: Image Editing with Binary-Quantized Generative Models

Jiahuan Wang, Yuxin Chen, Jun Yu, Guangming Lu, Wenjie Pei

Main category: cs.CV

TL;DR: EditInfinity adapts binary-quantized generative models for precise text-driven image editing by leveraging exact intermediate representations to overcome inversion errors in diffusion models.

Details

Motivation: Existing diffusion-based image editing methods suffer from approximation errors during image inversion due to lack of exact supervision in intermediate generative steps, limiting editing performance.

Method: Proposes EditInfinity using binary-quantized generative models (Infinity) with efficient image inversion mechanism that includes text prompting rectification and style preservation, plus holistic smoothing strategy for high-fidelity editing.

Result: Extensive experiments on PIE-Bench benchmark across add, change, and delete operations show superior performance compared to state-of-the-art diffusion-based baselines.

Conclusion: Binary-quantized generative models enable more precise image inversion and editing by providing exact intermediate representations, overcoming limitations of diffusion models in text-driven image editing.

Abstract: Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of binary-quantized generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our EditInfinity to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across ‘add’, ‘change’, and ‘delete’ editing operations demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.

[164] LoRA-Edge: Tensor-Train-Assisted LoRA for Practical CNN Fine-Tuning on Edge Devices

Hyunseok Kwak, Kyeongwon Lee, Jae-Jin Lee, Woojoo Lee

Main category: cs.CV

TL;DR: LoRA-Edge enables parameter-efficient on-device CNN fine-tuning using tensor-train assisted LoRA, achieving near-full fine-tuning accuracy with 1-2 orders of magnitude fewer parameters.

Details

Motivation: On-device fine-tuning is needed for edge applications like HAR to handle domain shift, but full fine-tuning is infeasible due to strict memory, compute, and energy constraints on edge devices.

Method: Applies TT-SVD to pre-trained convolutional layers, selectively updates only the output-side core with zero-initialization, and fuses updates back into dense kernels while preserving convolutional structure.

Result: Achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, with 1.4-3.8x faster convergence on Jetson Orin Nano, outperforming prior parameter-efficient methods.

Conclusion: LoRA-Edge makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms by significantly reducing trainable parameters while maintaining performance.

Abstract: On-device fine-tuning of CNNs is essential to withstand domain shift in edge applications such as Human Activity Recognition (HAR), yet full fine-tuning is infeasible under strict memory, compute, and energy budgets. We present LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that builds on Low-Rank Adaptation (LoRA) with tensor-train assistance. LoRA-Edge (i) applies Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional layers, (ii) selectively updates only the output-side core with zero-initialization to keep the auxiliary path inactive at the start, and (iii) fuses the update back into dense kernels, leaving inference cost unchanged. This design preserves convolutional structure and reduces the number of trainable parameters by up to two orders of magnitude compared to full fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, consistently outperforming prior parameter-efficient baselines under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and selective-core training yield 1.4-3.8x faster convergence to target F1. LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms.
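
The zero-initialization and fusion steps described above can be illustrated with a generic low-rank update. The sketch below is an assumption-laden analogue, not the paper's TT-SVD procedure: it uses a plain two-factor update `W + B @ A` on a small matrix, with hypothetical sizes, to show that a zero-initialized output-side factor leaves the pre-trained weights unchanged at the start and that the update can be folded back into a single dense matrix.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse(weights, out_factor, in_factor):
    """Fold the low-rank update into one dense matrix (no extra inference cost)."""
    return add(weights, matmul(out_factor, in_factor))


rng = random.Random(0)
D_OUT, D_IN, RANK = 4, 6, 2  # hypothetical sizes, for illustration only

W = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_OUT)]  # frozen pre-trained weights
A = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(RANK)]   # frozen input-side factor
B = [[0.0] * RANK for _ in range(D_OUT)]                            # trainable output-side factor, zero-initialized

fused_at_start = fuse(W, B, A)  # identical to W because B is all zeros

B[0][0] = 0.5  # stand-in for a single training update to the trainable factor
fused_after = fuse(W, B, A)

trainable = D_OUT * RANK  # only B is updated
total = D_OUT * D_IN      # size of the dense layer it adapts
print(f"trainable fraction in this toy setup: {trainable / total:.1%}")
```

In this toy setup only the first output row changes after the update, and the fused matrix has the same shape as the original, so the forward pass is no more expensive than before adaptation.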

[165] Improving Diagnostic Performance on Small and Imbalanced Datasets Using Class-Based Input Image Composition

Hlali Azzeddine, Majid Ben Yakhlef, Soulaiman El Hazzat

Main category: cs.CV

TL;DR: Class-Based Image Composition creates composite images from multiple same-class images to address small, imbalanced datasets and poor image quality, achieving near-perfect accuracy (99.6%) on OCT medical imaging data.

Details

Motivation: Small, imbalanced datasets and poor input image quality lead to high false prediction rates in deep learning models, especially for medical imaging tasks like retinal disease diagnosis.

Method: Proposes Class-Based Image Composition that fuses multiple images of the same class into Composite Input Images (CoImg) using 3x1 layouts, creating a balanced dataset (Co-OCTDL) from the original imbalanced OCT dataset.

Result: Achieved near-perfect performance: 99.6% accuracy, 0.995 F1-score, 0.9996 AUC, with significantly reduced false prediction rates compared to baseline models trained on raw data.

Conclusion: The method effectively enhances intra-class variance and information density, enabling high-quality predictions even with weak datasets affected by class imbalance or small sample sizes.

Abstract: Small, imbalanced datasets and poor input image quality can lead to high false prediction rates with deep learning models. This paper introduces Class-Based Image Composition, an approach that allows us to reformulate training inputs through a fusion of multiple images of the same class into combined visual composites, named Composite Input Images (CoImg). This enhances the intra-class variance, improves the valuable information density per training sample, and increases the ability of the model to distinguish between subtle disease patterns. Our method was evaluated on the Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et al., 2024), which contains 2,064 high-resolution optical coherence tomography (OCT) scans of the human retina, representing seven distinct diseases with a significant class imbalance. We constructed a perfectly class-balanced version of this dataset, named Co-OCTDL, where each scan is represented as a 3x1 layout composite image. To assess the effectiveness of this new representation, we conducted a comparative analysis between the original dataset and its variant using a VGG16 model. A fair comparison was ensured by utilizing the identical model architecture and hyperparameters for all experiments. The proposed approach markedly improved diagnostic results. The enhanced dataset achieved near-perfect accuracy (99.6%) with F1-score (0.995) and AUC (0.9996), compared to a baseline model trained on the raw dataset. The false prediction rate was also significantly lower, demonstrating that the method can produce high-quality predictions even for weak datasets affected by class imbalance or small sample size.
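
The composition step can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: images are nested lists of pixel rows, the "3x1 layout" is taken to mean three images stacked top to bottom (the summary does not specify the orientation), and class balance is obtained by drawing the same number of composites per class. The function names and toy data are hypothetical and do not come from the paper.

```python
import random


def compose_3x1(images):
    """Stack three equally wide images (lists of pixel rows) top to bottom."""
    if len(images) != 3:
        raise ValueError("expected exactly three images")
    width = len(images[0][0])
    if any(len(row) != width for img in images for row in img):
        raise ValueError("all images must share the same width")
    return [list(row) for img in images for row in img]


def build_balanced_composites(images_by_class, per_class, rng):
    """Draw `per_class` composites for every class, each from 3 same-class images."""
    balanced = {}
    for label, images in images_by_class.items():
        balanced[label] = [compose_3x1(rng.sample(images, 3)) for _ in range(per_class)]
    return balanced


# Toy, deliberately imbalanced data: 2x2 "images" filled with a constant value.
toy = {
    "class_a": [[[i, i], [i, i]] for i in range(5)],
    "class_b": [[[i, i], [i, i]] for i in range(3)],
}
balanced = build_balanced_composites(toy, per_class=4, rng=random.Random(0))
print({label: len(items) for label, items in balanced.items()})
```

Every class ends up with the same number of composites, and each composite is three times as tall as a single input image.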

[166] Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface

Yihao Luo, Xianglong He, Chuanyu Pan, Yiwen Chen, Jiaqi Wu, Yangguang Li, Wanli Ouyang, Yuanming Hu, Guang Yang, ChoonHwai Yap

Main category: cs.CV

TL;DR: Faithful Contouring is a sparse voxelized representation for 3D meshes that achieves near-lossless fidelity at 2048+ resolutions without requiring field functions or isosurface extraction, outperforming existing methods in accuracy and efficiency.

Details

Motivation: Existing voxelized representations based on iso-surface rely on water-tightening or rendering optimization, which compromise geometric fidelity. There's a need for a representation that preserves sharpness and internal structures without these limitations.

Method: Proposes Faithful Contouring - a sparse voxelized representation that doesn’t require converting meshes to field functions or extracting isosurface during remeshing. Also designs a dual-mode autoencoder for scalable and detail-preserving shape reconstruction.

Result: Achieves distance errors at 10^-5 level for direct representation. For mesh reconstruction: 93% reduction in Chamfer Distance and 35% improvement in F-score over strong baselines. Preserves sharpness and internal structures even for complex geometry and topology.

Conclusion: Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction, confirming superior fidelity as a representation for 3D learning tasks with flexibility for texturing, manipulation, and editing.

Abstract: Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93% reduction in Chamfer Distance and a 35% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.
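
For readers unfamiliar with the reconstruction metric quoted above, the following sketch computes a symmetric Chamfer Distance between two point sets. Definitions vary across papers (squared versus unsquared distances, sums versus means); this version assumes the mean unsquared nearest-neighbour distance, added over both directions, and may differ from the variant the authors use. The point sets and the numbers passed to `relative_reduction` are made-up illustrations, not results from the paper.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance using mean nearest-neighbour Euclidean distance."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # Brute-force nearest neighbour; fine for small illustrative sets.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
shifted = [(x, y, z + 0.1) for x, y, z in square]  # same square, lifted by 0.1

print(chamfer_distance(square, square))   # identical sets give zero
print(chamfer_distance(square, shifted))  # about 0.2 (0.1 in each direction)
print(relative_reduction(1.0, 0.07))      # illustrative: what a 93% reduction means
```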

[167] THEval. Evaluation Framework for Talking Head Video Generation

Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva

Main category: cs.CV

TL;DR: Proposes a new evaluation framework with 8 metrics across quality, naturalness, and synchronization dimensions to address the gap in assessing talking head video generation, where generation advances have outpaced evaluation metrics.

Details

Motivation: The rapid progress in video generation has outpaced the development of adequate evaluation metrics, with current assessment relying on limited metrics and user studies, creating a need for more comprehensive evaluation.

Method: Developed an evaluation framework with 8 metrics focusing on three dimensions: quality, naturalness, and synchronization, emphasizing efficiency and human preference alignment. Analyzed fine-grained dynamics of head, mouth, eyebrows, and face quality.

Result: Extensive experiments on 85,000 videos from 17 state-of-the-art models revealed that while many algorithms excel in lip synchronization, they struggle with generating expressiveness and artifact-free details.

Conclusion: The proposed benchmark framework aims to evaluate improvements in generative methods, with plans to publicly release code, dataset, and leaderboards that will be regularly updated to reflect field progress.

Abstract: Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on a limited set of metrics that evaluate general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on these considerations, we focus on analyzing the fine-grained dynamics of the head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated from a novel real dataset that we curated to mitigate training-data bias. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset, and leaderboards will be publicly released and regularly updated with new methods to reflect progress in the field.

cs.AI

[168] A hybrid solution approach for the Integrated Healthcare Timetabling Competition 2024

Daniela Guericke, Rolf van der Hulst, Asal Karimpour, Ieke Schrader, Matthias Walter

Main category: cs.AI

TL;DR: Team Twente’s third-place solution for Integrated Healthcare Timetabling Competition 2024 uses a 3-phase approach combining mixed-integer programming, constraint programming, and simulated annealing with problem decomposition.

Details

Motivation: To develop an effective solution for healthcare timetabling competition that can handle complex scheduling constraints and optimize resource allocation in healthcare settings.

Method: 3-phase solution approach using decomposition into subproblems, combining mixed-integer programming, constraint programming, and simulated annealing techniques.

Result: Achieved third place in the competition and provided first lower bounds on optimal solution values for benchmark instances.

Conclusion: The hybrid approach was effective but there are open problems that could further improve the solution methodology.

Abstract: We report on the algorithm, implementation and results submitted to the Integrated Healthcare Timetabling Competition 2024 by Team Twente, which placed third in the competition. Our approach combines mixed-integer programming, constraint programming and simulated annealing in a 3-phase solution approach based on decomposition into subproblems. Besides describing our approach and design decisions, we share our insights and, for the first time, lower bounds on the optimal solution values for the benchmark instances. Finally, we highlight open problems whose resolution we believe could improve our approach even further.

[169] Epistemic Reject Option Prediction

Vojtech Franc, Jakub Paplham

Main category: cs.AI

TL;DR: This paper introduces an epistemic reject-option predictor that abstains from predictions in regions of high epistemic uncertainty caused by insufficient training data, addressing limitations of traditional approaches that only consider aleatoric uncertainty.

Details

Motivation: Traditional reject-option prediction focuses only on aleatoric uncertainty, assuming large training data makes epistemic uncertainty negligible. However, in practical scenarios with limited data, this assumption is unrealistic, creating a need for methods that can identify when training data is insufficient for reliable predictions.

Method: The approach builds on Bayesian learning and redefines the optimal predictor as one that minimizes expected regret - the performance gap between the learned model and the Bayes-optimal predictor with full knowledge of the data distribution. The model abstains when the regret for a given input exceeds a specified rejection cost.

Result: The paper presents the first principled framework that enables learning predictors capable of identifying inputs for which the training data is insufficient to make reliable decisions, addressing epistemic uncertainty in reject-option prediction.

Conclusion: The epistemic reject-option predictor provides a theoretically grounded approach for abstaining from predictions in high-uncertainty regions caused by insufficient data, offering improved reliability in high-stakes applications where both accuracy and uncertainty quantification are critical.

Abstract: In high-stakes applications, predictive models must not only produce accurate predictions but also quantify and communicate their uncertainty. Reject-option prediction addresses this by allowing the model to abstain when prediction uncertainty is high. Traditional reject-option approaches focus solely on aleatoric uncertainty, an assumption valid only when large training data makes the epistemic uncertainty negligible. However, in many practical scenarios, limited data makes this assumption unrealistic. This paper introduces the epistemic reject-option predictor, which abstains in regions of high epistemic uncertainty caused by insufficient data. Building on Bayesian learning, we redefine the optimal predictor as the one that minimizes expected regret – the performance gap between the learned model and the Bayes-optimal predictor with full knowledge of the data distribution. The model abstains when the regret for a given input exceeds a specified rejection cost. To our knowledge, this is the first principled framework that enables learning predictors capable of identifying inputs for which the training data is insufficient to make reliable decisions.
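
The rejection rule described above can be made concrete for classification under 0/1 loss. The sketch below is one possible instantiation, not the paper's estimator: it assumes the posterior over models is available as a list of sampled class-probability vectors for a single input (for example, from an ensemble), predicts the class with the highest posterior-mean probability, and estimates regret as the gap between the average best achievable accuracy per sample and the accuracy of that prediction. All function names are hypothetical.

```python
def expected_regret_01(prob_samples):
    """Return (prediction, estimated expected regret) under 0/1 loss.

    `prob_samples` holds one class-probability vector per posterior sample.
    """
    n = len(prob_samples)
    num_classes = len(prob_samples[0])
    mean_probs = [sum(p[c] for p in prob_samples) / n for c in range(num_classes)]
    prediction = max(range(num_classes), key=mean_probs.__getitem__)
    # Accuracy a Bayes-optimal predictor would reach if each sample were the truth.
    oracle_accuracy = sum(max(p) for p in prob_samples) / n
    return prediction, oracle_accuracy - mean_probs[prediction]


def predict_or_reject(prob_samples, rejection_cost):
    """Abstain (return None) when the estimated regret exceeds the rejection cost."""
    prediction, regret = expected_regret_01(prob_samples)
    return None if regret > rejection_cost else prediction


# Posterior samples agree: the label is noisy (aleatoric) but no regret is expected.
agree = [[0.6, 0.4], [0.6, 0.4], [0.6, 0.4]]
# Posterior samples disagree: the data do not pin down the model (epistemic).
disagree = [[0.9, 0.1], [0.1, 0.9]]

print(predict_or_reject(agree, rejection_cost=0.2))     # predicts class 0
print(predict_or_reject(disagree, rejection_cost=0.2))  # abstains (None)
```

The two toy inputs show the distinction the abstract draws: the first has substantial label noise but consistent posterior samples, so the rule still predicts, while the second abstains because the samples disagree about which class is more likely.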

[170] DMA: Online RAG Alignment with Human Feedback

Yu Bai, Yukai Miao, Dawei Wang, Li Chen, Fei Long, Rundi Zhai, Dan Li, Yanyu Ren, Tianfeng Liu, Hongtao Xie, Ce Yang, Xuhui Cai

Main category: cs.AI

TL;DR: Dynamic Memory Alignment (DMA) is an online learning framework that uses multi-granularity human feedback to adapt retrieval-augmented generation systems in real-time, improving performance while maintaining baseline capabilities.

Details

Motivation: Traditional RAG systems use static retrieval that cannot adapt to evolving user intent and content drift, limiting their effectiveness in interactive settings.

Method: DMA organizes document-, list-, and response-level feedback into a learning pipeline with supervised training for rankers, policy optimization using response preferences, and knowledge distillation into a lightweight scorer for low-latency serving.

Result: Online deployment showed substantial improvements in human engagement, while offline tests demonstrated competitive foundational retrieval with notable gains on conversational QA benchmarks like TriviaQA and HotpotQA.

Conclusion: DMA provides a principled approach for feedback-driven, real-time adaptation in RAG systems without sacrificing baseline retrieval capabilities.

Abstract: Retrieval-augmented generation (RAG) systems often rely on static retrieval, limiting adaptation to evolving intent and content drift. We introduce Dynamic Memory Alignment (DMA), an online learning framework that systematically incorporates multi-granularity human feedback to align ranking in interactive settings. DMA organizes document-, list-, and response-level signals into a coherent learning pipeline: supervised training for pointwise and listwise rankers, policy optimization driven by response-level preferences, and knowledge distillation into a lightweight scorer for low-latency serving. Throughout this paper, memory refers to the model’s working memory, which is the entire context visible to the LLM for In-Context Learning. We adopt a dual-track evaluation protocol mirroring deployment: (i) large-scale online A/B ablations to isolate the utility of each feedback source, and (ii) few-shot offline tests on knowledge-intensive benchmarks. Online, a multi-month industrial deployment further shows substantial improvements in human engagement. Offline, DMA preserves competitive foundational retrieval while yielding notable gains on conversational QA (TriviaQA, HotpotQA). Taken together, these results position DMA as a principled approach to feedback-driven, real-time adaptation in RAG without sacrificing baseline capability.

[171] Real-Time Reasoning Agents in Evolving Environments

Yule Wen, Yixin Ye, Yanzhe Zhang, Diyi Yang, Hao Zhu

Main category: cs.AI

TL;DR: Introduces real-time reasoning for agents in dynamic environments, proposes AgileThinker that combines reactive and planning paradigms, and shows it outperforms single-paradigm approaches under time pressure.

Details

Motivation: Real-world agents need to make timely judgments in dynamic environments where hazards emerge and opportunities arise while reasoning is still unfolding, but existing language model approaches fail to account for this dynamic nature.

Method: Built Real-Time Reasoning Gym to study two paradigms: reactive agents (bounded reasoning for rapid responses) and planning agents (extended reasoning for complex problems). Proposed AgileThinker that simultaneously engages both reasoning paradigms.

Result: State-of-the-art models struggle with making logical and timely judgments in either paradigm. AgileThinker consistently outperforms single-paradigm agents as task difficulty and time pressure rise, effectively balancing reasoning depth and response latency.

Conclusion: Establishes real-time reasoning as a critical testbed for practical agents and provides foundation for temporally constrained AI systems, highlighting a path toward real-time capable agents.

Abstract: Agents in the real world must make not only logical but also timely judgments. This requires continuous awareness of the dynamic environment: hazards emerge, opportunities arise, and other agents act, while the agent’s reasoning is still unfolding. Despite advances in language model reasoning, existing approaches fail to account for this dynamic nature. We introduce real-time reasoning as a new problem formulation for agents in evolving environments and build Real-Time Reasoning Gym to demonstrate it. We study two paradigms for deploying language models in agents: (1) reactive agents, which employ language models with bounded reasoning computation for rapid responses, and (2) planning agents, which allow extended reasoning computation for complex problems. Our experiments show that even state-of-the-art models struggle with making logical and timely judgments in either paradigm. To address this limitation, we propose AgileThinker, which simultaneously engages both reasoning paradigms. AgileThinker consistently outperforms agents engaging only one reasoning paradigm as the task difficulty and time pressure rise, effectively balancing reasoning depth and response latency. Our work establishes real-time reasoning as a critical testbed for developing practical agents and provides a foundation for research in temporally constrained AI systems, highlighting a path toward real-time capable agents.

[172] ORCHID: Orchestrated Retrieval-Augmented Classification with Human-in-the-Loop Intelligent Decision-Making for High-Risk Property

Maria Mahbub, Vanessa Lama, Sanjay Das, Brian Starks, Christopher Polchek, Saffell Silvers, Lauren Deck, Prasanna Balaprakash, Tirthankar Ghosal

Main category: cs.AI

TL;DR: ORCHID is a modular agentic system for High-Risk Property classification that combines retrieval-augmented generation with human oversight to create auditable, policy-based decisions for DOE compliance workflows.

Details

Motivation: Traditional expert-only workflows for HRP classification are time-consuming, backlog-prone, and struggle to keep pace with evolving export control policies at DOE sites.

Method: Uses small cooperating agents (retrieval, description refiner, classifier, validator, feedback logger) coordinated via agent-to-agent messaging and Model Context Protocol (MCP) for model-agnostic on-premise operation. Follows Item to Evidence to Decision loop with step-by-step reasoning and on-policy citations.

Result: In preliminary tests on real HRP cases, ORCHID improves accuracy and traceability over non-agentic baseline while deferring uncertain items to Subject Matter Experts.

Conclusion: ORCHID demonstrates a practical path to trustworthy LLM assistance in sensitive DOE compliance workflows through grounded citations, SME feedback capture, and exportable audit artifacts.

Abstract: High-Risk Property (HRP) classification is critical at U.S. Department of Energy (DOE) sites, where inventories include sensitive and often dual-use equipment. Compliance must track evolving rules designated by various export control policies to make transparent and auditable decisions. Traditional expert-only workflows are time-consuming, backlog-prone, and struggle to keep pace with shifting regulatory boundaries. We demo ORCHID, a modular agentic system for HRP classification that pairs retrieval-augmented generation (RAG) with human oversight to produce policy-based outputs that can be audited. Small cooperating agents (retrieval, description refiner, classifier, validator, and feedback logger) coordinate via agent-to-agent messaging and invoke tools through the Model Context Protocol (MCP) for model-agnostic on-premise operation. The interface follows an Item to Evidence to Decision loop with step-by-step reasoning, on-policy citations, and append-only audit bundles (run-cards, prompts, evidence). In preliminary tests on real HRP cases, ORCHID improves accuracy and traceability over a non-agentic baseline while deferring uncertain items to Subject Matter Experts (SMEs). The demonstration shows single item submission, grounded citations, SME feedback capture, and exportable audit artifacts, illustrating a practical path to trustworthy LLM assistance in sensitive DOE compliance workflows.

[173] Autonomous generation of different courses of action in mechanized combat operations

Johan Schubert, Patrik Hansen, Pontus Hörling, Ronnie Johansson

Main category: cs.AI

TL;DR: A methodology for generating and evaluating military action recommendations for mechanized battalions during combat operations.

Details

Motivation: To support decision-making in military ground combat operations by systematically producing and assessing alternative courses of action.

Method: Generates thousands of individual action alternatives and evaluates them against the opponent's status and actions, unit composition, force ratios, types of offense and defense, and anticipated advance rates, using field manuals to estimate battle outcomes and advance rates. Generation and evaluation run concurrently.

Result: Produces alternative courses of action with superior outcomes that can be managed and revised as combat conditions evolve.

Conclusion: The approach facilitates real-time decision support for military commanders by providing continuously updated action recommendations within a sequential decision-making framework.

Abstract: In this paper, we propose a methodology designed to support decision-making during the execution phase of military ground combat operations, with a focus on one’s actions. This methodology generates and evaluates recommendations for various courses of action for a mechanized battalion, commencing with an initial set assessed by their anticipated outcomes. It systematically produces thousands of individual action alternatives, followed by evaluations aimed at identifying alternative courses of action with superior outcomes. These alternatives are appraised in light of the opponent’s status and actions, considering unit composition, force ratios, types of offense and defense, and anticipated advance rates. Field manuals evaluate battle outcomes and advancement rates. The processes of generation and evaluation work concurrently, yielding a variety of alternative courses of action. This approach facilitates the management of new course generation based on previously evaluated actions. As the combat unfolds and conditions evolve, revised courses of action are formulated for the decision-maker within a sequential decision-making framework.

[174] Cleaning Maintenance Logs with LLM Agents for Improved Predictive Maintenance

Valeriu Dimidov, Faisal Hawlader, Sasan Jafarnejad, Raphaël Frank

Main category: cs.AI

TL;DR: LLM-based agents show promise for cleaning automotive maintenance logs, effectively handling generic errors but struggling with domain-specific noise, offering potential for industrial PdM applications.

Details

Motivation: Economic constraints, limited datasets, and expertise shortages hinder predictive maintenance adoption in automotive sector; LLMs present opportunity to overcome these barriers and accelerate PdM transition from research to practice.

Method: Evaluate LLM agents on cleaning maintenance logs affected by six types of noise (typos, missing fields, near-duplicates, incorrect dates, etc.) to support PdM data cleaning pipelines.

Result: LLMs are effective at handling generic cleaning tasks and offer promising foundation for industrial applications, though domain-specific errors remain challenging.

Conclusion: LLM-based agents show potential for automotive PdM data cleaning, with future improvements possible through specialized training and enhanced agentic capabilities.

Abstract: Economic constraints, limited availability of datasets for reproducibility and shortages of specialized expertise have long been recognized as key challenges to the adoption and advancement of predictive maintenance (PdM) in the automotive sector. Recent progress in large language models (LLMs) presents an opportunity to overcome these barriers and speed up the transition of PdM from research to industrial practice. Under these conditions, we explore the potential of LLM-based agents to support PdM cleaning pipelines. Specifically, we focus on maintenance logs, a critical data source for training well-performing machine learning (ML) models, but one often affected by errors such as typos, missing fields, near-duplicate entries, and incorrect dates. We evaluate LLM agents on cleaning tasks involving six distinct types of noise. Our findings show that LLMs are effective at handling generic cleaning tasks and offer a promising foundation for future industrial applications. While domain-specific errors remain challenging, these results highlight the potential for further improvements through specialized training and enhanced agentic capabilities.

[175] Reasoning Is All You Need for Urban Planning AI

Sijie Yang, Jiatong Li, Filip Biljecki

Main category: cs.AI

TL;DR: This position paper introduces an Agentic Urban Planning AI Framework that integrates reasoning capabilities with multi-agent collaboration to assist human planners in decision-making, verification, and trade-off analysis.

Details

Motivation: Current AI excels at pattern recognition but lacks reasoning capabilities needed for urban planning decisions that require constraint satisfaction, value-based principles, and transparent justifications.

Method: Proposes a framework with three cognitive layers (Perception, Foundation, Reasoning) and six logic components (Analysis, Generation, Verification, Evaluation, Collaboration, Decision) through multi-agent collaboration.

Result: The framework enables AI agents to systematically explore solution spaces, verify regulatory compliance, and deliberate over trade-offs transparently while augmenting human judgment.

Conclusion: AI agents can amplify human planners’ capabilities through computational reasoning rather than replacing human judgment, addressing limitations of statistical learning alone in planning decisions.

Abstract: AI has proven highly successful at urban planning analysis – learning patterns from data to predict future conditions. The next frontier is AI-assisted decision-making: agents that recommend sites, allocate resources, and evaluate trade-offs while reasoning transparently about constraints and stakeholder values. Recent breakthroughs in reasoning AI – CoT prompting, ReAct, and multi-agent collaboration frameworks – now make this vision achievable. This position paper presents the Agentic Urban Planning AI Framework for reasoning-capable planning agents that integrates three cognitive layers (Perception, Foundation, Reasoning) with six logic components (Analysis, Generation, Verification, Evaluation, Collaboration, Decision) through a multi-agent collaboration framework. We demonstrate why planning decisions require explicit reasoning capabilities that are value-based (applying normative principles), rule-grounded (guaranteeing constraint satisfaction), and explainable (generating transparent justifications) – requirements that statistical learning alone cannot fulfill. We compare reasoning agents with statistical learning, present a comprehensive architecture with benchmark evaluation metrics, and outline critical research challenges. This framework shows how AI agents can augment human planners by systematically exploring solution spaces, verifying regulatory compliance, and deliberating over trade-offs transparently – not replacing human judgment but amplifying it with computational reasoning capabilities.

[176] Outbidding and Outbluffing Elite Humans: Mastering Liar’s Poker via Self-Play and Reinforcement Learning

Richard Dewey, Janos Botyanszki, Ciamac C. Moallemi, Andrew T. Zheng

Main category: cs.AI

TL;DR: Solly is the first AI agent to achieve elite human performance in multi-player Liar’s Poker, outperforming both human experts and large language models through deep reinforcement learning.

Details

Motivation: Previous AI breakthroughs in poker-like games focused on two-player scenarios with subdued multi-player dynamics, while real-world games often involve extensive multi-player engagement that requires more complex reasoning.

Method: Used self-play with a model-free, actor-critic, deep reinforcement learning algorithm to train the Solly agent.

Result: Solly achieved elite human level performance, winning over 50% of hands and showing positive equity in both heads-up and multi-player Liar’s Poker. It outperformed LLMs and developed novel strategies that were not easily exploitable by world-class human players.

Conclusion: The research demonstrates that deep reinforcement learning can successfully tackle complex multi-player games with extensive engagement, advancing AI capabilities beyond two-player scenarios to more realistic multi-agent environments.

Abstract: AI researchers have long focused on poker-like games as a testbed for environments characterized by multi-player dynamics, imperfect information, and reasoning under uncertainty. While recent breakthroughs have matched elite human play at no-limit Texas hold’em, the multi-player dynamics are subdued: most hands converge quickly with only two players engaged through multiple rounds of bidding. In this paper, we present Solly, the first AI agent to achieve elite human play in reduced-format Liar’s Poker, a game characterized by extensive multi-player engagement. We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm. Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads-up and multi-player Liar’s Poker. Solly also outperformed large language models (LLMs), including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized play effectively, and was not easily exploitable by world-class human players.

Yunhao Yang, Neel P. Bhatt, William Ward, Zichao Hu, Joydeep Biswas, Ufuk Topcu

Main category: cs.AI

TL;DR: A method for verifying LLM-generated robot programs against safety specifications using automaton-based representation, with a theorem ensuring composition safety and automated fine-tuning that improves training efficiency.

Details

Motivation: LLM-generated programs for robotic tasks often contain errors violating specifications, making reliable deployment infeasible without effective verification methods.

Method: Convert generated robot programs into automaton-based representation for verification against safety specifications, with automated fine-tuning using verification outcomes.

Result: 30% increase in probability of generating specification-compliant programs with training time reduced by half compared to fine-tuning on full programs.

Conclusion: The method enables reliable deployment of LLMs in real-world robotic systems by ensuring program safety and improving training efficiency through compositional verification.

Abstract: Large language models possess impressive capabilities in generating programs (e.g., Python) from natural language descriptions to execute robotic tasks. However, these generated programs often contain errors that violate externally given task specifications. Without an effective method to verify their correctness, the reliable deployment of language models in real-world systems is practically infeasible. We develop a method that converts generated robot programs into an automaton-based representation and verifies them against task-relevant safety specifications. We establish a theorem that any arbitrary combination of the verified programs will also satisfy the safety specifications. Hence, the method eliminates the need to verify complex programs composed of multiple simpler ones, reducing computation complexity. We then introduce an automated fine-tuning procedure that leverages verification outcomes for supervision. By applying the theorem, this procedure only requires training the model to generate safe sub-components, thereby improving training efficiency. Empirical results on robot applications show a 30 percent increase in the probability of generating specification-compliant programs, with training time reduced by half compared to fine-tuning on generating full programs.
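The verification step can be pictured with a small sketch. It is not the authors' tool: it assumes the generated program has already been converted to a labelled transition graph and that the safety specification is a deterministic automaton with designated violation states. The traffic-light rule and all names (`is_safe`, `SPEC`, `BAD`) are hypothetical. The sketch checks one program at a time and does not reproduce the paper's composition theorem.

```python
from collections import deque


def is_safe(program, start, spec, spec_start, bad):
    """Explore the product of program and spec automata.

    program: dict mapping a program state to a list of (action, next_state)
    spec:    dict mapping (spec_state, action) to the next spec state
    bad:     set of spec states that represent a safety violation
    """
    seen = {(start, spec_start)}
    queue = deque(seen)
    while queue:
        p, s = queue.popleft()
        if s in bad:
            return False  # a violating spec state is reachable
        for action, p_next in program.get(p, []):
            # Assumption: actions the spec does not mention leave it unchanged.
            s_next = spec.get((s, action), s)
            if (p_next, s_next) not in seen:
                seen.add((p_next, s_next))
                queue.append((p_next, s_next))
    return True


# Hypothetical rule: never "cross" unless "observe_green" happened just before.
SPEC = {
    ("waiting", "observe_green"): "clear",
    ("waiting", "cross"): "violation",
    ("clear", "cross"): "waiting",
}
BAD = {"violation"}

# A looping program that always observes before crossing.
safe_loop = {0: [("observe_green", 1)], 1: [("cross", 0)]}
# A program with one branch that crosses without observing.
unsafe = {0: [("observe_green", 1), ("cross", 2)], 1: [("cross", 0)]}

print(is_safe(safe_loop, 0, SPEC, "waiting", BAD))
print(is_safe(unsafe, 0, SPEC, "waiting", BAD))
```

Because the search runs over reachable (program state, spec state) pairs, it terminates on looping programs as long as both automata are finite.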

[178] Retrieval Augmented Diffusion Model for Structure-informed Antibody Design and Optimization

Zichen Wang, Yaokun Ji, Jianing Tian, Shuangjia Zheng

Main category: cs.AI

TL;DR: RADAb is a retrieval-augmented diffusion framework for antibody design that uses structural homologous motifs to guide generation, achieving state-of-the-art performance in antibody inverse folding and optimization.

Details

Motivation: Existing antibody design methods create antibodies from scratch without template constraints, leading to model optimization challenges and unnatural sequences.

Method: Proposes a retrieval-augmented diffusion framework with structure-informed retrieval mechanism and dual-branch denoising module that integrates structural and evolutionary information, plus a conditional diffusion model for iterative refinement.

Result: Empirical experiments demonstrate state-of-the-art performance in multiple antibody inverse folding and optimization tasks.

Conclusion: The method offers a new perspective on biomolecular generative models by effectively incorporating structural constraints and evolutionary information.

Abstract: Antibodies are essential proteins responsible for immune responses in organisms, capable of specifically recognizing antigen molecules of pathogens. Recent advances in generative models have significantly enhanced rational antibody design. However, existing methods mainly create antibodies from scratch without template constraints, leading to model optimization challenges and unnatural sequences. To address these issues, we propose a retrieval-augmented diffusion framework, termed RADAb, for efficient antibody design. Our method leverages a set of structural homologous motifs that align with query structural constraints to guide the generative model in inversely optimizing antibodies according to desired design criteria. Specifically, we introduce a structure-informed retrieval mechanism that integrates these exemplar motifs with the input backbone through a novel dual-branch denoising module, utilizing both structural and evolutionary information. Additionally, we develop a conditional diffusion model that iteratively refines the optimization process by incorporating both global context and local evolutionary conditions. Our approach is agnostic to the choice of generative models. Empirical experiments demonstrate that our method achieves state-of-the-art performance in multiple antibody inverse folding and optimization tasks, offering a new perspective on biomolecular generative models.

[179] Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search

Yuichi Inoue, Kou Misaki, Yuki Imajuku, So Kuroki, Taishi Nakamura, Takuya Akiba

Main category: cs.AI

TL;DR: AB-MCTS is an inference-time framework that combines LLM response diversity with multi-turn refinement using external feedback, outperforming repeated sampling and standard MCTS on complex coding tasks.

Details

Motivation: While repeated sampling improves LLM reasoning, it doesn't leverage available external feedback signals for refinement in tasks like coding.

Method: Adaptive Branching Monte Carlo Tree Search that dynamically decides between expanding new candidate responses (going wider) or revisiting existing ones (going deeper) based on external feedback.

Result: Consistently outperforms both repeated sampling and standard MCTS on complex coding and engineering tasks using frontier models.

Conclusion: Combining LLM response diversity with multi-turn solution refinement enables effective inference-time scaling for complex reasoning tasks.

Abstract: Recent advances demonstrate that increasing inference-time computation can significantly boost the reasoning capabilities of large language models (LLMs). Although repeated sampling (i.e., generating multiple candidate outputs) is a highly effective strategy, it does not leverage external feedback signals for refinement, which are often available in tasks like coding. In this work, we propose Adaptive Branching Monte Carlo Tree Search (AB-MCTS), a novel inference-time framework that generalizes repeated sampling with principled multi-turn exploration and exploitation. At each node in the search tree, AB-MCTS dynamically decides whether to “go wider” by expanding new candidate responses or “go deeper” by revisiting existing ones based on external feedback signals. We evaluate our method on complex coding and engineering tasks using frontier models. Empirical results show that AB-MCTS consistently outperforms both repeated sampling and standard MCTS, underscoring the importance of combining the response diversity of LLMs with multi-turn solution refinement for effective inference-time scaling. Code is available at https://github.com/SakanaAI/treequest.
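The wider-versus-deeper decision can be illustrated with a toy search loop. This is a minimal sketch, not the authors' algorithm or the TreeQuest API. It assumes feedback scores in [0, 1] and uses Thompson sampling over Beta posteriors, with one extra "generate a new child" arm per node. `generate` and `evaluate` are toy stand-ins for an LLM call and an external checker, and every name is hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Arm:
    """Beta posterior over the feedback score (assumed to lie in [0, 1])."""
    alpha: float = 1.0
    beta: float = 1.0

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)

    def update(self, score):
        self.alpha += score
        self.beta += 1.0 - score


@dataclass
class Node:
    answer: str
    score: float
    arm: Arm = field(default_factory=Arm)    # value of descending into this node
    widen: Arm = field(default_factory=Arm)  # value of generating a new child here
    children: list = field(default_factory=list)


def choose(node, rng):
    """Return None to go wider at `node`, or the child to go deeper into."""
    best, best_draw = None, node.widen.sample(rng)
    for child in node.children:
        draw = child.arm.sample(rng)
        if draw > best_draw:
            best, best_draw = child, draw
    return best


def search(generate, evaluate, budget, seed=0):
    """Spend `budget` generation calls and return the best-scoring node."""
    rng = random.Random(seed)
    root = Node(answer="", score=0.0)
    best = root
    for _ in range(budget):
        path, node = [root], root
        while True:  # descend until the sampler prefers a fresh candidate
            nxt = choose(node, rng)
            if nxt is None:
                break
            node = nxt
            path.append(node)
        answer = generate(node.answer, rng)
        child = Node(answer=answer, score=evaluate(answer))
        node.children.append(child)
        node.widen.update(child.score)      # credit the "wider" arm that was used
        for visited in path:
            visited.arm.update(child.score)  # credit every node on the path
        child.arm.update(child.score)
        if child.score > best.score:
            best = child
    return best


TARGET = "abc"


def generate(parent, rng):
    # Toy stand-in for an LLM refinement: extend the parent by one letter.
    return (parent + rng.choice("abc"))[: len(TARGET)]


def evaluate(answer):
    # Toy stand-in for external feedback such as a unit-test pass rate.
    return sum(a == t for a, t in zip(answer, TARGET)) / len(TARGET)


result = search(generate, evaluate, budget=30)
print(result.answer, result.score)
```

A node with no children always goes wider, so plain repeated sampling corresponds to never descending below the root.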

[180] AI Through the Human Lens: Investigating Cognitive Theories in Machine Psychology

Akash Kundu, Rishika Goswami

Main category: cs.AI

TL;DR: LLMs exhibit human-like cognitive patterns across four psychological frameworks: coherent narratives (TAT), framing bias susceptibility, Liberty/Oppression moral judgments (MFT), and rationalized self-contradictions (Cognitive Dissonance).

Details

Motivation: To investigate whether LLMs demonstrate human-like cognitive patterns using established psychological frameworks to understand AI behavior and its implications.

Method: Evaluated multiple proprietary and open-source LLMs using structured prompts and automated scoring across four psychological frameworks: TAT, Framing Bias, MFT, and Cognitive Dissonance.

Result: Models produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments focused on Liberty/Oppression, and demonstrate self-contradictions with extensive rationalization.

Conclusion: LLMs mirror human cognitive tendencies but these behaviors are shaped by training data and alignment methods, with implications for AI transparency, ethical deployment, and bridging cognitive psychology with AI safety.

Abstract: We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluated several proprietary and open-source models using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

[181] Introducing LongCat-Flash-Thinking: A Technical Report

Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chao Zhang, Chengcheng Han, Chenhui Yang, Chi Zhang, Chong Peng, Chuyu Zhang, Cong Chen, Fengcun Li, Gang Xu, Guoyuan Lin, Hao Jiang, Hao Liang, Haomin Fu, Haoxiang Ma, Hong Liu, Hongyan Hao, Hongyin Tang, Hongyu Zang, Hongzhi Ni, Hui Su, Jiahao Liu, Jiahuan Li, Jialin Liu, Jianfei Zhang, Jianhao Xu, Jianing Wang, Jiaqi Sun, Jiaqi Zhang, Jiarong Shi, Jiawei Yang, Jingang Wang, Jinrui Ding, Jun Kuang, Jun Xu, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Li Wei, Liang Shi, Lin Qiu, Lingbin Kong, Lingchuan Liu, Linsen Guo, Longfei An, Mai Xia, Meng Zhou, Mengshen Zhu, Peng Pei, Pengcheng Jia, Qi Gu, Qi Guo, Qiong Huang, Quan Chen, Quanchi Weng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shanglin Lei, Shuai Du, Shuaikang Liu, Shuang Zhou, Shuhao Hu, Siyu Xu, Songshan Gong, Tao Liang, Tianhao Hu, Wei He, Wei Shi, Wei Wang, Wei Wu, Wei Zhuo, Weifeng Tang, Wenjie Shi, Wenlong Zhu, Xi Su, Xiangcheng Liu, Xiangyu Xi, Xiangzhou Huang, Xiao Liu, Xiaochen Jiang, Xiaowei Shi, Xiaowen Shi, Xiaoyu Li, Xin Chen, Xinyue Zhao, Xuan Huang, Xuemiao Zhang, Xuezhi Cao, Xunliang Cai, Yajie Zhang, Yang Chen, Yang Liu, Yang Liu, Yang Zheng, Yaoming Wang, Yaqi Huo, Yerui Sun, Yifan Lu, Yiyang Li, Youshao Xiao, Yuanzhe Lei, Yuchen Xie, Yueqing Sun, Yufei Zhang, Yuhuai Wei, Yulei Qian, Yunke Zhao, Yuqing Ding, Yuwei Jiang, Zhaohua Yang, Zhengyu Chen, Zhijian Liu, Zhikang Xia, Zhongda Su, Ziran Li, Ziwen Wang, Ziyuan Zhuang, Zongyu Wang, Zunyuan Yang

Main category: cs.AI

TL;DR: LongCat-Flash-Thinking is a 560B-parameter open-source MoE reasoning model trained with a long-CoT cold start followed by large-scale RL. It achieves state-of-the-art performance among open-source models and cuts average token use on AIME-25 by 64.5% in agentic reasoning.

Details

Motivation: To develop an efficient large-scale reasoning model that combines specialized domain expertise with high computational efficiency, addressing the need for advanced reasoning capabilities in AI systems.

Method: Uses long Chain-of-Thought cold-start training followed by domain-parallel RL across STEM, Code, and Agentic domains, then fuses the resulting expert models into a single model. Training runs on DORA, an asynchronous-rollout RL framework that delivers a more than threefold speedup over synchronous methods.

Result: Achieves SOTA performance among open-source models on complex reasoning tasks, with 64.5% token reduction (from 19,653 to 6,965 tokens) on AIME-25 while maintaining accuracy.

Conclusion: LongCat-Flash-Thinking demonstrates efficient large-scale reasoning capabilities and is released to advance reasoning systems and agentic AI research.

Abstract: We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19,653 to 6,965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.
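As a quick check of the arithmetic, the reduction follows directly from the two average token counts quoted in the abstract. The helper name `token_reduction` is illustrative only. The computed value of about 64.56% is in line with the reported 64.5%, up to rounding of the published averages.

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when average usage drops from `before` to `after`."""
    return (before - after) / before


# Average token counts on AIME-25 as quoted in the abstract.
saved = token_reduction(19_653, 6_965)
print(f"{saved:.2%}")  # prints 64.56%
```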

[182] Bilinear relational structure fixes reversal curse and enables consistent model editing

Dong-Kyum Kim, Minsung Kim, Jea Kwon, Nakyeong Yang, Meeyoung Cha

Main category: cs.AI

TL;DR: The reversal curse in language models is not inherent but stems from knowledge encoding. Training on relational knowledge graphs induces bilinear representations that alleviate the curse and enable consistent model editing.

Details

Motivation: To challenge the view that the reversal curse is a fundamental limitation of language models and investigate whether proper knowledge representation can overcome it.

Method: Train language models from scratch on synthetic relational knowledge graphs and analyze the emergence of bilinear relational structure in hidden representations.

Result: Models with bilinear structure can infer unseen reverse facts and propagate edits consistently to logically dependent facts, while models lacking this structure fail to generalize edits and introduce inconsistencies.

Conclusion: The success of model editing depends on the underlying representational geometry, and training on relational knowledge induces bilinear representations that enable logically consistent behavior.

Abstract: The reversal curse – a language model’s (LM) inability to infer an unseen fact “B is A” from a learned fact “A is B” – is widely considered a fundamental limitation. We show that this is not an inherent failure but an artifact of how models encode knowledge. By training LMs from scratch on a synthetic dataset of relational knowledge graphs, we demonstrate that bilinear relational structure emerges in their hidden representations. This structure substantially alleviates the reversal curse, enabling LMs to infer unseen reverse facts. Crucially, we also find that this bilinear structure plays a key role in consistent model editing. When a fact is updated in an LM with this structure, the edit correctly propagates to its reverse and other logically dependent facts. In contrast, models lacking this representation not only suffer from the reversal curse but also fail to generalize edits, further introducing logical inconsistencies. Our results establish that training on a relational knowledge dataset induces the emergence of bilinear internal representations, which in turn enable LMs to behave in a logically consistent manner after editing. This implies that the success of model editing depends critically not just on editing algorithms but on the underlying representational geometry of the knowledge being modified.
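One way to see why a bilinear structure helps is a toy scoring function. This is an illustrative sketch, not the paper's probing method. It assumes a fact (a, r, b) is scored as a^T W_r b and that the inverse relation uses the transposed matrix, so that b^T W_r^T a equals a^T W_r b. The embeddings, the matrix, and the relation names are hypothetical.

```python
def bilinear(a, W, b):
    """Score a^T W b for vectors a, b and a matrix W given as nested lists."""
    return sum(a[i] * W[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def transpose(W):
    return [list(col) for col in zip(*W)]


# Hypothetical 2-d entity embeddings and one relation matrix ("parent_of").
alice, bob = [1.0, 0.0], [0.0, 1.0]
W_parent_of = [[0.0, 2.0], [0.0, 0.0]]


def scores(W):
    """Return (forward, reverse) scores for the same underlying fact."""
    forward = bilinear(alice, W, bob)             # "alice parent_of bob"
    reverse = bilinear(bob, transpose(W), alice)  # "bob child_of alice"
    return forward, reverse


print(scores(W_parent_of))                    # both directions agree
print(bilinear(bob, W_parent_of, alice))      # the wrong direction scores 0.0

# An "edit" to the shared parameters changes both directions together.
W_edited = [[0.0, 0.5], [0.0, 0.0]]
print(scores(W_edited))
```

In this toy setting the reverse fact is never stored separately, so it cannot drift out of sync with the forward fact after an edit.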

[183] Open Agent Specification (Agent Spec): A Unified Representation for AI Agents

Soufiane Amini, Yassine Benajiba, Cesare Bernardis, Paul Cayet, Hassan Chafi, Abderrahim Fathan, Louis Faucon, Damien Hilloulin, Sungpack Hong, Ingo Kossyk, Tran Minh Son Le, Rhicheek Patra, Sujith Ravi, Jonas Schweizer, Jyotika Singh, Shailender Singh, Weiyi Sun, Kartik Talamadupula, Jerry Xu

Main category: cs.AI

TL;DR: Open Agent Specification (Agent Spec) is a declarative language that standardizes AI agent definitions and workflows across different frameworks, enabling cross-framework compatibility, reusability, and consistent evaluation.

Details

Motivation: The proliferation of diverse agent frameworks has created fragmentation in how agents are defined, executed, and evaluated, making it difficult to share or reproduce workflows across different systems.

Method: Agent Spec defines a common set of components, control and data flow semantics, and schemas that allow agents to be defined once and executed across different runtimes. It includes a standardized evaluation harness and provides tools like Python SDK, reference runtime, and framework adapters.

Result: The system was demonstrated using four distinct runtimes (LangGraph, CrewAI, AutoGen, WayFlow) evaluated over three benchmarks (SimpleQA Verified, τ²-Bench, BIRD-SQL), showing consistent performance comparison across frameworks.

Conclusion: Agent Spec bridges the gap between model-centric and agent-centric standardization and evaluation, laying the groundwork for reliable, reusable, and portable agentic systems.

Abstract: The proliferation of agent frameworks has led to fragmentation in how agents are defined, executed, and evaluated. Existing systems differ in their abstractions, data flow semantics, and tool integrations, making it difficult to share or reproduce workflows. We introduce Open Agent Specification (Agent Spec), a declarative language that defines AI agents and agentic workflows in a way that is compatible across frameworks, promoting reusability, portability and interoperability of AI agents. Agent Spec defines a common set of components, control and data flow semantics, and schemas that allow an agent to be defined once and executed across different runtimes. Agent Spec also introduces a standardized Evaluation harness to assess agent behavior and agentic workflows across runtimes - analogous to how HELM and related harnesses standardized LLM evaluation - so that performance, robustness, and efficiency can be compared consistently across frameworks. We demonstrate this using four distinct runtimes (LangGraph, CrewAI, AutoGen, and WayFlow) evaluated over three different benchmarks (SimpleQA Verified, $\tau^2$-Bench and BIRD-SQL). We provide accompanying toolsets: a Python SDK (PyAgentSpec), a reference runtime (WayFlow), and adapters for popular frameworks (e.g., LangGraph, AutoGen, CrewAI). Agent Spec bridges the gap between model-centric and agent-centric standardization & evaluation, laying the groundwork for reliable, reusable, and portable agentic systems.

[184] Internal World Models as Imagination Networks in Cognitive Agents

Saurabh Ranjan, Brian Odegaard

Main category: cs.AI

TL;DR: This paper investigates imagination’s computational purpose, proposing it accesses internal world models (IWMs). Using psychological network analysis, it compares IWMs in humans vs LLMs, finding significant differences in network structure and centrality correlations.

Details

Motivation: To understand the computational objective of imagination and challenge classical views that it's primarily for reward maximization. The study aims to compare internal world models between humans and AI systems.

Method: Used psychological network analysis with imagination vividness ratings from questionnaires. Constructed imagination networks from human reports and compared them with LLM-generated networks under various prompts and memory conditions.

Result: Human imagination networks showed strong correlations between centrality measures, while LLM networks lacked clustering and had lower centrality correlations. This indicates fundamental differences in internal world model structure between humans and LLMs.

Conclusion: The study demonstrates a novel method for comparing internal representations and reveals that current LLMs lack human-like imagination capabilities, providing insights for developing more human-like AI imagination.

Abstract: What is the computational objective of imagination? While classical interpretations suggest imagination is useful for maximizing rewards, recent findings challenge this view. In this study, we propose that imagination serves to access an internal world model (IWM) and use psychological network analysis to explore IWMs in humans and large language models (LLMs). Specifically, we assessed imagination vividness ratings using two questionnaires and constructed imagination networks from these reports. Imagination networks from human groups showed correlations between different centrality measures, including expected influence, strength, and closeness. However, imagination networks from LLMs showed a lack of clustering and lower correlations between centrality measures under different prompts and conversational memory conditions. Together, these results indicate a lack of similarity between IWMs in human and LLM agents. Overall, our study offers a novel method for comparing internally-generated representations in humans and AI, providing insights for developing human-like imagination in artificial intelligence.
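
Two of the centrality measures named in the abstract can be illustrated on a small weighted network. The sketch below is not the authors' pipeline: the edge weights, the four-node network, and the function names are hypothetical. It assumes the usual psychological-network definitions, in which strength sums the absolute edge weights at a node and expected influence sums the signed weights, and then correlates the two measures across nodes.

```python
from math import sqrt

def strength(weights):
    """Sum of absolute edge weights attached to each node."""
    return [sum(abs(w) for j, w in enumerate(row) if j != i)
            for i, row in enumerate(weights)]

def expected_influence(weights):
    """Sum of signed edge weights attached to each node."""
    return [sum(w for j, w in enumerate(row) if j != i)
            for i, row in enumerate(weights)]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical symmetric edge-weight matrix for four questionnaire items.
W = [
    [0.0, 0.5, -0.2, 0.1],
    [0.5, 0.0, 0.3, 0.0],
    [-0.2, 0.3, 0.0, 0.4],
    [0.1, 0.0, 0.4, 0.0],
]

# Correlation between the two centrality measures across nodes.
centrality_correlation = pearson(strength(W), expected_influence(W))
```

A high correlation here would mean that nodes ranked central by one measure are also ranked central by the other, which is the kind of agreement the abstract reports for human groups.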

[185] HugAgent: Benchmarking LLMs for Simulation of Individualized Human Reasoning

Chance Jiajie Li, Zhenze Mo, Yuhan Tang, Ao Qu, Jiayi Wu, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Hang Jiang, Paul Pu Liang, Jinhua Zhao, Luis Alberto Alonso Pastor, Kent Larson

Main category: cs.AI

TL;DR: HugAgent is a benchmark for evaluating machine reasoning alignment with individual human thought, moving beyond population-level consensus to capture personalized reasoning styles and belief trajectories.

Details

Motivation: Current large language models only approximate human responses at scale but erase individual reasoning styles and belief trajectories, failing to simulate truly human-like reasoning.

Method: Dual-track design: human track automates think-aloud method for ecologically valid data collection, and synthetic track for scalability and stress testing. Evaluates model’s ability to predict specific individuals’ behavioral responses and reasoning dynamics in out-of-distribution scenarios.

Result: Experiments with state-of-the-art language models reveal persistent adaptation gaps, showing current models struggle to align with individual human reasoning patterns.

Conclusion: HugAgent is positioned as the first extensible benchmark for aligning machine reasoning with the individuality of human thought, and it provides tools for low-cost expansion to new tasks and populations.

Abstract: Simulating human reasoning in open-ended tasks has long been a central aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), which rethinks human reasoning simulation along three dimensions: (i) from averaged to individualized reasoning, (ii) from behavioral mimicry to cognitive alignment, and (iii) from vignette-based to open-ended data. The benchmark evaluates whether a model can predict a specific person’s behavioral responses and the underlying reasoning dynamics in out-of-distribution scenarios, given partial evidence of their prior views. HugAgent adopts a dual-track design: a human track that automates and scales the think-aloud method to collect ecologically valid human reasoning data, and a synthetic track for further scalability and systematic stress testing. This architecture enables low-cost, extensible expansion to new tasks and populations. Experiments with state-of-the-art language models reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. The benchmark, along with its complete data collection pipeline and companion chatbot, is open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).

[186] String Seed of Thought: Prompting LLMs for Distribution-Faithful and Diverse Generation

Kou Misaki, Takuya Akiba

Main category: cs.AI

TL;DR: SSoT is a novel prompting method that improves LLMs’ ability to follow probabilistic instructions by having them generate random strings first to create entropy, then extract randomness from these strings to produce diverse answers that match target probability distributions.

Details

Motivation: LLMs struggle with Probabilistic Instruction Following (PIF) - selecting answers from predefined options with specific probabilities. This causes biases in applications requiring non-deterministic behaviors like human-behavior simulation, content diversification, and multiplayer games, and reduces response diversity in test-time scaling.

Method: String Seed of Thought (SSoT) prompts LLMs to first output a random string to generate entropy, then manipulate this string to extract randomness and derive final answers, preserving diversity while adhering to probability constraints.

Result: SSoT significantly improves PIF performance, approaching ideal pseudo-random number generator performance. Experiments on NoveltyBench show SSoT also enhances response diversity in open-ended tasks beyond closed-set tasks.

Conclusion: SSoT effectively addresses LLMs’ limitations in probabilistic instruction following by leveraging string-based randomness generation, improving both closed-set PIF performance and open-ended task diversity.

Abstract: We introduce String Seed of Thought (SSoT), a novel prompting method for LLMs that improves Probabilistic Instruction Following (PIF). We define PIF as a task requiring an LLM to select its answer from a predefined set of options, each associated with a specific probability, such that the empirical distribution of the generated answers aligns with the target distribution when prompted multiple times. While LLMs excel at tasks with single, deterministic answers, they often fail at PIF, exhibiting biases problematic for applications requiring non-deterministic behaviors, such as human-behavior simulation, content diversification, and multiplayer games. It also harms the diversity of generated responses, a crucial factor in test-time scaling, by causing the outputs to collapse into a limited set of answers. To address this, we propose SSoT, a simple prompting method that instructs an LLM to first output a random string to generate sufficient entropy. SSoT also instructs the LLM to extract randomness by manipulating this string to derive a final answer, thereby preserving diversity while adhering to specific constraints. We demonstrate that SSoT significantly improves the PIF performance of LLMs, approaching the ideal performance of a pseudo-random number generator. Furthermore, our experiments on NoveltyBench show SSoT’s benefits extend beyond closed-set tasks to open-ended tasks by enhancing response diversity.
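
The extraction step can be pictured with a small sketch. In SSoT the model itself produces the random string and manipulates it in its own reasoning; the code below only illustrates the general idea of turning a string into a sample from a target distribution. The SHA-256 hashing, the function name `pick_from_string`, and the example options are assumptions made for illustration and do not come from the paper.

```python
import hashlib

def pick_from_string(seed_string, options, probs):
    """Map a string to one option so that varied strings follow `probs`."""
    if len(options) != len(probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("options and probs must align, and probs must sum to 1")
    # Hash the string and read the first 8 bytes as a number in [0, 1).
    digest = hashlib.sha256(seed_string.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64
    # Walk the cumulative distribution until it exceeds u.
    cumulative = 0.0
    for option, p in zip(options, probs):
        cumulative += p
        if u < cumulative:
            return option
    return options[-1]  # guard against floating-point rounding

# Hypothetical usage: many different strings, one target distribution.
picks = [pick_from_string(f"string-{i}", ["heads", "tails"], [0.7, 0.3])
         for i in range(5000)]
heads_fraction = picks.count("heads") / len(picks)
```

The same string always maps to the same option, so all diversity comes from the string; the quality of the resulting distribution therefore depends on how varied the generated strings are.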

[187] How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations

Zora Zhiruo Wang, Yijia Shao, Omar Shaikh, Daniel Fried, Graham Neubig, Diyi Yang

Main category: cs.AI

TL;DR: A direct comparison of human and AI agent workers across multiple work skills finds that agents are faster and cheaper but produce lower-quality work, and that they take programmatic approaches where humans typically use UI-centric methods.

Details

Motivation: To understand how AI agents perform human work compared to humans, revealing agent capabilities and their potential roles in workflows.

Method: Introduced a scalable toolkit to induce interpretable, structured workflows from computer-use activities, comparing humans and agents across data analysis, engineering, computation, writing, and design tasks.

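Result: Agents deliver results 88.3% faster and cost 90.4-96.2% less than humans, but their work is of inferior quality. They take an overwhelmingly programmatic approach, in contrast to the UI-centric methods humans typically use, and often mask deficiencies through data fabrication and misuse of advanced tools.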

Conclusion: Agents show promise for efficient collaboration by handling programmable tasks, but need improvement in quality and approach to match human workflows.

Abstract: AI agents are continually optimized for tasks related to human work, such as software engineering and professional writing, signaling a pressing trend with significant impacts on the human workforce. However, these agent developments have often not been grounded in a clear understanding of how humans execute work, to reveal what expertise agents possess and the roles they can play in diverse workflows. In this work, we study how agents do human work by presenting the first direct comparison of human and agent workers across multiple essential work-related skills: data analysis, engineering, computation, writing, and design. To better understand and compare heterogeneous computer-use activities of workers, we introduce a scalable toolkit to induce interpretable, structured workflows from either human or agent computer-use activities. Using such induced workflows, we compare how humans and agents perform the same tasks and find that: (1) While agents exhibit promise in their alignment to human workflows, they take an overwhelmingly programmatic approach across all work domains, even for open-ended, visually dependent tasks like design, creating a contrast with the UI-centric methods typically used by humans. (2) Agents produce work of inferior quality, yet often mask their deficiencies via data fabrication and misuse of advanced tools. (3) Nonetheless, agents deliver results 88.3% faster and cost 90.4-96.2% less than humans, highlighting the potential for enabling efficient collaboration by delegating easily programmable tasks to agents.

[188] From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems

Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Dan Pei

Main category: cs.AI

TL;DR: OpsAgent is a lightweight, self-evolving multi-agent system for automated incident management in cloud systems that converts heterogeneous observability data into structured text and uses transparent multi-agent collaboration for diagnostics.

Details

Motivation: Manual incident management is labor-intensive and error-prone with massive observability data, while existing automated approaches struggle with generalization, interpretability, and high deployment costs.

Method: Uses training-free data processor to convert observability data into structured textual descriptions, multi-agent collaboration framework for transparent diagnostics, and dual self-evolution mechanism for continual capability growth.

Result: State-of-the-art performance on OPENRCA benchmark, demonstrating generalizability, interpretability, cost-efficiency, and self-evolution capabilities.

Conclusion: OpsAgent is a practically deployable and sustainable solution for long-term operation in real-world cloud systems, addressing key limitations of existing approaches.

Abstract: Incident management (IM) is central to the reliability of large-scale cloud systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world cloud systems.

[189] Inverse Knowledge Search over Verifiable Reasoning: Synthesizing a Scientific Encyclopedia from a Long Chains-of-Thought Knowledge Base

Yu Li, Yuan Huang, Tao Wang, Caiyu Fan, Xiansheng Cai, Sihan Hu, Xinzijian Liu, Cheng Shi, Mingjun Xu, Zhen Wang, Yan Wang, Xiangqi Jin, Tianhan Zhang, Linfeng Zhang, Lei Wang, Youjin Deng, Pan Zhang, Weijie Sun, Xingyu Li, Weinan E, Linfeng Zhang, Zhiyuan Yao, Kun Chen

Main category: cs.AI

TL;DR: A framework that decompresses scientific reasoning by building a verifiable Long Chain-of-Thought (LCoT) knowledge base and projecting it into an encyclopedia, SciencePedia, through automated generation, verification, and synthesis pipelines.

Details

Motivation: Scientific materials often compress reasoning by presenting conclusions while omitting derivational chains, which hinders verification and inhibits cross-domain connections between concepts.

Method: A Socratic agent, guided by a curriculum of around 200 courses, generates approximately 3 million first-principles questions. Multiple independent solver models produce LCoTs, which are filtered by prompt sanitization and cross-model answer consensus. The Brainstorm Search Engine then performs inverse knowledge search over the verified corpus, and the Plato synthesizer narrates the retrieved chains into articles.

Result: Created SciencePedia with approximately 200,000 entries across six disciplines. Plato-synthesized articles show higher knowledge-point density and lower factual error rates than an equally prompted baseline without retrieval, as judged by an external LLM.

Conclusion: The reasoning-centric approach establishes foundation for ever-expanding encyclopedia and enables verifiable, cross-domain scientific knowledge synthesis.

Abstract: Most scientific materials compress reasoning, presenting conclusions while omitting the derivational chains that justify them. This compression hinders verification by lacking explicit, step-wise justifications and inhibits cross-domain links by collapsing the very pathways that establish the logical and causal connections between concepts. We introduce a scalable framework that decompresses scientific reasoning, constructing a verifiable Long Chain-of-Thought (LCoT) knowledge base and projecting it into an emergent encyclopedia, SciencePedia. Our pipeline operationalizes an endpoint-driven, reductionist strategy: a Socratic agent, guided by a curriculum of around 200 courses, generates approximately 3 million first-principles questions. To ensure high fidelity, multiple independent solver models generate LCoTs, which are then rigorously filtered by prompt sanitization and cross-model answer consensus, retaining only those with verifiable endpoints. This verified corpus powers the Brainstorm Search Engine, which performs inverse knowledge search – retrieving diverse, first-principles derivations that culminate in a target concept. This engine, in turn, feeds the Plato synthesizer, which narrates these verified chains into coherent articles. The initial SciencePedia comprises approximately 200,000 fine-grained entries spanning mathematics, physics, chemistry, biology, engineering, and computation. In evaluations across six disciplines, Plato-synthesized articles (conditioned on retrieved LCoTs) exhibit substantially higher knowledge-point density and significantly lower factual error rates than an equally-prompted baseline without retrieval (as judged by an external LLM). Built on this verifiable LCoT knowledge base, this reasoning-centric approach enables trustworthy, cross-domain scientific synthesis at scale and establishes the foundation for an ever-expanding encyclopedia.

[190] Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning

Mohamed Bouadi, Pratinav Seth, Aditya Tanna, Vinay Kumar Sankarapu

Main category: cs.AI

TL;DR: Orion-MSP is a new tabular in-context learning architecture that addresses limitations of existing methods through multi-scale processing, block-sparse attention, and cross-component communication, achieving state-of-the-art performance while scaling efficiently to high-dimensional tables.

Details

Motivation: Current tabular ICL architectures have limitations including single-scale feature processing, quadratic attention scaling, and sequential component processing that prevents iterative refinement and cross-component communication.

Method: Orion-MSP introduces three key innovations: (1) multi-scale processing for hierarchical feature interactions, (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency, and (3) Perceiver-style memory for bidirectional information flow across components.

Result: Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables.

Conclusion: Orion-MSP establishes a new standard for efficient tabular in-context learning and is publicly available as an open-source implementation.

Abstract: Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-MSP .
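
A block-sparse pattern that combines windowed, global, and random connections can be sketched as a boolean mask over blocks. This is only an illustration of the general pattern named in the abstract: the function name, the parameters (`window`, `n_global`, `n_random`), and the choice of the first blocks as global ones are assumptions, not the Orion-MSP implementation.

```python
import random

def block_sparse_mask(n_blocks, window=1, n_global=1, n_random=1, seed=0):
    """Return an n_blocks x n_blocks boolean mask; True means block i attends to block j."""
    rng = random.Random(seed)
    mask = [[False] * n_blocks for _ in range(n_blocks)]
    for i in range(n_blocks):
        for j in range(n_blocks):
            # Windowed pattern: each block sees its local neighbourhood.
            if abs(i - j) <= window:
                mask[i][j] = True
            # Global pattern: the first n_global blocks see, and are seen by, all blocks.
            if i < n_global or j < n_global:
                mask[i][j] = True
        # Random pattern: a few extra links per row for long-range connectivity.
        candidates = [j for j in range(n_blocks) if not mask[i][j]]
        for j in rng.sample(candidates, min(n_random, len(candidates))):
            mask[i][j] = True
    return mask

mask = block_sparse_mask(16)
active = sum(cell for row in mask for cell in row)
density = active / (16 * 16)  # fraction of block pairs that are computed
```

Because each non-global row keeps only a handful of active blocks, the number of computed block pairs grows roughly linearly with the number of blocks rather than quadratically, which is the motivation for sparse attention on wide tables.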

[191] A Proprietary Model-Based Safety Response Framework for AI Agents

Qi Li, Jianjun Xu, Pingtao Wei, Jiu Li, Peiqiang Zhao, Jiwei Shi, Xuan Zhang, Yanhui Yang, Xiaodong Hui, Peng Xu, Wenqin Shao

Main category: cs.AI

TL;DR: A novel safety response framework for LLMs that protects at both input and output levels, achieving 99.3% risk recall and perfect safety scores on high-risk tests.

Details

Motivation: Security issues in LLMs constrain their trustworthy deployment in critical domains, requiring systematic protection mechanisms.

Method: Input-level: supervised fine-tuning with 4-tier safety classification (Safe, Unsafe, Conditionally Safe, Focused Attention). Output-level: RAG integration with fine-tuned interpretation model for knowledge-grounded responses.

Result: 99.3% risk recall rate, significantly higher safety scores than baseline, and 100% safety score on proprietary high-risk test set.

Conclusion: Provides effective engineering pathway for building high-security, high-trust LLM applications with systematic protection capabilities.

Abstract: With the widespread application of Large Language Models (LLMs), their associated security issues have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to systematically safeguard LLMs at both the input and output levels. At the input level, the framework employs a supervised fine-tuning-based safety classification model. Through a fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention), it performs precise risk identification and differentiated handling of user queries, significantly enhancing risk coverage and business scenario adaptability, and achieving a risk recall rate of 99.3%. At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned interpretation model, ensuring all responses are grounded in a real-time, trustworthy knowledge base. This approach eliminates information fabrication and enables result traceability. Experimental results demonstrate that our proposed safety control model achieves a significantly higher safety score on public safety evaluation benchmarks compared to the baseline model, TinyR1-Safety-8B. Furthermore, on our proprietary high-risk test set, the framework’s components attained a perfect 100% safety score, validating their exceptional protective capabilities in complex risk scenarios. This research provides an effective engineering pathway for building high-security, high-trust LLM applications.

[192] Monitor-Generate-Verify (MGV): Formalising Metacognitive Theory for Language Model Reasoning

Nick Oh, Fernand Gobet

Main category: cs.AI

TL;DR: The paper proposes a Monitor-Generate-Verify (MGV) framework that extends existing Generate-Verify paradigms by adding explicit monitoring processes based on metacognitive theories, aiming to address the prefix dominance trap where models commit early to suboptimal reasoning paths.

Details

Motivation: Current test-time reasoning architectures exclude monitoring processes that determine when and how reasoning should begin. This omission may contribute to the prefix dominance trap, in which models commit early to suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy loss.

Method: Formalizing Flavell’s and Nelson and Narens’ metacognitive theories into computational specifications to create the MGV framework, which adds explicit monitoring that captures metacognitive experiences before generation and refines monitoring through verification feedback.

Result: No empirical validation presented, but the work provides the first systematic computational translation of foundational metacognitive theories for reasoning systems.

Conclusion: The MGV framework offers a principled vocabulary for understanding reasoning system failures and suggests specific architectural interventions for future test-time reasoning designs by incorporating metacognitive monitoring processes.

Abstract: Test-time reasoning architectures such as those following the Generate-Verify paradigm – where a model iteratively refines or verifies its own generated outputs – prioritise generation and verification but exclude the monitoring processes that determine when and how reasoning should begin. This omission may contribute to the prefix dominance trap, in which models commit early to suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy loss. We address this architectural gap by formalising Flavell’s and Nelson and Narens’ metacognitive theories into computational specifications, proposing the Monitor-Generate-Verify (MGV) framework. MGV extends the Generate-Verify paradigm by adding explicit monitoring that captures metacognitive experiences (from difficulty assessments to confidence judgements) before generation begins and refines future monitoring through verification feedback. Though we present no empirical validation, this work provides the first systematic computational translation of foundational metacognitive theories, offering a principled vocabulary for understanding reasoning system failures and suggesting specific architectural interventions for future test-time reasoning designs.

cs.SD

[193] A Penny for Your Thoughts: Decoding Speech from Inexpensive Brain Signals

Quentin Auster, Kateryna Shapovalenko, Chuang Ma, Demaio Sun

Main category: cs.SD

TL;DR: The paper maps EEG recordings to speech representations with a contrastive CLIP loss and tests three modifications to a state-of-the-art EEG decoder. Subject-specific attention and personalized spatial attention improved performance, while a dual-path RNN with attention did not.

Details

Motivation: To explore whether neural networks can decode brain activity into speech, enabling brain-computer interfaces by mapping EEG recordings to audio representations.

Method: Used EEG data from subjects listening to natural speech, trained with contrastive CLIP loss to align EEG embeddings with pre-trained transformer speech model embeddings. Introduced three architectural modifications: subject-specific attention layers, personalized spatial attention, and dual-path RNN with attention.

Result: Two of three modifications improved performance: subject-specific attention (+0.15% WER improvement), personalized spatial attention (+0.45%), and dual-path RNN with attention (-1.87%).

Conclusion: Personalized architectures show promise for brain-to-speech decoding and brain-computer interface applications, with specific attention mechanisms providing performance improvements.

Abstract: We explore whether neural networks can decode brain activity into speech by mapping EEG recordings to audio representations. Using EEG data recorded as subjects listened to natural speech, we train a model with a contrastive CLIP loss to align EEG-derived embeddings with embeddings from a pre-trained transformer-based speech model. Building on the state-of-the-art EEG decoder from Meta, we introduce three architectural modifications: (i) subject-specific attention layers (+0.15% WER improvement), (ii) personalized spatial attention (+0.45%), and (iii) a dual-path RNN with attention (-1.87%). Two of the three modifications improved performance, highlighting the promise of personalized architectures for brain-to-speech decoding and applications in brain-computer interfaces.
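
The contrastive alignment step can be sketched in a few lines. The code assumes the symmetric form of the CLIP objective (cross-entropy over a cosine-similarity matrix in both directions) and a temperature of 0.07; the paper may use a different variant, and the function names and toy embeddings are hypothetical.

```python
import math

def _normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_loss(eeg_emb, speech_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    eeg = [_normalize(v) for v in eeg_emb]
    speech = [_normalize(v) for v in speech_emb]
    # Cosine similarity between every EEG segment and every speech segment.
    logits = [[sum(a * b for a, b in zip(e, s)) / temperature for s in speech]
              for e in eeg]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

# Toy batch: matched pairs should give a much lower loss than mismatched pairs.
eeg_batch = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_loss(eeg_batch, [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_loss(eeg_batch, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low only when each EEG embedding is closer to its own speech segment than to the other segments in the batch, which is what lets the trained model retrieve the matching audio for a new EEG segment.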

[194] EMO100DB: An Open Dataset of Improvised Songs with Emotion Data

Daeun Hwang, Saebyul Park

Main category: cs.SD

TL;DR: Emo100DB is a dataset of improvised songs with emotion annotations based on Russell’s circumplex model, containing lyrics, MIDI melodies, and audio recordings organized by emotional quadrants.

Details

Motivation: To create a comprehensive dataset that enables diverse exploration of the relationship between music and emotion through integrated composition data and analysis.

Method: Collected improvised songs (melody, lyrics, instrumental accompaniment) from 20 young adults, with participants reporting emotional states using arousal and valence axes before recording. Dataset organized into four emotion quadrants.

Result: Developed Emo100DB dataset containing lyrics text, MIDI melody files, original WAV audio recordings, and emotion annotations based on Russell’s circumplex model.

Conclusion: The study provides a valuable dataset that integrates multiple data types to facilitate comprehensive analysis of music-emotion relationships.

Abstract: In this study, we introduce Emo100DB: a dataset consisting of improvised songs that were recorded and transcribed with emotion data based on Russell’s circumplex model of emotion. The dataset was developed by collecting improvised songs that consist of melody, lyrics, and an instrumental accompaniment played, sung, and recorded by 20 young adults. Before recording each song, the participants were asked to report their emotional state, with the axes representing arousal and valence based on Russell’s circumplex model of emotions. The dataset is organized into four emotion quadrants, and it includes the lyrics text and MIDI file of the melody extracted from the participant recordings, along with the original audio in WAV format. By providing an integrated composition of data and analysis, this study aims to offer a comprehensive dataset that allows for a diverse exploration of the relationship between music and emotion.

[195] MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages

Hardik B. Sailor, Aw Ai Ti, Chen Fang Yih Nancy, Chiu Ying Lay, Ding Yang, He Yingxu, Jiang Ridong, Li Jingtao, Liao Jingyi, Liu Zhuohan, Lu Yanfeng, Ma Yi, Manas Gupta, Muhammad Huzaifah Bin Md Shahrin, Nabilah Binte Md Johan, Nattadaporn Lertcheva, Pan Chunlei, Pham Minh Duc, Siti Maryam Binte Ahmad Subaidi, Siti Umairah Binte Mohammad Salleh, Sun Shuo, Tarun Kumar Vangani, Wang Qiongqiong, Won Cheng Yi Lewis, Wong Heng Meng Jeremy, Wu Jinyang, Zhang Huayun, Zhang Longyin, Zou Xunlong

Main category: cs.SD

TL;DR: MERaLiON-SER is a multilingual speech emotion recognition model that achieves state-of-the-art performance across English and Southeast Asian languages using hybrid discrete and dimensional emotion modeling.

Details

Motivation: To create a robust speech emotion recognition system that works across multiple languages, particularly focusing on English and Southeast Asian languages, and to bridge the gap between discrete emotion categories and fine-grained dimensional emotion analysis.

Method: Uses a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modeling, capturing both emotion categories (happy, angry) and fine-grained dimensions (arousal, valence, dominance).

Result: Extensive evaluations show MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs across multilingual Singaporean languages (English, Chinese, Malay, Tamil) and other public benchmarks.

Conclusion: The model demonstrates the importance of specialized speech-only models for accurate paralinguistic understanding and cross-lingual generalization, providing a foundation for integrating emotion-aware perception into future agentic audio systems.

Abstract: We present MERaLiON-SER, a robust speech emotion recognition model designed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modelling. This dual approach enables the model to capture both the distinct categories of emotion (like happy or angry) and the fine-grained dimensions, such as arousal (intensity), valence (positivity/negativity), and dominance (sense of control), leading to a more comprehensive and robust representation of human affect. Extensive evaluations across multilingual Singaporean languages (English, Chinese, Malay, and Tamil) and other public benchmarks show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs. These results underscore the importance of specialised speech-only models for accurate paralinguistic understanding and cross-lingual generalisation. Furthermore, the proposed framework provides a foundation for integrating emotion-aware perception into future agentic audio systems, enabling more empathetic and contextually adaptive multimodal reasoning.
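
A hybrid objective of this kind can be sketched as follows. The CCC formula is the standard one, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), and the CCC loss is taken as 1 − CCC. The mixing weight `alpha`, the averaging of the weighted cross-entropy over samples, and all function names are assumptions for illustration; the abstract does not say how the two terms are weighted.

```python
import math

def ccc(pred, target):
    """Concordance Correlation Coefficient between two equal-length lists."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    vp = sum((p - mp) ** 2 for p in pred) / n
    vt = sum((t - mt) ** 2 for t in target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target)) / n
    return 2 * cov / (vp + vt + (mp - mt) ** 2)

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean over samples of -w[y] * log p[y]; probs are predicted class probabilities."""
    losses = [-class_weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

def hybrid_loss(probs, labels, class_weights, dim_pred, dim_target, alpha=0.5):
    """Combine the categorical and dimensional terms (alpha is a hypothetical weight)."""
    ce = weighted_cross_entropy(probs, labels, class_weights)
    ccc_loss = 1.0 - ccc(dim_pred, dim_target)
    return alpha * ce + (1.0 - alpha) * ccc_loss

# Toy example: two emotion classes and one dimensional target (e.g. arousal).
loss = hybrid_loss(
    probs=[[0.8, 0.2], [0.3, 0.7]],
    labels=[0, 1],
    class_weights=[1.0, 2.0],
    dim_pred=[0.2, 0.6],
    dim_target=[0.1, 0.9],
)
```

Unlike plain correlation, CCC also penalises a shift in mean or scale between predictions and targets, which is why it is a common choice for continuous emotion dimensions.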

[196] Passive Acoustic Monitoring of Noisy Coral Reefs

Hari Vishnu, Yuen Min Too, Mandar Chitre, Danwei Huang, Teong Beng Koay, Sudhanshi S. Jain

Main category: cs.SD

TL;DR: Passive acoustic monitoring of coral reefs using underwater recorders and CNN denoising reveals correlations between acoustic indices and reef health parameters, with shrimp snap rate showing robust temporal and spatial correlations.

Details

Motivation: To explore passive acoustic monitoring as a method for long-term, spatially extensive assessments of coral reef health, overcoming persistent biological noise that masks low-frequency reef soundscapes.

Method: Deployed underwater acoustic recorders at 10 coral reef sites in Singapore waters over 2 years, trained a convolutional neural network denoiser to mitigate biological noise, and analyzed acoustic data including sound pressure level, acoustic complexity index, and shrimp snap rate.

Result: Denoised data showed correlations between acoustic activity indices and diver-based reef health assessments (live coral richness, cover, and algal cover). Shrimp snap rate from high-frequency band was robustly correlated with reef parameters both temporally and spatially. Distinct morning and evening choruses were identified.

Conclusion: Passive acoustics contains valuable information for reef monitoring when data is effectively denoised and interpreted. The methodology can be extended to other marine environments hindered by persistent noise.

Abstract: Passive acoustic monitoring offers the potential to enable long-term, spatially extensive assessments of coral reefs. To explore this approach, we deployed underwater acoustic recorders at ten coral reef sites around Singapore waters over two years. To mitigate the persistent biological noise masking the low-frequency reef soundscape, we trained a convolutional neural network denoiser. Analysis of the acoustic data reveals distinct morning and evening choruses. Though the correlation with environmental variates was obscured in the low-frequency part of the noisy recordings, the denoised data showed correlations of acoustic activity indices such as sound pressure level and acoustic complexity index with diver-based assessments of reef health such as live coral richness and cover, and algal cover. Furthermore, the shrimp snap rate, computed from the high-frequency acoustic band, is robustly correlated with the reef parameters, both temporally and spatially. This study demonstrates that passive acoustics holds valuable information that can help with reef monitoring, provided the data is effectively denoised and interpreted. This methodology can be extended to other marine environments where acoustic monitoring is hindered by persistent noise.
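Two of the quantities mentioned above can be sketched in a few lines. Sound pressure level follows the standard definition 20·log10(p_rms / p_ref), with the underwater reference pressure of 1 µPa. The snap counter is only a stand-in: it uses a plain amplitude threshold with a dead time, and the names `snap_rate`, `threshold_upa`, and `dead_time_s` are hypothetical. The abstract does not describe the detector the authors actually used, nor the band-pass filtering that would precede it.

```python
import math

P_REF_UPA = 1.0  # underwater reference pressure: 1 micropascal

def spl_db(pressure_upa):
    """Sound pressure level in dB re 1 uPa from pressure samples in uPa."""
    rms = math.sqrt(sum(p * p for p in pressure_upa) / len(pressure_upa))
    return 20.0 * math.log10(rms / P_REF_UPA)

def snap_rate(pressure_upa, sample_rate_hz, threshold_upa, dead_time_s=0.001):
    """Snaps per second via simple threshold crossing (illustrative only).

    After each detection, samples within the dead time are skipped so that
    one impulsive snap is not counted more than once.
    """
    dead_samples = max(int(dead_time_s * sample_rate_hz), 1)
    n = len(pressure_upa)
    count = 0
    i = 0
    while i < n:
        if abs(pressure_upa[i]) >= threshold_upa:
            count += 1
            i += dead_samples
        else:
            i += 1
    return count / (n / sample_rate_hz)
```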

[197] Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders

Mathias Rose Bjare, Giorgia Cantisani, Marco Pasini, Stefan Lattner, Gerhard Widmer

Main category: cs.SD

TL;DR: Training autoencoders with noised encodings and perceptual losses creates hierarchical representations where perceptually important information is captured in coarser structures, improving latent diffusion decoding for music analysis and EEG prediction.

Details

Motivation: To develop autoencoders that produce encodings structured according to perceptual hierarchy, where perceptually salient information is captured in coarser representations than with conventional training.

Method: Train autoencoders to reconstruct inputs from noised versions of their encodings, combined with perceptual losses.

Result: The approach yields hierarchical encodings where perceptually salient information is captured in coarser structures, and improves latent diffusion decoding for estimating music pitch surprisal and predicting EEG responses to music.

Conclusion: Training autoencoders with noised encodings and perceptual losses successfully creates perceptual hierarchies in representations, enhancing performance in music analysis and brain response prediction tasks.

Abstract: We argue that training autoencoders to reconstruct inputs from noised versions of their encodings, when combined with perceptual losses, yields encodings that are structured according to a perceptual hierarchy. We demonstrate the emergence of this hierarchical structure by showing that, after training an audio autoencoder in this manner, perceptually salient information is captured in coarser representation structures than with conventional training. Furthermore, we show that such perceptual hierarchies improve latent diffusion decoding in the context of estimating surprisal in music pitches and predicting EEG-brain responses to music listening. Pretrained weights are available on github.com/CPJKU/pa-audioic.

[198] Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

Rui Zhou, Akinori Ito, Takashi Nose

Main category: cs.SD

TL;DR: This paper proposes a self-supervised pretraining method for speech-to-speech translation that enhances speaker information preservation while maintaining translation quality and inference speed.

Details

Motivation: Traditional Speech-to-Discrete Unit Translation (S2UT) methods fail to retain speaker-specific characteristics from the source speech, as discrete units primarily capture content information only.

Method: Building on SC-S2UT framework, the authors introduce self-supervised pretraining to enrich information extraction by both speaker adapter and unit-to-mel structure, and investigate different feature fusion strategies for better integration of speaker and content features.

Result: On CVSS-T dataset for ES-EN and FR-EN tasks, the method achieves 1.14 BLEU score improvement over SC-S2UT, with significant enhancements in MOS and speaker similarity, while maintaining comparable translation quality to traditional S2UT with only 0.04s per utterance increase in inference time.

Conclusion: The proposed self-supervised pretraining method effectively preserves speaker characteristics in speech-to-speech translation while maintaining high translation quality and acceptable inference speed.

Abstract: Speech-to-Speech Translation (S2ST) refers to the conversion of speech in one language into semantically equivalent speech in another language, facilitating communication between speakers of different languages. Speech-to-Discrete Unit Translation (S2UT), a mainstream approach for end-to-end S2ST, addresses challenges such as error propagation across modules and slow inference speed often encountered in traditional cascade systems. However, as discrete units primarily capture content information, conventional S2UT methods fail to retain speaker-specific characteristics from the source. Our previous work, SC-S2UT, introduced a speaker adapter and a unit-to-mel structure, enabling the preservation of speaker information and non-autoregressive speech generation. Building on this foundation, this study proposes a self-supervised pretraining method to enrich the information extracted by both the speaker adapter and the unit-to-mel structure. Additionally, we investigate different feature fusion strategies to further improve the integration of speaker and content features. Experiments conducted on the CVSS-T dataset for ES-EN and FR-EN tasks demonstrate that our proposed method achieves a BLEU score improvement of 1.14 compared to SC-S2UT, along with significant enhancements in MOS and speaker similarity. Furthermore, our approach achieves translation quality comparable to traditional S2UT, with only a minimal increase of 0.04s per utterance in inference time, while maintaining high speaker similarity. These results validate the effectiveness of the proposed method.

[199] Robust Neural Audio Fingerprinting using Music Foundation Models

Shubhr Singh, Kiran Bhat, Xavier Riley, Benjamin Resnick, John Thickstun, Walter De Brouwer

Main category: cs.SD

TL;DR: New neural audio fingerprinting using music foundation models and extensive data augmentation outperforms state-of-the-art methods in robustness against audio manipulations.

Details

Motivation: The proliferation of distorted, compressed, and manipulated music on platforms like TikTok requires more robust audio fingerprinting techniques to identify music sources.

Method: Uses pretrained music foundation models (MuQ, MERT) as backbone architecture and expands data augmentation to train under various audio manipulations including time stretching, pitch modulation, compression, and filtering.

Result: Fingerprints from music foundation models consistently outperform models trained from scratch or pretrained on non-musical audio, with segment-level evaluation showing accurate fingerprint match localization.

Conclusion: Music foundation models significantly improve neural audio fingerprinting robustness and practical utility for catalog management applications.

Abstract: The proliferation of distorted, compressed, and manipulated music on modern media platforms like TikTok motivates the development of more robust audio fingerprinting techniques to identify the sources of musical recordings. In this paper, we develop and evaluate new neural audio fingerprinting techniques with the aim of improving their robustness. We make two contributions to neural fingerprinting methodology: (1) we use a pretrained music foundation model as the backbone of the neural architecture and (2) we expand the use of data augmentation to train fingerprinting models under a wide variety of audio manipulations, including time stretching, pitch modulation, compression, and filtering. We systematically evaluate our methods in comparison to two state-of-the-art neural fingerprinting models: NAFP and GraFPrint. Results show that fingerprints extracted with music foundation models (e.g., MuQ, MERT) consistently outperform models trained from scratch or pretrained on non-musical audio. Segment-level evaluation further reveals their capability to accurately localize fingerprint matches, an important practical feature for catalog management.

cs.LG

[200] Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity

Pratik Poudel

Main category: cs.LG

TL;DR: KV cache management in LLMs must preserve positional encoding integrity to avoid performance degradation, as eviction strategies that disrupt positional coherence worsen generation quality more than cache size alone.

Details

Motivation: Address the challenges of unbounded KV cache growth in stateful multi-turn LLM scenarios and examine how cache management strategies interact with model architectural limits and positional encoding integrity.

Method: Empirical analysis using a stateful benchmarking framework to test KV cache management strategies, focusing on how eviction approaches affect positional coherence and generation quality when cache approaches model context limits.

Result: LLM generation quality degrades sharply when accumulated KV cache exceeds trained context window; common eviction strategies worsen performance by disrupting positional coherence; simple strategies preserving contiguous context blocks yield more coherent generations.

Conclusion: KV cache eviction techniques should respect architectural limits, preserve positional structure, and consider “cache health” holistically beyond mere size to maintain generation quality.

Abstract: The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs), yet its unbounded growth in stateful multi-turn scenarios presents major challenges. This paper examines the interplay between KV cache management strategies, the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct, and the often-overlooked integrity of positional encodings. Through empirical analysis using a stateful benchmarking framework, we show that LLM generation quality degrades sharply when the accumulated KV cache approaches or exceeds the model’s trained context window (e.g., 8192 tokens for Llama 3), a failure mode distinct from GPU memory exhaustion. Common eviction strategies, even high-retention ones (e.g., 99% via AttentionTop), can worsen performance if they disrupt positional coherence. Because LLMs rely on consistent positional signals (e.g., RoPE), compacting a cache by removing non-contiguous tokens can scramble these signals and lead to degenerative outputs. We further show that simple strategies preserving contiguous context blocks (e.g., keeping an initial “gist”) can yield more coherent generations than complex or positionally disruptive ones. We advocate for eviction techniques that respect architectural limits, preserve positional structure, and view “cache health” holistically beyond mere size.
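The contrast between contiguous and positionally disruptive eviction can be made concrete with a toy model that tracks only the original token positions of the cache entries that survive. This is a minimal sketch under stated assumptions: `keep_top_scored` is a generic score-based stand-in rather than the AttentionTop method named in the abstract, and keeping a block of recent tokens alongside the initial "gist" is an assumption of this example.

```python
def keep_gist_and_recent(positions, gist_len, recent_len):
    """Keep the first gist_len and last recent_len entries (two contiguous blocks)."""
    n = len(positions)
    if gist_len + recent_len >= n:
        return list(positions)
    return list(positions[:gist_len]) + list(positions[n - recent_len:])

def keep_top_scored(positions, scores, keep):
    """Keep the `keep` highest-scoring entries, preserving their original order."""
    ranked = sorted(range(len(positions)), key=lambda i: scores[i], reverse=True)
    return [positions[i] for i in sorted(ranked[:keep])]

def count_gaps(kept_positions):
    """Number of places where consecutive kept entries are not adjacent positions."""
    return sum(1 for a, b in zip(kept_positions, kept_positions[1:]) if b != a + 1)

positions = list(range(10))
scores = [0.9, 0.1, 0.8, 0.1, 0.7, 0.1, 0.6, 0.1, 0.5, 0.1]  # made-up scores

contiguous = keep_gist_and_recent(positions, gist_len=2, recent_len=3)
scattered = keep_top_scored(positions, scores, keep=5)
print(contiguous, count_gaps(contiguous))  # [0, 1, 7, 8, 9] 1
print(scattered, count_gaps(scattered))    # [0, 2, 4, 6, 8] 4
```

Both strategies keep five of ten entries, but the score-based one leaves four positional discontinuities against one. The gap count is only a crude proxy for the loss of positional coherence the paper describes, not a metric taken from it.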

[201] Ada-FCN: Adaptive Frequency-Coupled Network for fMRI-Based Brain Disorder Classification

Yue Xun, Jiaxing Xu, Wenbo Gao, Chen Yang, Shujun Wang

Main category: cs.LG

TL;DR: Proposes a novel fMRI analysis framework with adaptive frequency decomposition and frequency-coupled connectivity learning for improved brain disorder diagnosis.

Details

Motivation: Existing fMRI models overlook multi-frequency nature of neuronal oscillations and treat BOLD signals as monolithic, limiting diagnostic sensitivity for disorders that manifest in specific frequency bands.

Method: Uses Adaptive Cascade Decomposition to learn task-relevant frequency sub-bands per brain region, Frequency-Coupled Connectivity Learning to capture intra- and cross-band interactions, and Unified-GCN with novel message-passing for diagnostic prediction.

Result: Demonstrates superior performance on ADNI and ABIDE datasets compared to existing methods.

Conclusion: The framework effectively captures frequency-specific alterations in brain disorders and provides improved diagnostic capabilities through adaptive frequency analysis.

Abstract: Resting-state fMRI has become a valuable tool for classifying brain disorders and constructing brain functional connectivity networks by tracking BOLD signals across brain regions. However, existing models largely neglect the multi-frequency nature of neuronal oscillations, treating BOLD signals as monolithic time series. This overlooks the crucial fact that neurological disorders often manifest as disruptions within specific frequency bands, limiting diagnostic sensitivity and specificity. While some methods have attempted to incorporate frequency information, they often rely on predefined frequency bands, which may not be optimal for capturing individual variability or disease-specific alterations. To address this, we propose a novel framework featuring Adaptive Cascade Decomposition to learn task-relevant frequency sub-bands for each brain region and Frequency-Coupled Connectivity Learning to capture both intra- and nuanced cross-band interactions in a unified functional network. This unified network informs a novel message-passing mechanism within our Unified-GCN, generating refined node representations for diagnostic prediction. Experimental results on the ADNI and ABIDE datasets demonstrate superior performance over existing methods. The code is available at https://github.com/XXYY20221234/Ada-FCN.

[202] AWEMixer: Adaptive Wavelet-Enhanced Mixer Network for Long-Term Time Series Forecasting

Qianyang Li, Xingjun Zhang, Peng Tao, Shaoxun Wang, Yancheng Pan, Jia Wei

Main category: cs.LG

TL;DR: AWEMixer is an Adaptive Wavelet-Enhanced Mixer Network that addresses long-term time series forecasting challenges in IoT by combining global frequency patterns with localized wavelet analysis to improve accuracy while reducing error accumulation.

Details

Motivation: Traditional time series forecasting methods struggle with non-stationary, multi-scale IoT sensor signals and suffer from error accumulation over long-term predictions. Existing frequency-based approaches using Fourier transform treat signals as stationary, blurring temporal patterns of transient events.

Method: AWEMixer uses two key components: 1) Frequency Router that leverages global periodicity from FFT to adaptively weight localized wavelet subbands, and 2) Coherent Gated Fusion Block that selectively integrates frequency features with multi-scale temporal representations using cross-attention and gating mechanisms.

Result: The model was validated on seven public benchmarks and consistently outperformed state-of-the-art transformer-based and MLP-based models in long-sequence time series forecasting, demonstrating improved accuracy and robustness to noise.

Conclusion: AWEMixer effectively addresses the limitations of traditional methods by achieving accurate time-frequency localization while maintaining robustness to noise, making it particularly suitable for IoT time series forecasting applications.

Abstract: Forecasting long-term time series in IoT environments remains a significant challenge due to the non-stationary and multi-scale characteristics of sensor signals. Furthermore, error accumulation degrades forecast quality when predicting further into the future. Traditional methods are restricted to operating in the time domain, while the global frequency information obtained by the Fourier transform treats signals as stationary, blurring the temporal patterns of transient events. We propose AWEMixer, an Adaptive Wavelet-Enhanced Mixer Network with two innovative components: 1) a Frequency Router designed to use the global periodicity pattern obtained by the Fast Fourier Transform to adaptively weight localized wavelet subbands, and 2) a Coherent Gated Fusion Block that selectively integrates prominent frequency features with multi-scale temporal representations through cross-attention and gating mechanisms, achieving accurate time-frequency localization while remaining robust to noise. Seven public benchmarks validate that our model is more effective than recent state-of-the-art models. Specifically, our model consistently improves on transformer-based and MLP-based state-of-the-art models in long-sequence time series forecasting. Code is available at https://github.com/hit636/AWEMixer

[203] Model Merging Improves Zero-Shot Generalization in Bioacoustic Foundation Models

Davide Marincione, Donato Crisostomi, Roberto Dessi, Emanuele Rodolà, Emanuele Rossi

Main category: cs.LG

TL;DR: NatureLM, a bioacoustic foundation model, loses instruction-following flexibility after domain-specific fine-tuning. A simple model merging strategy with its base language model recovers these capabilities while maintaining domain expertise and significantly improves zero-shot generalization.

Details

Motivation: Foundation models like NatureLM show promise for bioacoustics but face trade-offs between domain expertise and instruction-following flexibility after fine-tuning.

Method: Applied model merging strategy that interpolates NatureLM with its base language model to balance domain expertise and instruction-following capabilities.

Result: Recovered instruction-following flexibility with minimal domain knowledge loss, achieving over 200% relative improvement in zero-shot classification of unseen species and setting new state-of-the-art.

Conclusion: Simple model merging effectively addresses the trade-off between domain-specific performance and general instruction-following capabilities in bioacoustic foundation models.

Abstract: Foundation models capable of generalizing across species and tasks represent a promising new frontier in bioacoustics, with NatureLM being one of the most prominent examples. While its domain-specific fine-tuning yields strong performance on bioacoustic benchmarks, we observe that it also introduces trade-offs in instruction-following flexibility. For instance, NatureLM achieves high accuracy when prompted for either the common or scientific name individually, but its accuracy drops significantly when both are requested in a single prompt. We address this by applying a simple model merging strategy that interpolates NatureLM with its base language model, recovering instruction-following capabilities with minimal loss of domain expertise. Finally, we show that the merged model exhibits markedly stronger zero-shot generalization, achieving over a 200% relative improvement and setting a new state-of-the-art in closed-set zero-shot classification of unseen species.
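The merging strategy described above interpolates between a fine-tuned model and its base model. A minimal sketch of linear weight interpolation follows, with each model represented as a dictionary mapping parameter names to flat lists of floats. The representation, the function name, and the convention that `alpha` is the weight on the fine-tuned model are assumptions for this example; the abstract does not say which parameters are merged or which coefficient is used.

```python
def interpolate_weights(base, finetuned, alpha):
    """Return (1 - alpha) * base + alpha * finetuned for every parameter.

    alpha = 0.0 reproduces the base model, alpha = 1.0 the fine-tuned one.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("models must have the same parameter names")
    merged = {}
    for name in base:
        b, f = base[name], finetuned[name]
        if len(b) != len(f):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(b, f)]
    return merged

base_lm = {"layer.weight": [0.0, 2.0]}    # toy stand-in for the base LM
domain_lm = {"layer.weight": [1.0, 4.0]}  # toy stand-in for the fine-tuned model
print(interpolate_weights(base_lm, domain_lm, alpha=0.25))
# {'layer.weight': [0.25, 2.5]}
```

Sweeping `alpha` between 0 and 1 is one simple way to explore the trade-off between instruction following and domain expertise that the summary describes.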

[204] Multi-Agent Craftax: Benchmarking Open-Ended Multi-Agent Reinforcement Learning at the Hyperscale

Bassel Al Omari, Michael Matthews, Alexander Rutherford, Jakob Nicolaus Foerster

Main category: cs.LG

TL;DR: Craftax-MA and Craftax-Coop are new multi-agent RL benchmarks that address limitations of existing benchmarks by testing long-term dependencies and generalization in open-ended environments with fast JAX implementation.

Details

Motivation: Existing MARL benchmarks are too narrow and short-horizon, failing to adequately test long-term dependencies and generalization capabilities needed for real multi-agent systems.

Method: Extended the popular Craftax environment to support multiple agents (Craftax-MA) and added heterogeneous agents, trading, and complex cooperation mechanics (Craftax-Coop), implemented in JAX for exceptional speed.

Result: Training runs with 250 million environment interactions complete in under an hour. Analysis shows existing algorithms struggle with long-horizon credit assignment, exploration, and cooperation in these benchmarks.

Conclusion: Craftax-MA and Craftax-Coop provide compelling challenges that can drive long-term research in multi-agent reinforcement learning by exposing limitations of current methods.

Abstract: Progress in multi-agent reinforcement learning (MARL) requires challenging benchmarks that assess the limits of current methods. However, existing benchmarks often target narrow short-horizon challenges that do not adequately stress the long-term dependencies and generalization capabilities inherent in many multi-agent systems. To address this, we first present Craftax-MA: an extension of the popular open-ended RL environment, Craftax, that supports multiple agents and evaluates a wide range of general abilities within a single environment. Written in JAX, Craftax-MA is exceptionally fast with a training run using 250 million environment interactions completing in under an hour. To provide a more compelling challenge for MARL, we also present Craftax-Coop, an extension introducing heterogeneous agents, trading and more mechanics that require complex cooperation among agents for success. We provide analysis demonstrating that existing algorithms struggle with key challenges in this benchmark, including long-horizon credit assignment, exploration and cooperation, and argue for its potential to drive long-term research in MARL.

[205] Temporal convolutional and fusional transformer model with Bi-LSTM encoder-decoder for multi-time-window remaining useful life prediction

Mohamadreza Akbari Pour, Mohamad Sadeq Karimi, Amir Hossein Mazloumi

Main category: cs.LG

TL;DR: Novel framework combining Temporal Convolutional Networks with modified Temporal Fusion Transformer and Bi-LSTM for improved Remaining Useful Life prediction, achieving 5.5% RMSE reduction.

Details

Motivation: Existing RUL prediction models struggle with capturing fine-grained temporal dependencies and dynamically prioritizing critical features across time for robust prognostics in industrial systems.

Method: Integrates Temporal Convolutional Networks for localized temporal feature extraction with a modified Temporal Fusion Transformer enhanced by Bi-LSTM encoder-decoder, using multi-time-window methodology for adaptability.

Result: Extensive evaluations show the model reduces average RMSE by up to 5.5% compared to state-of-the-art methods, demonstrating improved predictive accuracy.

Conclusion: The framework advances industrial prognostic systems by closing critical gaps in current approaches and highlights the potential of advanced time-series transformers for RUL prediction.

Abstract: Health prediction is crucial for ensuring reliability, minimizing downtime, and optimizing maintenance in industrial systems. Remaining Useful Life (RUL) prediction is a key component of this process; however, many existing models struggle to capture fine-grained temporal dependencies while dynamically prioritizing critical features across time for robust prognostics. To address these challenges, we propose a novel framework that integrates Temporal Convolutional Networks (TCNs) for localized temporal feature extraction with a modified Temporal Fusion Transformer (TFT) enhanced by Bi-LSTM encoder-decoder. This architecture effectively bridges short- and long-term dependencies while emphasizing salient temporal patterns. Furthermore, the incorporation of a multi-time-window methodology improves adaptability across diverse operating conditions. Extensive evaluations on benchmark datasets demonstrate that the proposed model reduces the average RMSE by up to 5.5%, underscoring its improved predictive accuracy compared to state-of-the-art methods. By closing critical gaps in current approaches, this framework advances the effectiveness of industrial prognostic systems and highlights the potential of advanced time-series transformers for RUL prediction.

[206] Regularized GLISp for sensor-guided human-in-the-loop optimization

Matteo Cercola, Michele Lomuscio, Dario Piga, Simone Formentin

Main category: cs.LG

TL;DR: A sensor-guided extension of GLISp that integrates sensor measurements into preference-based optimization, combining subjective human feedback with quantitative sensor data for faster convergence and better solutions.

Details

Motivation: Existing preference-based optimization methods treat systems as black boxes and ignore informative sensor measurements, missing opportunities to leverage quantitative data alongside subjective human preferences.

Method: Extends GLISp with sensor-guided regularization using a physics-informed hypothesis function and least-squares regularization term to integrate measurable descriptors into the preference-learning loop.

Result: Numerical evaluations on analytical benchmarks and human-in-the-loop vehicle suspension tuning show faster convergence and superior final solutions compared to baseline GLISp.

Conclusion: The proposed sensor-guided regularized GLISp effectively combines subjective human feedback with quantitative sensor information, providing faster convergence and better solutions while preserving preference-based search flexibility.

Abstract: Human-in-the-loop calibration is often addressed via preference-based optimization, where algorithms learn from pairwise comparisons rather than explicit cost evaluations. While effective, methods such as Preferential Bayesian Optimization or Global optimization based on active preference learning with radial basis functions (GLISp) treat the system as a black box and ignore informative sensor measurements. In this work, we introduce a sensor-guided regularized extension of GLISp that integrates measurable descriptors into the preference-learning loop through a physics-informed hypothesis function and a least-squares regularization term. This injects grey-box structure, combining subjective feedback with quantitative sensor information while preserving the flexibility of preference-based search. Numerical evaluations on an analytical benchmark and on a human-in-the-loop vehicle suspension tuning task show faster convergence and superior final solutions compared to baseline GLISp.

[207] When Data Falls Short: Grokking Below the Critical Threshold

Vaibhav Singh, Eugene Belilovsky, Rahaf Aljundi

Main category: cs.LG

TL;DR: Knowledge Distillation from grokked models enables and accelerates grokking in data-scarce regimes and distribution shift scenarios, overcoming limitations of standard training.

Details

Motivation: To address the challenge of grokking (delayed generalization) in data-scarce regimes and practical scenarios with distribution shift, where standard training fails due to insufficient data below critical thresholds.

Method: Used Knowledge Distillation (KD) from models that have already grokked on one distribution to induce grokking on different distributions, studied joint distribution training, and examined continual pretraining setups with distribution transitions.

Result: KD successfully induced and accelerated grokking on new distributions even with data below critical thresholds, enabled generalization in joint distribution training where standard methods failed, and mitigated catastrophic forgetting in continual learning while achieving strong performance with only 10% of data.

Conclusion: Knowledge Distillation plays a central role in enabling generalization in low-data and evolving distribution settings, providing new insights into grokking mechanics under knowledge transfer.

Abstract: In this paper, we investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data. We focus on data-scarce regimes where the number of training samples falls below the critical threshold, making grokking unobservable, and on practical scenarios involving distribution shift. We first show that Knowledge Distillation (KD) from a model that has already grokked on a distribution (p1) can induce and accelerate grokking on a different distribution (p2), even when the available data lies below the critical threshold. This highlights the value of KD for deployed models that must adapt to new distributions under limited data. We then study training on the joint distribution (p1, p2) and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the individual distributions enables generalization. Finally, we examine a continual pretraining setup, where a grokked model transitions from p1 to p2, and find that KD both accelerates generalization and mitigates catastrophic forgetting, achieving strong performance even with only 10% of the data. Together, our results provide new insights into the mechanics of grokking under knowledge transfer and underscore the central role of KD in enabling generalization in low-data and evolving distribution settings.
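For readers unfamiliar with Knowledge Distillation, the sketch below shows the widely used temperature-scaled formulation: a KL term between softened teacher and student distributions, scaled by T², mixed with an ordinary cross-entropy term on the hard label. The abstract does not specify the exact loss used in the paper, so the formulation, the `temperature` and `lam` values, and the function names are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stabilized softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, lam=0.5):
    """lam * T^2 * KL(teacher || student) + (1 - lam) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0.0)
    ce = -math.log(softmax(student_logits)[label])
    return lam * temperature ** 2 * kl + (1.0 - lam) * ce
```

In the setting described above, the teacher would be a model that has already grokked on p1 and the student a model trained on limited data from p2; how teacher and student inputs are paired across distributions is not covered by this sketch.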

[208] FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow

Rubens Lacouture, Nathan Zhang, Ritvik Sharma, Marco Siracusa, Fredrik Kjolstad, Kunle Olukotun, Olivia Hsu

Main category: cs.LG

TL;DR: FuseFlow is a compiler that converts sparse PyTorch models to fused sparse dataflow graphs for reconfigurable dataflow architectures, supporting cross-expression fusion and achieving ~2.7x speedup for GPT-3 with BigBird attention.

Details

Motivation: As deep learning models scale, sparse computation and specialized dataflow hardware are needed to address efficiency challenges in large-scale machine learning.

Method: FuseFlow compiler converts sparse PyTorch models to fused sparse dataflow graphs, supporting cross-expression fusion, parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis.

Result: FuseFlow enables design-space exploration showing full fusion isn’t always optimal - fusion granularity depends on the model. Achieves ~2.7x speedup over unfused baseline for GPT-3 with BigBird block-sparse attention.

Conclusion: FuseFlow provides a systematic approach to sparse computation optimization with cross-expression fusion, demonstrating that optimal fusion strategies are model-dependent and offering heuristics to prune suboptimal configurations.

Abstract: As deep learning models scale, sparse computation and specialized dataflow hardware have emerged as powerful solutions to address efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch to fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. In addition to fusion across kernels (expressions), FuseFlow also supports optimizations like parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (entire cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models: fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using FuseFlow, we achieve performance improvements, including a ~2.7x speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.

[209] SLOFetch: Compressed-Hierarchical Instruction Prefetching for Cloud Microservices

Liu Jiang, Zerui Bao, Shiqi Sheng, Di Zhu

Main category: cs.LG

TL;DR: This paper introduces an enhanced instruction prefetcher for cloud workloads that reduces on-chip storage while maintaining performance through compressed entries, hierarchical metadata storage, and an online ML controller.

Details

Motivation: Large-scale networked services with deep software stacks and microservice orchestration create increased instruction footprints and frontend stalls, leading to inflated tail latency and energy consumption in cloud environments.

Method: The design builds on EIP with three key innovations: 1) Compressed Entry that captures up to eight destinations using 36 bits by exploiting spatial clustering, 2) Hierarchical Metadata Storage that keeps only L1 resident and frequently queried entries on-chip while virtualizing bulk metadata, 3) Lightweight Online ML Controller that scores prefetch profitability using context features and bandit-adjusted threshold.

Result: The approach preserves EIP-like speedups while using smaller on-chip state and improves efficiency for networked services in the ML era.

Conclusion: The proposed instruction prefetching design effectively addresses the challenges of cloud workloads by reducing storage overhead while maintaining performance through intelligent compression and adaptive control mechanisms.

Abstract: Large-scale networked services rely on deep software stacks and microservice orchestration, which increase instruction footprints and create frontend stalls that inflate tail latency and energy. We revisit instruction prefetching for these cloud workloads and present a design that aligns with SLO-driven and self-optimizing systems. Building on the Entangling Instruction Prefetcher (EIP), we introduce a Compressed Entry that captures up to eight destinations around a base using 36 bits by exploiting spatial clustering, and a Hierarchical Metadata Storage scheme that keeps only L1-resident and frequently queried entries on-chip while virtualizing bulk metadata into lower levels. We further add a lightweight Online ML Controller that scores prefetch profitability using context features and a bandit-adjusted threshold. On data center applications, our approach preserves EIP-like speedups with smaller on-chip state and improves efficiency for networked services in the ML era.

[210] Learning from Delayed Feedback in Games via Extra Prediction

Yuma Fujimoto, Kenshi Abe, Kaito Ariu

Main category: cs.LG

TL;DR: This paper addresses time-delayed feedback in multi-agent learning games by proposing Weighted Optimistic Follow-the-Regularized-Leader (WOFTRL), which uses weighted predictions to overcome performance degradation caused by time delays.

Details

Motivation: Time-delayed feedback in multi-agent learning causes performance degradation in existing algorithms like OFTRL, even with single-step delays, worsening social regret and convergence.

Method: Proposed Weighted OFTRL (WOFTRL) where the prediction vector of next reward in OFTRL is weighted n times, with the intuition that optimistic weight cancels out time delay.

Result: When optimistic weight exceeds time delay, WOFTRL achieves constant social regret in general-sum normal-form games and last-iterate convergence to Nash equilibrium in poly-matrix zero-sum games.

Conclusion: WOFTRL effectively overcomes time-delayed feedback issues in multi-agent learning, with theoretical results supported by experiments showing recovery of good performance when optimistic weight exceeds delay.

Abstract: This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, the time delay in observing the past rewards hinders the prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of social regret and convergence. This study proposes the weighted OFTRL (WOFTRL), where the prediction vector of the next reward in OFTRL is weighted $n$ times. We further capture an intuition that the optimistic weight cancels out this time delay. We prove that when the optimistic weight exceeds the time delay, our WOFTRL recovers good performance: social regret is constant in general-sum normal-form games, and the strategies converge in the last iterate to the Nash equilibrium in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
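
The weighting idea can be illustrated with a small sketch. It assumes a two-player zero-sum matrix game, an entropic regularizer (so the FTRL update is a softmax), and a fixed delay after which rewards become visible; the function name `woftrl`, the step size `eta`, and the exact way the delay is indexed are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def woftrl(A, steps=300, eta=0.1, delay=1, weight=2.0):
    """Toy weighted-optimistic FTRL on a zero-sum matrix game.

    The row player receives x^T A y, the column player receives -x^T A y.
    At round t only the rewards of rounds 0..t-1-delay are visible
    (assumed delay model). The most recent visible reward vector is added
    `weight` extra times as the optimistic prediction.
    """
    n_rows, n_cols = A.shape
    rewards_x, rewards_y = [], []
    trajectory = []
    for t in range(steps):
        visible = max(t - delay, 0)
        if visible == 0:
            score_x = np.zeros(n_rows)
            score_y = np.zeros(n_cols)
        else:
            # cumulative visible reward + weighted prediction of the next reward
            score_x = np.sum(rewards_x[:visible], axis=0) + weight * rewards_x[visible - 1]
            score_y = np.sum(rewards_y[:visible], axis=0) + weight * rewards_y[visible - 1]
        x = softmax(eta * score_x)
        y = softmax(eta * score_y)
        rewards_x.append(A @ y)       # reward vector of the row player
        rewards_y.append(-A.T @ x)    # reward vector of the column player
        trajectory.append((x, y))
    return trajectory

A = np.array([[2.0, -1.0], [-1.0, 1.0]])
trajectory = woftrl(A, steps=50, delay=1, weight=2.0)
```

Setting `weight=1.0` and `delay=0` gives a plain optimistic update in this toy model, which makes it easy to compare trajectories with and without delay.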

[211] Conditional Neural ODE for Longitudinal Parkinson’s Disease Progression Forecasting

Xiaoda Wang, Yuji Zhao, Kaiqiao Han, Xiao Luo, Sanne van Rooij, Jennifer Stevens, Lifang He, Liang Zhan, Yizhou Sun, Wei Wang, Carl Yang

Main category: cs.LG

TL;DR: CNODE (Conditional Neural ODE) is a novel framework that models Parkinson’s disease progression as continuous temporal processes using neural ODEs, enabling individualized forecasting of brain morphological changes despite irregular and sparse MRI data.

Details

Motivation: Existing methods struggle with irregular and sparse MRI data in PD cohorts and have difficulty capturing individual heterogeneity in disease onset, progression rate, and symptom severity, which is a hallmark of Parkinson's disease.

Method: The core method uses neural ODEs to model morphological brain changes as continuous temporal processes, while jointly learning patient-specific initial time and progress speed to align individual trajectories into a shared progression trajectory.

Result: Experimental validation on the Parkinson’s Progression Markers Initiative (PPMI) dataset shows that CNODE outperforms state-of-the-art baselines in forecasting longitudinal PD progression.

Conclusion: CNODE provides an effective framework for continuous, individualized PD progression forecasting that addresses the limitations of existing methods in handling irregular data and capturing patient heterogeneity.

Abstract: Parkinson’s disease (PD) shows heterogeneous, evolving brain-morphometry patterns. Modeling these longitudinal trajectories enables mechanistic insight, treatment development, and individualized ‘digital-twin’ forecasting. However, existing methods usually adopt recurrent neural networks and transformer architectures, which rely on discrete, regularly sampled data while struggling to handle irregular and sparse magnetic resonance imaging (MRI) in PD cohorts. Moreover, these methods have difficulty capturing individual heterogeneity including variations in disease onset, progression rate, and symptom severity, which is a hallmark of PD. To address these challenges, we propose CNODE (Conditional Neural ODE), a novel framework for continuous, individualized PD progression forecasting. The core of CNODE is to model morphological brain changes as continuous temporal processes using a neural ODE model. In addition, we jointly learn patient-specific initial time and progress speed to align individual trajectories into a shared progression trajectory. We validate CNODE on the Parkinson’s Progression Markers Initiative (PPMI) dataset. Experimental results show that our method outperforms state-of-the-art baselines in forecasting longitudinal PD progression.

[212] Causal Structure and Representation Learning with Biomedical Applications

Caroline Uhler, Jiaqi Zhang

Main category: cs.LG

TL;DR: The paper proposes integrating representation learning with causal inference to address limitations of current representation learning in causal tasks, using multi-modal biomedical data for causal discovery and optimal perturbation design.

Details

Motivation: Current representation learning excels in predictive tasks but fails in causal tasks like predicting intervention effects. The growing availability of multi-modal biomedical data presents an opportunity to bridge representation learning and causal inference.

Method: A statistical and computational framework that leverages multi-modal data (observational and perturbational, various biological levels) for causal structure learning, causal variable discovery, and optimal perturbation design.

Result: The paper outlines a framework but does not present specific experimental results in this abstract.

Conclusion: There is a critical need to combine representation learning with causal inference, particularly in biomedical contexts where multi-modal data can enable effective causal discovery and intervention analysis.

Abstract: Massive data collection holds the promise of a better understanding of complex phenomena and, ultimately, better decisions. Representation learning has become a key driver of deep learning applications, as it allows learning latent spaces that capture important properties of the data without requiring any supervised annotations. Although representation learning has been hugely successful in predictive tasks, it can fail miserably in causal tasks including predicting the effect of a perturbation/intervention. This calls for a marriage between representation learning and causal inference. An exciting opportunity in this regard stems from the growing availability of multi-modal data (observational and perturbational, imaging-based and sequencing-based, at the single-cell level, tissue-level, and organism-level). We outline a statistical and computational framework for causal structure and representation learning motivated by fundamental biomedical questions: how to effectively use observational and perturbational data to perform causal discovery on observed causal variables; how to use multi-modal views of the system to learn causal variables; and how to design optimal perturbations.

[213] DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

Lei Gao, Chaoyi Jiang, Hossein Entezari Zarch, Daniel Wong, Murali Annavaram

Main category: cs.LG

TL;DR: DuetServe is a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU by dynamically activating SM-level spatial multiplexing when needed, improving throughput by 1.3x while maintaining low latency.

Details

Motivation: Existing LLM serving systems either aggregate prefill and decode phases on shared GPUs (causing interference and degraded latency) or disaggregate them across GPUs (wasting resources through duplicated models and KV cache transfers).

Method: DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. It uses an attention-aware roofline model, partitioning optimizer, and interruption-free execution engine to provide fine-grained, adaptive SM partitioning.

Result: DuetServe improves total throughput by up to 1.3x while maintaining low generation latency compared to state-of-the-art frameworks.

Conclusion: DuetServe successfully achieves disaggregation-level isolation within a single GPU, providing the benefits of disaggregation without resource waste, making it an effective unified LLM serving framework.

Abstract: Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades time-between-tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources through duplicated models and KV cache transfers. We present DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU. DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. Its key idea is to decouple prefill and decode execution only when needed through fine-grained, adaptive SM partitioning that provides phase isolation only when contention threatens latency service level objectives (SLOs). DuetServe integrates (1) an attention-aware roofline model to forecast iteration latency, (2) a partitioning optimizer that selects the optimal SM split to maximize throughput under TBT constraints, and (3) an interruption-free execution engine that eliminates CPU-GPU synchronization overhead. Evaluations show that DuetServe improves total throughput by up to 1.3x while maintaining low generation latency compared to state-of-the-art frameworks.
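
The partitioning step can be pictured with a toy sketch. It assumes a plain roofline latency estimate (the maximum of compute time and memory time), compute throughput that scales linearly with the number of SMs, and memory time that does not depend on the split; `roofline_latency`, `choose_split`, and all numbers are hypothetical and far simpler than the attention-aware model the abstract describes.

```python
def roofline_latency(flops, bytes_moved, sms, flops_per_sm, bandwidth):
    """Toy roofline: an iteration is bound by compute or by memory traffic."""
    return max(flops / (sms * flops_per_sm), bytes_moved / bandwidth)

def choose_split(total_sms, prefill, decode, tbt_slo, flops_per_sm, bandwidth):
    """Enumerate SM splits; keep those whose decode latency meets the TBT SLO
    and return the one with the lowest prefill latency.

    `prefill` and `decode` are (flops, bytes_moved) tuples (assumed inputs).
    Returns (prefill_sms, decode_sms, prefill_latency, decode_latency) or None.
    """
    best = None
    for decode_sms in range(1, total_sms):
        prefill_sms = total_sms - decode_sms
        t_dec = roofline_latency(*decode, decode_sms, flops_per_sm, bandwidth)
        if t_dec > tbt_slo:
            continue  # this split would violate the latency objective
        t_pre = roofline_latency(*prefill, prefill_sms, flops_per_sm, bandwidth)
        if best is None or t_pre < best[2]:
            best = (prefill_sms, decode_sms, t_pre, t_dec)
    return best

split = choose_split(
    total_sms=8,
    prefill=(60.0, 10.0),   # compute-heavy phase (made-up units)
    decode=(4.0, 10.0),     # memory-heavy phase (made-up units)
    tbt_slo=2.0,
    flops_per_sm=1.0,
    bandwidth=10.0,
)
```

In this toy setting the search gives decode just enough SMs to meet the SLO and hands the rest to prefill.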

[214] Simplex-FEM Networks (SiFEN): Learning A Triangulated Function Approximator

Chaymae Yahyati, Ismail Lamaakal, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh

Main category: cs.LG

TL;DR: SiFEN is a learned piecewise-polynomial predictor using finite-element fields on learned simplicial meshes, providing explicit locality, controllable smoothness, and cache-friendly sparsity while matching or surpassing MLPs and KANs in performance.

Details

Motivation: To create a compact, interpretable, and theoretically grounded alternative to dense MLPs and edge-spline networks with explicit locality, controllable smoothness, and better computational efficiency.

Method: Uses degree-m Bernstein-Bezier polynomials on learned simplicial meshes with light invertible warp, trained end-to-end with shape regularization, semi-discrete OT coverage, and differentiable edge flips.

Result: Achieves classic FEM approximation rate M^(-m/d) with M mesh vertices, matches or surpasses MLPs and KANs at matched parameter budgets, improves calibration (lower ECE/Brier), and reduces inference latency.

Conclusion: SiFEN provides a compact, interpretable, and theoretically grounded alternative to dense MLPs with better computational efficiency and performance characteristics.

Abstract: We introduce Simplex-FEM Networks (SiFEN), a learned piecewise-polynomial predictor that represents f: R^d -> R^k as a globally C^r finite-element field on a learned simplicial mesh in an optionally warped input space. Each query activates exactly one simplex and at most d+1 basis functions via barycentric coordinates, yielding explicit locality, controllable smoothness, and cache-friendly sparsity. SiFEN pairs degree-m Bernstein-Bezier polynomials with a light invertible warp and trains end-to-end with shape regularization, semi-discrete OT coverage, and differentiable edge flips. Under standard shape-regularity and bi-Lipschitz warp assumptions, SiFEN achieves the classic FEM approximation rate M^(-m/d) with M mesh vertices. Empirically, on synthetic approximation tasks, tabular regression/classification, and as a drop-in head on compact CNNs, SiFEN matches or surpasses MLPs and KANs at matched parameter budgets, improves calibration (lower ECE/Brier), and reduces inference latency due to geometric locality. These properties make SiFEN a compact, interpretable, and theoretically grounded alternative to dense MLPs and edge-spline networks.
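
The locality property (one active simplex, at most d+1 basis functions) can be seen in a minimal sketch of the degree-1 case. It assumes a fixed mesh, no warp, brute-force point location, and piecewise-linear elements rather than the degree-m Bernstein-Bezier polynomials used in the paper; `barycentric` and `evaluate` are illustrative names.

```python
import numpy as np

def barycentric(vertices, x):
    """Barycentric coordinates of x in the simplex with the given d+1 vertices.

    Solves sum_i lam_i * v_i = x together with sum_i lam_i = 1.
    """
    d = x.shape[0]
    system = np.vstack([vertices.T, np.ones(d + 1)])
    rhs = np.append(x, 1.0)
    return np.linalg.solve(system, rhs)

def evaluate(mesh_vertices, simplices, values, x, tol=1e-9):
    """Piecewise-linear finite-element field: find the simplex containing x,
    then blend its d+1 vertex values with the barycentric weights."""
    for simplex in simplices:
        lam = barycentric(mesh_vertices[simplex], x)
        if np.all(lam >= -tol):          # x lies inside this simplex
            return lam @ values[simplex]
    raise ValueError("query point lies outside the mesh")

# Unit square split into two triangles.
mesh_vertices = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
simplices = np.array([[0, 1, 2], [0, 2, 3]])
# Vertex values sampled from the linear function f(u, v) = 2u + 3v + 1.
values = 2 * mesh_vertices[:, 0] + 3 * mesh_vertices[:, 1] + 1

result = evaluate(mesh_vertices, simplices, values, np.array([0.25, 0.5]))
```

Because linear elements reproduce linear functions exactly, `result` matches f(0.25, 0.5) up to floating-point error.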

[215] PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference

Yushu Zhao, Zheng Wang, Minjia Zhang

Main category: cs.LG

TL;DR: PuzzleMoE is a training-free compression method for Mixture-of-Experts models that achieves 50% compression while maintaining accuracy through sparse expert merging and bit-packed encoding.

Details

Motivation: MoE models have high memory overhead due to storing all expert parameters, limiting deployment despite their efficiency benefits. Existing compression methods suffer from performance drops at high compression ratios.

Method: Uses dual-mask approach to identify element-wise weight redundancy and specialization, then employs bit-packed encoding that reuses underutilized exponent bits to avoid storage overhead.

Result: Achieves up to 50% compression while maintaining accuracy across tasks, outperforms prior methods by up to 16.7% on MMLU at 50% compression, and achieves 1.28× inference speedup.

Conclusion: PuzzleMoE enables efficient MoE inference on GPUs with significant compression and speed improvements while preserving model accuracy.

Abstract: Mixture-of-Experts (MoE) models have shown strong potential in scaling language models efficiently by activating only a small subset of experts per input. However, their widespread deployment remains limited due to the high memory overhead associated with storing all expert parameters, particularly as the number of experts increases. To address this challenge, prior works have explored expert dropping and merging strategies, yet they often suffer from performance drops at high compression ratios. In this paper, we introduce PuzzleMoE, a training-free MoE compression method that achieves both high accuracy and efficient inference through two key innovations: First, PuzzleMoE performs sparse expert merging by identifying element-wise weight redundancy and specialization. It uses a dual-mask to capture both shared and expert-specific parameters. Second, to avoid the overhead of storing binary masks and signs, PuzzleMoE introduces a bit-packed encoding scheme that reuses underutilized exponent bits, enabling efficient MoE inference on GPUs. Extensive experiments demonstrate that PuzzleMoE can compress MoE models by up to 50% while maintaining accuracy across various tasks. Specifically, it outperforms prior MoE compression methods by up to 16.7% on MMLU at 50% compression ratio, and achieves up to 1.28× inference speedup.

[216] Autoencoding Dynamics: Topological Limitations and Capabilities

Matthew D. Kvalheim, Eduardo D. Sontag

Main category: cs.LG

TL;DR: The paper analyzes topological limitations and capabilities of autoencoders for data manifolds, and extends the framework to dynamical systems with invariant manifolds.

Details

Motivation: To understand the fundamental topological constraints and possibilities when constructing autoencoders that map between data manifolds and latent spaces, particularly for applications in dynamical systems.

Method: Theoretical analysis of continuous mappings (encoder E and decoder D) between data manifold M in R^n and latent space R^l, with focus on topological properties and the requirement that D∘E approximates the identity on M.

Result: Identifies various topological limitations that constrain autoencoder design, while also revealing capabilities for representing data manifolds and extending the framework to handle dynamical systems with invariant manifolds.

Conclusion: Autoencoders face inherent topological constraints in their mapping structure, but these theoretical insights enable applications to dynamical systems where the data manifold serves as an invariant manifold.

Abstract: Given a “data manifold” $M\subset \mathbb{R}^n$ and “latent space” $\mathbb{R}^\ell$, an autoencoder is a pair of continuous maps consisting of an “encoder” $E\colon \mathbb{R}^n\to \mathbb{R}^\ell$ and “decoder” $D\colon \mathbb{R}^\ell\to \mathbb{R}^n$ such that the “round trip” map $D\circ E$ is as close as possible to the identity map $\mbox{id}_M$ on $M$. We present various topological limitations and capabilities inherent to the search for an autoencoder, and describe capabilities for autoencoding dynamical systems having $M$ as an invariant manifold.
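
A standard textbook example shows the kind of obstruction at stake. It is offered as an illustration of why exact reconstruction can be topologically impossible, and is not necessarily an example used in the paper: a circle cannot be autoencoded exactly through a one-dimensional latent space.

```latex
\textbf{Claim.} Let $M = S^1 \subset \mathbb{R}^2$ and $\ell = 1$. There are no
continuous maps $E\colon \mathbb{R}^2 \to \mathbb{R}$ and
$D\colon \mathbb{R} \to \mathbb{R}^2$ with $D \circ E = \mathrm{id}_M$ on $M$.

\textbf{Sketch.} If $D \circ E = \mathrm{id}_M$ on $M$, then $E|_M$ is injective.
Since $S^1$ is compact and connected, $E(S^1)$ is a compact interval $[a, b]$,
and $a < b$ because $E|_M$ is injective. Choose $p \in S^1$ with
$E(p) = c \in (a, b)$. Then $S^1 \setminus \{p\}$ is connected, but its image
\[
  E\bigl(S^1 \setminus \{p\}\bigr) = [a, c) \cup (c, b]
\]
is disconnected, contradicting the continuity of $E$. Hence the round trip can
only approximate $\mathrm{id}_M$, never equal it, on all of $M$.
```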

[217] Sharp Minima Can Generalize: A Loss Landscape Perspective On Data

Raymond Fan, Bryce Sandlund, Lin Myat Ko

Main category: cs.LG

TL;DR: The paper challenges the volume hypothesis by showing that sharp minima can generalize well but are hard to find due to small volumes. Large datasets make these good minima more accessible by changing the loss landscape.

Details

Motivation: To investigate why large datasets improve generalization in deep learning, challenging the conventional volume hypothesis that only flat minima generalize well.

Method: Measuring minima volumes under varying amounts of training data and analyzing how data quantity affects the loss landscape structure.

Result: Found that sharp minima can generalize well but are unlikely to be found due to small volumes. Large datasets transform the loss landscape, making previously small generalizing minima become relatively large and more accessible.

Conclusion: Large datasets improve generalization not just by providing more examples, but by fundamentally reshaping the loss landscape to make good minima more discoverable.

Abstract: The volume hypothesis suggests deep learning is effective because it is likely to find flat minima due to their large volumes, and flat minima generalize well. This picture does not explain the role of large datasets in generalization. Measuring minima volumes under varying amounts of training data reveals sharp minima which generalize well exist, but are unlikely to be found due to their small volumes. Increasing data changes the loss landscape, such that previously small generalizing minima become (relatively) large.

[218] A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification

Sebastian Ojeda, Rafael Velasquez, Nicolás Aparicio, Juanita Puentes, Paula Cárdenas, Nicolás Andrade, Gabriel González, Sergio Rincón, Carolina Muñoz-Camargo, Pablo Arbeláez

Main category: cs.LG

TL;DR: ESCAPE is a comprehensive framework integrating over 80,000 peptides from 27 repositories with standardized annotations and multilabel functional classification, enabling improved AI-driven antimicrobial peptide discovery.

Details

Motivation: To address challenges in antimicrobial peptide research including fragmented datasets, inconsistent annotations, and lack of standardized benchmarks that hinder computational approaches and slow down discovery of new candidates.

Method: Created ESCAPE framework integrating peptides from validated repositories, separating antimicrobial peptides from negative sequences, and incorporating functional annotations into a biologically coherent multilabel hierarchy. Developed a transformer-based model leveraging sequence and structural information to predict multiple functional activities.

Result: The method achieves up to 2.56% relative average improvement in mean Average Precision over the second-best method adapted for this task, establishing a new state of the art in multilabel peptide classification.

Conclusion: ESCAPE provides a comprehensive and reproducible evaluation framework to advance AI-driven antimicrobial peptide research by addressing key challenges in data standardization and functional annotation.

Abstract: Antimicrobial peptides have emerged as promising molecules to combat antimicrobial resistance. However, fragmented datasets, inconsistent annotations, and the lack of standardized benchmarks hinder computational approaches and slow down the discovery of new candidates. To address these challenges, we present the Expanded Standardized Collection for Antimicrobial Peptide Evaluation (ESCAPE), an experimental framework integrating over 80,000 peptides from 27 validated repositories. Our dataset separates antimicrobial peptides from negative sequences and incorporates their functional annotations into a biologically coherent multilabel hierarchy, capturing activities across antibacterial, antifungal, antiviral, and antiparasitic classes. Building on ESCAPE, we propose a transformer-based model that leverages sequence and structural information to predict multiple functional activities of peptides. Our method achieves up to a 2.56% relative average improvement in mean Average Precision over the second-best method adapted for this task, establishing a new state of the art in multilabel peptide classification. ESCAPE provides a comprehensive and reproducible evaluation framework to advance AI-driven antimicrobial peptide research.

[219] Persistent reachability homology in machine learning applications

Luigi Caputi, Nicholas Meadows, Henri Riihimäki

Main category: cs.LG

TL;DR: Persistent reachability homology (PRH) outperforms traditional directed flag complex persistent homology (DPH) for epilepsy detection from directed graph data, using smaller condensed digraphs for computation.

Details

Motivation: To improve network classification in neuroscience, specifically epilepsy detection, by developing a more efficient persistent homology method for directed graph data.

Method: Used persistent reachability homology (PRH) which considers condensations of digraphs in persistent filtrations, computed from smaller digraphs. Compared PRH with directed flag complex persistent homology (DPH) using Betti curves and their integrals as features with SVM classifier.

Result: PRH outperformed DPH in the classification task for epilepsy detection from directed graph data.

Conclusion: Persistent reachability homology is more effective than traditional directed flag complex persistent homology for network classification tasks in neuroscience applications like epilepsy detection.

Abstract: We explore the recently introduced persistent reachability homology (PRH) of digraph data, i.e. data in the form of directed graphs. In particular, we study the effectiveness of PRH in a network classification task for a key neuroscience problem: epilepsy detection. PRH is a variation of the persistent homology of digraphs, more traditionally based on the directed flag complex (DPH). A main advantage of PRH is that it considers the condensations of the digraphs appearing in the persistent filtration and thus is computed from smaller digraphs. We compare the effectiveness of PRH to that of DPH and we show that PRH outperforms DPH in the classification task. We use the Betti curves and their integrals as topological features and implement our pipeline with a support vector machine.
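
The size reduction comes from condensation: every strongly connected component of a digraph is collapsed to a single vertex, leaving a smaller acyclic digraph. The sketch below computes a condensation with Kosaraju's two-pass algorithm; it covers only this preprocessing step, not the filtration or the homology computation, and the function name `condensation` is an illustrative choice.

```python
def condensation(n, edges):
    """Condensation of a digraph on vertices 0..n-1.

    Returns (component_of_vertex, number_of_components, dag_edges).
    """
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    # Pass 1: iterative DFS on the graph, recording finishing order.
    seen = [False] * n
    order = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, iter(adj[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if not seen[nxt]:
                    seen[nxt] = True
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                order.append(node)   # all neighbours explored
                stack.pop()

    # Pass 2: DFS on the reversed graph in reverse finishing order.
    comp = [-1] * n
    count = 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = count
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in radj[node]:
                if comp[nxt] == -1:
                    comp[nxt] = count
                    stack.append(nxt)
        count += 1

    # Keep only edges that connect different components.
    dag_edges = {(comp[u], comp[v]) for u, v in edges if comp[u] != comp[v]}
    return comp, count, dag_edges

# A 3-cycle feeding into a 2-cycle: five vertices condense to two.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)]
comp, count, dag_edges = condensation(5, edges)
```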

[220] Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models

Jiwoo Shin, Byeonghu Na, Mina Kang, Wonhyeok Choi, Il-chul Moon

Main category: cs.LG

TL;DR: The paper proposes replacing negative prompts with implicit negative embeddings via concept inversion to improve compatibility between fine-tuning and training-free methods for defending against harmful content in text-to-image models.

Details

Motivation: Current defense approaches (fine-tuning and training-free guidance) show incompatibility when combined, leading to degraded performance in preventing harmful content generation from malicious text prompts.

Method: Replace negative prompts used in training-free methods with implicit negative embeddings obtained through concept inversion, requiring no modification to existing approaches.

Result: Experimental validation on nudity and violence benchmarks shows consistent improvements in defense success rate while preserving input prompt semantics.

Conclusion: The proposed method effectively addresses incompatibility between orthogonal defense approaches and can be easily integrated into existing pipelines to enhance protection against harmful content generation.

Abstract: Recent advances in text-to-image generative models have raised concerns about their potential to produce harmful content when provided with malicious input text prompts. To address this issue, two main approaches have emerged: (1) fine-tuning the model to unlearn harmful concepts and (2) training-free guidance methods that leverage negative prompts. However, we observe that combining these two orthogonal approaches often leads to marginal or even degraded defense performance. This observation indicates a critical incompatibility between two paradigms, which hinders their combined effectiveness. In this work, we address this issue by proposing a conceptually simple yet experimentally robust method: replacing the negative prompts used in training-free methods with implicit negative embeddings obtained through concept inversion. Our method requires no modification to either approach and can be easily integrated into existing pipelines. We experimentally validate its effectiveness on nudity and violence benchmarks, demonstrating consistent improvements in defense success rate while preserving the core semantics of input prompts.

[221] SPECTRA: Spectral Target-Aware Graph Augmentation for Imbalanced Molecular Property Regression

Brenda Nogueira, Meng Jiang, Nitesh V. Chawla, Nuno Moniz

Main category: cs.LG

TL;DR: SPECTRA is a spectral graph augmentation framework that generates realistic molecular graphs to address data imbalance in molecular property prediction, particularly for rare high-potency compounds.

Details

Motivation: Standard GNNs underperform on rare but critical molecular compounds due to data imbalance, and existing oversampling methods often distort molecular topology.

Method: SPECTRA reconstructs molecular graphs from SMILES, aligns molecules via Gromov-Wasserstein couplings, interpolates Laplacian eigenvalues/eigenvectors/features in stable share-basis, reconstructs edges, and uses rarity-aware budgeting for targeted augmentation.

Result: SPECTRA consistently improves error in relevant target ranges while maintaining competitive overall MAE, and generates interpretable synthetic molecules that reflect spectral geometry.

Conclusion: Spectral, geometry-aware augmentation is an effective and efficient strategy for imbalanced molecular property regression.

Abstract: In molecular property prediction, the most valuable compounds (e.g., high potency) often occupy sparse regions of the target space. Standard Graph Neural Networks (GNNs) commonly optimize for the average error, underperforming on these uncommon but critical cases, with existing oversampling methods often distorting molecular topology. In this paper, we introduce SPECTRA, a Spectral Target-Aware graph augmentation framework that generates realistic molecular graphs in the spectral domain. SPECTRA (i) reconstructs multi-attribute molecular graphs from SMILES; (ii) aligns molecule pairs via (Fused) Gromov-Wasserstein couplings to obtain node correspondences; (iii) interpolates Laplacian eigenvalues, eigenvectors and node features in a stable share-basis; and (iv) reconstructs edges to synthesize physically plausible intermediates with interpolated targets. A rarity-aware budgeting scheme, derived from a kernel density estimation of labels, concentrates augmentation where data are scarce. Coupled with a spectral GNN using edge-aware Chebyshev convolutions, SPECTRA densifies underrepresented regions without degrading global accuracy. On benchmarks, SPECTRA consistently improves error in relevant target ranges while maintaining competitive overall MAE, and yields interpretable synthetic molecules whose structure reflects the underlying spectral geometry. Our results demonstrate that spectral, geometry-aware augmentation is an effective and efficient strategy for imbalanced molecular property regression.

[222] Sublinear iterations can suffice even for DDPMs

Matthew S. Zhang, Stephen Huan, Jerry Huang, Nicholas M. Boffi, Sitan Chen, Sinho Chewi

Main category: cs.LG

TL;DR: The paper introduces DDRaM, a new SDE-based sampling method for DDPMs that achieves sublinear complexity in dimension for convergence, improving upon prior linear dependency bounds.

Details

Motivation: Prior DDPM analyses showed convergence guarantees that depend linearly on dimension or initial Fisher information, which is suboptimal. The authors aim to develop a method with better discretization properties and sublinear complexity.

Method: Proposed DDRaM (denoising diffusion randomized midpoint method), an integrator that uses an additional randomized midpoint to better approximate the SDE. Analyzed using the “shifted composition rule” framework under smoothness assumptions.

Result: DDRaM needs only a sublinear Õ(√d) number of score evaluations to ensure convergence (Õ hides logarithmic factors), which is the first sublinear complexity bound for pure DDPM sampling. Experimental validation shows it performs well with pre-trained image synthesis models.

Conclusion: DDRaM provides improved discretization properties and sublinear complexity for DDPM sampling, offering practical advantages over prior methods while maintaining compatibility with existing models.

Abstract: SDE-based methods such as denoising diffusion probabilistic models (DDPMs) have shown remarkable success in real-world sample generation tasks. Prior analyses of DDPMs have been focused on the exponential Euler discretization, showing guarantees that generally depend at least linearly on the dimension or initial Fisher information. Inspired by works in log-concave sampling (Shen and Lee, 2019), we analyze an integrator – the denoising diffusion randomized midpoint method (DDRaM) – that leverages an additional randomized midpoint to better approximate the SDE. Using a recently-developed analytic framework called the “shifted composition rule”, we show that this algorithm enjoys favorable discretization properties under appropriate smoothness assumptions, with sublinear $\widetilde{O}(\sqrt{d})$ score evaluations needed to ensure convergence. This is the first sublinear complexity bound for pure DDPM sampling – prior works which obtained such bounds worked instead with ODE-based sampling and had to make modifications to the sampler which deviate from how they are used in practice. We also provide experimental validation of the advantages of our method, showing that it performs well in practice with pre-trained image synthesis models.
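
The randomized midpoint idea can be sketched on a much simpler problem than DDPM sampling. The example below applies it to a one-dimensional Langevin diffusion whose target is a standard Gaussian, in the spirit of the log-concave sampling work the abstract cites. It is not DDRaM and does not use a score network; the potential, step size, and chain counts are assumptions chosen for illustration.

```python
import math
import random

def grad_potential(x):
    """Gradient of U(x) = x^2 / 2, so the target is a standard Gaussian."""
    return x

def randomized_midpoint_step(x, h, rng):
    """One randomized midpoint step for dX = -U'(X) dt + sqrt(2) dW."""
    alpha = rng.random()  # random fraction of the step, uniform on [0, 1)
    # Brownian increments over [0, alpha*h] and [alpha*h, h] are independent.
    w_mid = rng.gauss(0.0, math.sqrt(alpha * h))
    w_rest = rng.gauss(0.0, math.sqrt((1.0 - alpha) * h))
    # Euler predictor to the random midpoint, sharing the same Brownian path.
    x_mid = x - alpha * h * grad_potential(x) + math.sqrt(2.0) * w_mid
    # Full step using the drift evaluated at the midpoint.
    return x - h * grad_potential(x_mid) + math.sqrt(2.0) * (w_mid + w_rest)

def sample(n_chains=2000, n_steps=200, h=0.1, seed=0):
    """Run independent chains from x = 0 and return their final states."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_chains):
        x = 0.0
        for _ in range(n_steps):
            x = randomized_midpoint_step(x, h, rng)
        out.append(x)
    return out

samples = sample()
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Each step costs two gradient evaluations instead of one; the random evaluation point is what lets the drift integral be estimated without bias, which is the property such analyses exploit.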

[223] Investigating U.S. Consumer Demand for Food Products with Innovative Transportation Certificates Based on Stated Preferences and Machine Learning Approaches

Jingchen Bi, Rodrigo Mesa-Arango

Main category: cs.LG

TL;DR: Machine learning model used to analyze consumer preferences for food products with innovative transportation certificates in the U.S., identifying safety and energy certificates as most valued.

Details

Motivation: To understand consumer behavior regarding food products with transportation certificates, building on previous research that identified transportation factors as significant in food purchasing choices.

Method: Applied machine learning model to analyze stated preference data from experiments that tested five transportation certificates (Transportation Mode, IoT, Safety, Energy Source, MABDs) along with product-specific and decision-maker control factors.

Result: Consumers showed strong preference for safety and energy certificates in transportation; study also revealed how price, product type, certificates, and decision-maker factors influence purchasing choices.

Conclusion: Provides data-driven recommendations for improving food supply chain systems based on consumer preferences for transportation certificates.

Abstract: This paper utilizes a machine learning model to estimate the consumer’s behavior for food products with innovative transportation certificates in the U.S. Building on previous research that examined demand for food products with supply chain traceability using stated preference analysis, transportation factors were identified as significant in consumer food purchasing choices. Consequently, a second experiment was conducted to pinpoint the specific transportation attributes valued by consumers. A machine learning model was applied, and five innovative certificates related to transportation were proposed: Transportation Mode, Internet of Things (IoT), Safety measures, Energy Source, and Must Arrive By Dates (MABDs). The preference experiment also incorporated product-specific and decision-maker factors for control purposes. The findings reveal a notable inclination toward safety and energy certificates within the transportation domain of the U.S. food supply chain. Additionally, the study examined the influence of price, product type, certificates, and decision-maker factors on purchasing choices. Ultimately, the study offers data-driven recommendations for improving food supply chain systems.

[224] You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei

Main category: cs.LG

TL;DR: Label-free RL for reasoning enhancement depends heavily on base model capability: across the 0.5B-7B models studied, weaker models often degrade below baseline. The proposed method, which combines curriculum learning with progressive difficulty, masking of no-majority rollouts, and data curation, improves performance across all model sizes.

Details

Motivation: To investigate the generalizability of unsupervised RL methods to smaller base models with limited reasoning capabilities, as previous work focused on large models.

Method: Curriculum learning with progressive difficulty introduction, masking no-majority rollouts during training, and a data curation pipeline for samples with predefined difficulty.

Result: Label-free RL performance degrades below baseline for weaker models due to insufficient chain-of-thought reasoning. Proposed method shows consistent improvements across all model sizes (0.5B-7B).

Conclusion: Curriculum-based label-free RL provides a path toward robust unsupervised RL that can bootstrap reasoning in resource-constrained models.

Abstract: Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model’s pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that uses curriculum learning to progressively introduce harder problems and masks no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa
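
A minimal sketch of the two training-time ideas named above, majority-vote pseudo-labels with no-majority masking and an easy-to-hard curriculum. The 0.5 majority threshold, the 0/1 reward, and the `difficulty` field are assumptions made for illustration; the authors' actual implementation is in their repository.

```python
from collections import Counter

def majority_pseudo_label(answers, min_share=0.5):
    """Return (label, share); label is None when no answer exceeds min_share."""
    counts = Counter(answers)
    label, top = counts.most_common(1)[0]
    share = top / len(answers)
    return (label if share > min_share else None), share

def rollout_rewards(answers, min_share=0.5):
    """Reward 1.0 for rollouts matching the majority answer; None masks the prompt."""
    label, _ = majority_pseudo_label(answers, min_share)
    if label is None:
        return None  # no majority: drop this prompt from the policy update
    return [1.0 if a == label else 0.0 for a in answers]

def curriculum(problems, n_stages):
    """Split problems (dicts with a 'difficulty' key) into easy-to-hard stages."""
    ordered = sorted(problems, key=lambda p: p["difficulty"])
    size = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

rewards_ok = rollout_rewards(["4", "4", "5", "4"])
rewards_masked = rollout_rewards(["1", "2", "3", "4"])
stages = curriculum([{"id": i, "difficulty": d} for i, d in enumerate([3, 1, 5, 2, 4])], 2)
```

Masking prompts without a majority avoids rewarding rollouts against a pseudo-label that the model itself does not agree on.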

[225] Grounded Test-Time Adaptation for LLM Agents

Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong

Main category: cs.LG

TL;DR: The paper proposes two complementary strategies for adapting LLM agents to novel environments: online distributional adaptation for syntactic alignment and deployment-time dynamics grounding for semantic understanding of state transitions.

Details

Motivation: LLM-based agents struggle to generalize to novel environments due to mismatches between pre-training and test-time conditions, particularly with syntactic misunderstandings of observation formats and semantic misunderstandings of state-transition dynamics.

Method: Two strategies: 1) Online distributional adaptation learns lightweight adaptation vectors to bias model outputs for environment response format alignment; 2) Deployment-time dynamics grounding uses persona-driven exploration to systematically probe and learn environment causal dynamics before task execution.

Result: Both strategies show effectiveness across diverse benchmarks with minimal computational cost. Dynamics grounding is particularly effective in complex environments, increasing success rate from 2% to 23% on WebArena multi-site split.

Conclusion: The proposed methods provide a robust path toward more generalizable and capable LLM-based agents, with dynamics grounding especially valuable for handling unpredictable dynamics in complex environments.

Abstract: Large language model (LLM)-based agents struggle to generalize to novel and complex environments, such as unseen websites or new sets of functions, due to a fundamental mismatch between their pre-training and test-time conditions. This challenge stems from two distinct failure modes: a syntactic misunderstanding of environment-specific components like observation formats, and a semantic misunderstanding of state-transition dynamics, which are only revealed at test time. To address these issues, we propose two distinct and complementary strategies for adapting LLM agents by leveraging environment-specific information available during deployment. First, an online distributional adaptation method parameterizes environmental nuances by learning a lightweight adaptation vector that biases the model’s output distribution, enabling rapid alignment with an environment response format. Second, a deployment-time dynamics grounding method employs a persona-driven exploration phase to systematically probe and learn the environment’s causal dynamics before task execution, equipping the agent with a nonparametric world model. We evaluate these strategies across diverse agentic benchmarks, including function calling and web navigation. Our empirical results show the effectiveness of both strategies across all benchmarks with minimal computational cost. We find that dynamics grounding is particularly effective in complex environments where unpredictable dynamics pose a major obstacle, demonstrating a robust path toward more generalizable and capable LLM-based agents. For example, on the WebArena multi-site split, this method increases the agent’s success rate from 2% to 23%.
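
One way to picture an adaptation vector that biases the output distribution is as an additive offset on frozen logits, updated online from observed tokens. The sketch below makes exactly that assumption; the additive form, the learning rate, and the function names are hypothetical and may differ from the paper's method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adapted_probs(logits, bias):
    """Output distribution after adding the adaptation vector to frozen logits."""
    return softmax([z + b for z, b in zip(logits, bias)])

def update_bias(logits, bias, observed, lr=0.5):
    """One gradient ascent step on log p(observed) with respect to the bias only."""
    probs = adapted_probs(logits, bias)
    return [b + lr * ((1.0 if i == observed else 0.0) - p)
            for i, (b, p) in enumerate(zip(bias, probs))]

logits = [2.0, 0.5, 0.1]   # frozen model prefers token 0
bias = [0.0, 0.0, 0.0]
p_before = adapted_probs(logits, bias)
for _ in range(20):        # the environment keeps emitting token 1
    bias = update_bias(logits, bias, observed=1)
p_after = adapted_probs(logits, bias)
```

Only the bias vector changes, so the cost per update is linear in the vocabulary size and the base model stays untouched.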

[226] A Dual Perspective on Decision-Focused Learning: Scalable Training via Dual-Guided Surrogates

Paula Rodriguez-Diaz, Kirk Bansak, Elisabeth Paulson

Main category: cs.LG

TL;DR: Dual-Guided Loss (DGL) is a scalable decision-focused learning method that uses dual variables from optimization problems to guide model training, reducing solver dependence while maintaining decision alignment.

Details

Motivation: Current decision-focused learning methods either differentiate through solvers or use task-specific surrogates, both requiring frequent expensive optimizer calls, making scaling challenging.

Method: Leverages dual variables from downstream combinatorial selection problems to shape learning, using dual-adjusted targets with simple differentiable surrogate losses and solving the optimization problem only periodically.

Result: DGL matches or exceeds state-of-the-art DFL methods while using far fewer solver calls and substantially less training time, with proven asymptotically diminishing decision regret.

Conclusion: DGL provides a scalable approach to decision-focused learning that drives training cost toward standard supervised learning while retaining strong decision alignment.

Abstract: Many real-world decisions are made under uncertainty by solving optimization problems using predicted quantities. This predict-then-optimize paradigm has motivated decision-focused learning, which trains models with awareness of how the optimizer uses predictions, improving the performance of downstream decisions. Despite its promise, scaling is challenging: state-of-the-art methods either differentiate through a solver or rely on task-specific surrogates, both of which require frequent and expensive calls to an optimizer, often a combinatorial one. In this paper, we leverage dual variables from the downstream problem to shape learning and introduce Dual-Guided Loss (DGL), a simple, scalable objective that preserves decision alignment while reducing solver dependence. We construct DGL specifically for combinatorial selection problems with natural one-of-many constraints, such as matching, knapsack, and shortest path. Our approach (a) decouples optimization from gradient updates by solving the downstream problem only periodically; (b) between refreshes, trains on dual-adjusted targets using simple differentiable surrogate losses; and (c) as refreshes become less frequent, drives training cost toward standard supervised learning while retaining strong decision alignment. We prove that DGL has asymptotically diminishing decision regret, analyze runtime complexity, and show on two problem classes that DGL matches or exceeds state-of-the-art DFL methods while using far fewer solver calls and substantially less training time. Code is available at https://github.com/paularodr/Dual-Guided-Learning.

[227] SigmaDock: Untwisting Molecular Docking With Fragment-Based SE(3) Diffusion

Alvaro Prat, Leo Zhang, Charlotte M. Deane, Yee Whye Teh, Garrett M. Morris

Main category: cs.LG

TL;DR: SigmaDock is a novel SE(3) Riemannian diffusion model that uses a fragmentation scheme to decompose ligands into rigid-body fragments, achieving state-of-the-art performance in molecular docking with 79.9% Top-1 success rate.

Details

Motivation: To address limitations of generative approaches in molecular docking, including chemically implausible outputs, poor generalizability, and high computational cost, by leveraging structural chemistry principles.

Method: Introduces a fragmentation scheme to decompose ligands into rigid-body fragments, then uses an SE(3) Riemannian diffusion model to reassemble these fragments within binding pockets, exploiting geometric priors while avoiding complex diffusion processes.

Result: Achieves 79.9% Top-1 success rate (RMSD<2 & PB-valid) on PoseBusters set, significantly outperforming recent deep learning approaches (12.7-30.8%) and becoming the first deep learning method to surpass classical physics-based docking under PB train-test split.

Conclusion: SigmaDock represents a significant leap forward in deep learning for molecular modeling, demonstrating reliable generalization to unseen proteins and establishing new state-of-the-art performance in molecular docking.

Abstract: Determining the binding pose of a ligand to a protein, known as molecular docking, is a fundamental task in drug discovery. Generative approaches promise faster, improved, and more diverse pose sampling than physics-based methods, but are often hindered by chemically implausible outputs, poor generalisability, and high computational cost. To address these challenges, we introduce a novel fragmentation scheme, leveraging inductive biases from structural chemistry, to decompose ligands into rigid-body fragments. Building on this decomposition, we present SigmaDock, an SE(3) Riemannian diffusion model that generates poses by learning to reassemble these rigid bodies within the binding pocket. By operating at the level of fragments in SE(3), SigmaDock exploits well-established geometric priors while avoiding overly complex diffusion processes and unstable training dynamics. Experimentally, we show SigmaDock achieves state-of-the-art performance, reaching Top-1 success rates (RMSD<2 & PB-valid) above 79.9% on the PoseBusters set, compared to 12.7-30.8% reported by recent deep learning approaches, whilst demonstrating consistent generalisation to unseen proteins. SigmaDock is the first deep learning approach to surpass classical physics-based docking under the PB train-test split, marking a significant leap forward in the reliability and feasibility of deep learning for molecular modelling.

[228] Quantum Boltzmann Machines for Sample-Efficient Reinforcement Learning

Thore Gerlach, Michael Schenk, Verena Kain

Main category: cs.LG

TL;DR: Continuous Semi-Quantum Boltzmann Machines (CSQBMs) enable continuous-action reinforcement learning by combining classical visible units with quantum hidden units, reducing qubit needs while maintaining expressiveness and enabling analytical gradient computation.

Details

Motivation: To overcome instability issues in continuous control reinforcement learning and reduce qubit requirements while maintaining strong model expressiveness for quantum-enhanced machine learning.

Method: Combine exponential-family priors over visible units with quantum Boltzmann distributions over hidden units to create hybrid quantum-classical models, and propose a continuous Q-learning framework that replaces global maximization with efficient sampling from CSQBM distributions.

Result: CSQBMs support continuous-action reinforcement learning, enable analytical gradient computation for continuous variables, and can be directly integrated into Actor-Critic algorithms.

Conclusion: The proposed CSQBMs provide a theoretically grounded approach for continuous control reinforcement learning that reduces quantum resource requirements while overcoming instability issues through efficient sampling-based optimization.

Abstract: We introduce theoretically grounded Continuous Semi-Quantum Boltzmann Machines (CSQBMs) that support continuous-action reinforcement learning. By combining exponential-family priors over visible units with quantum Boltzmann distributions over hidden units, CSQBMs yield a hybrid quantum-classical model that reduces qubit requirements while retaining strong expressiveness. Crucially, gradients with respect to continuous variables can be computed analytically, enabling direct integration into Actor-Critic algorithms. Building on this, we propose a continuous Q-learning framework that replaces global maximization by efficient sampling from the CSQBM distribution, thereby overcoming instability issues in continuous control.

[229] FoodRL: A Reinforcement Learning Ensembling Framework For In-Kind Food Donation Forecasting

Esha Sharma, Lauren Davis, Julie Ivy, Min Chi

Main category: cs.LG

TL;DR: FoodRL is a reinforcement learning-based metalearning framework that clusters and dynamically weights forecasting models to improve accuracy for food bank donation predictions, especially during disruptions like hurricanes and wildfires.

Details

Motivation: Food banks need accurate forecasting of volatile in-kind donations for equitable resource distribution, but traditional models fail due to unpredictable fluctuations and concept drift from seasonal variations and natural disasters.

Method: Proposed FoodRL - a reinforcement learning-based metalearning framework that clusters and dynamically weights diverse forecasting models based on recent performance and contextual information.

Result: FoodRL consistently outperforms baseline methods, particularly during disruption periods, and can facilitate redistribution equivalent to 1.7 million additional meals annually.

Conclusion: FoodRL demonstrates significant potential for social impact and adaptive ensemble learning for humanitarian supply chains by providing more reliable and adaptive forecasts.

Abstract: Food banks are crucial for alleviating food insecurity, but their effectiveness hinges on accurately forecasting highly volatile in-kind donations to ensure equitable and efficient resource distribution. Traditional forecasting models often fail to maintain consistent accuracy due to unpredictable fluctuations and concept drift driven by seasonal variations and natural disasters such as hurricanes in the Southeastern U.S. and wildfires on the West Coast. To address these challenges, we propose FoodRL, a novel reinforcement learning (RL) based metalearning framework that clusters and dynamically weights diverse forecasting models based on recent performance and contextual information. Evaluated on multi-year data from two structurally distinct U.S. food banks (one large regional West Coast food bank affected by wildfires and one state-level East Coast food bank consistently impacted by hurricanes), FoodRL consistently outperforms baseline methods, particularly during periods of disruption or decline. By delivering more reliable and adaptive forecasts, FoodRL can facilitate the redistribution of food equivalent to 1.7 million additional meals annually, demonstrating its significant potential for social impact as well as adaptive ensemble learning for humanitarian supply chains.

[230] BiPETE: A Bi-Positional Embedding Transformer Encoder for Risk Assessment of Alcohol and Substance Use Disorder with Electronic Health Records

Daniel S. Lee, Mayra S. Haedo-Cruz, Chen Jiang, Oshin Miranda, LiRong Wang

Main category: cs.LG

TL;DR: BiPETE model combines rotary and sinusoidal positional embeddings for EHR-based disease risk prediction, achieving significant performance improvements for alcohol and substance use disorders in mental health cohorts.

Details

Motivation: Transformer models show promise for EHR-based disease prediction but struggle with temporal dependencies due to irregular visit intervals and lack of uniform structure in healthcare data.

Method: Proposed Bi-Positional Embedding Transformer Encoder (BiPETE) that integrates rotary positional embeddings for relative visit timing and sinusoidal embeddings for visit order, trained on EHR data from depression and PTSD cohorts.

Result: BiPETE outperforms baselines, improving AUPRC by 34% in depression cohort and 50% in PTSD cohort. Integrated Gradients identified key clinical features including abnormal inflammatory, hematologic, metabolic markers, medications, and comorbidities.

Conclusion: The study presents a practical and interpretable framework for EHR-based disease risk prediction that achieves strong performance and provides deeper understanding of risk assessment processes.

Abstract: Transformer-based deep learning models have shown promise for disease risk prediction using electronic health records (EHRs), but modeling temporal dependencies remains a key challenge due to irregular visit intervals and lack of uniform structure. We propose a Bi-Positional Embedding Transformer Encoder, or BiPETE, for single-disease prediction, which integrates rotary positional embeddings to encode relative visit timing and sinusoidal embeddings to preserve visit order. Without relying on large-scale pretraining, BiPETE is trained on EHR data from two mental health cohorts, depressive disorder and post-traumatic stress disorder (PTSD), to predict the risk of alcohol and substance use disorders (ASUD). BiPETE outperforms baseline models, improving the area under the precision-recall curve (AUPRC) by 34% and 50% in the depression and PTSD cohorts, respectively. An ablation study further confirms the effectiveness of the dual positional encoding strategy. We apply the Integrated Gradients method to interpret model predictions, identifying key clinical features associated with ASUD risk and protection, such as abnormal inflammatory, hematologic, and metabolic markers, as well as specific medications and comorbidities. Overall, these key clinical features identified by the attribution methods contribute to a deeper understanding of the risk assessment process and offer valuable clues for mitigating potential risks. In summary, our study presents a practical and interpretable framework for disease risk prediction using EHR data, which can achieve strong performance.
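
The two positional encodings named above can be sketched in a few lines. The sinusoidal function follows the standard Transformer formula for an integer visit index, and the rotary function rotates feature pairs by an angle proportional to a timestamp, so dot products depend only on the time gap. How BiPETE combines the two, and the time unit it uses, are not stated in this summary; the base constant and function names here are assumptions.

```python
import math

def sinusoidal_embedding(position, dim, base=10000.0):
    """Transformer-style sinusoidal embedding for an integer visit index."""
    emb = []
    for i in range(0, dim, 2):
        angle = position / (base ** (i / dim))
        emb.extend([math.sin(angle), math.cos(angle)])
    return emb[:dim]

def rotary_rotate(vec, t, base=10000.0):
    """Rotate consecutive pairs of vec (even length) by angles proportional to t."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = t / (base ** (i / dim))
        x, y = vec[i], vec[i + 1]
        out.extend([x * math.cos(theta) - y * math.sin(theta),
                    x * math.sin(theta) + y * math.cos(theta)])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
# Two visit pairs with the same 7-unit gap but different absolute times.
score_early = dot(rotary_rotate(q, 3.0), rotary_rotate(k, 10.0))
score_late = dot(rotary_rotate(q, 33.0), rotary_rotate(k, 40.0))
```

The two scores agree because rotating both vectors by the same extra angle leaves their dot product unchanged, which is why rotary embeddings suit irregular visit intervals.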

[231] Self-Interest and Systemic Benefits: Emergence of Collective Rationality in Mixed Autonomy Traffic Through Deep Reinforcement Learning

Di Chen, Jia Li, Michael Zhang

Main category: cs.LG

TL;DR: Self-interested autonomous vehicles can achieve collective rationality (cooperative behavior) in mixed autonomy traffic through deep reinforcement learning, benefiting all agents without explicit system-level objectives.

Details

Motivation: To understand if self-interested AVs can benefit all driving agents in mixed autonomy traffic, exploring whether collective cooperation emerges naturally from individual interests.

Method: Used deep reinforcement learning with simple reward design to train self-interested traffic agents, examining collective rationality emergence in various scenarios.

Result: Collective rationality consistently emerged across different scenarios, demonstrating robustness. A mechanism explaining CR emergence in microscopic dynamic environments was verified through simulation evidence.

Conclusion: Advanced learning methods like federated learning could potentially achieve collective cooperation among self-interested driving agents in mixed-autonomy systems.

Abstract: Autonomous vehicles (AVs) are expected to be commercially available in the near future, leading to mixed autonomy traffic consisting of both AVs and human-driven vehicles (HVs). Although numerous studies have shown that AVs can be deployed to benefit the overall traffic system performance by incorporating system-level goals into their decision making, it is not clear whether the benefits still exist when agents act out of self-interest – a trait common to all driving agents, both human and autonomous. This study aims to understand whether self-interested AVs can bring benefits to all driving agents in mixed autonomy traffic systems. The research is centered on the concept of collective rationality (CR). This concept, originating from game theory and behavioral economics, means that driving agents may cooperate collectively even when pursuing individual interests. Our recent research has proven the existence of CR in an analytical game-theoretical model and empirically in mixed human-driven traffic. In this paper, we demonstrate that CR can be attained among driving agents trained using deep reinforcement learning (DRL) with a simple reward design. We examine the extent to which self-interested traffic agents can achieve CR without directly incorporating system-level objectives. Results show that CR consistently emerges in various scenarios, which indicates the robustness of this property. We also postulate a mechanism to explain the emergence of CR in the microscopic and dynamic environment and verify it based on simulation evidence. This research suggests the possibility of leveraging advanced learning methods (such as federated learning) to achieve collective cooperation among self-interested driving agents in mixed-autonomy systems.

[232] Efficient Swap Multicalibration of Elicitable Properties

Lunjia Hu, Haipeng Luo, Spandan Senapati, Vatsal Sharan

Main category: cs.LG

TL;DR: This paper improves multicalibration algorithms for elicitable properties, introducing swap multicalibration and achieving better error bounds through oracle-efficient methods.

Details

Motivation: To address the inefficiency of previous multicalibration algorithms and improve error bounds, particularly resolving an open question about achieving optimal error rates with oracle-efficient methods.

Method: Generalizes multicalibration to arbitrary bounded hypothesis classes, introduces swap multicalibration, and proposes an oracle-efficient algorithm using online agnostic learning with bounded sequential Rademacher complexity.

Result: Achieves T^(1/(r+1)) ℓ_r-swap multicalibration error for r≥2, with T^(1/3) ℓ_2-swap multicalibration as a special case, significantly improving previous bounds.

Conclusion: The paper provides oracle-efficient algorithms that achieve substantially improved multicalibration error bounds, completely resolving an important open question in the field.

Abstract: Multicalibration [HJKRR18] is an algorithmic fairness perspective that demands that the predictions of a predictor are correct conditional on themselves and membership in a collection of potentially overlapping subgroups of a population. The work of [NR23] established a surprising connection between multicalibration for an arbitrary property $\Gamma$ (e.g., mean or median) and property elicitation: a property $\Gamma$ can be multicalibrated if and only if it is elicitable, where elicitability is the notion that the true property value of a distribution can be obtained by solving a regression problem over the distribution. In the online setting, [NR23] proposed an inefficient algorithm that achieves $\sqrt T$ $\ell_2$-multicalibration error for a hypothesis class of group membership functions and an elicitable property $\Gamma$, after $T$ rounds of interaction between a forecaster and adversary. In this paper, we generalize multicalibration for an elicitable property $\Gamma$ from group membership functions to arbitrary bounded hypothesis classes and introduce a stronger notion – swap multicalibration, following [GKR23]. Subsequently, we propose an oracle-efficient algorithm which, when given access to an online agnostic learner, achieves $T^{1/(r+1)}$ $\ell_r$-swap multicalibration error with high probability (for $r\ge2$) for a hypothesis class with bounded sequential Rademacher complexity and an elicitable property $\Gamma$. For the special case of $r=2$, this implies an oracle-efficient algorithm that achieves $T^{1/3}$ $\ell_2$-swap multicalibration error, which significantly improves on the previously established bounds for the problem [NR23, GMS25, LSS25a], and completely resolves an open question raised in [GJRR24] on the possibility of an oracle-efficient algorithm that achieves $\sqrt{T}$ $\ell_2$-mean multicalibration error by answering it in a strongly affirmative sense.
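
The special case quoted above follows from substituting $r=2$ into the general exponent. The comparison with the $\sqrt{T}$ target of the open question is plain exponent arithmetic, written out here only as a reading aid:

```latex
\[
  T^{\frac{1}{r+1}} \Big|_{r=2} = T^{1/3},
  \qquad
  T^{1/3} \le T^{1/2} \quad \text{for all } T \ge 1 .
\]
```

So a bound of order $T^{1/3}$ is at least as strong as one of order $\sqrt{T}$, which is the sense in which the abstract calls the answer strongly affirmative.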

[233] Multi-agent Coordination via Flow Matching

Dongsu Lee, Daehee Lee, Amy Zhang

Main category: cs.LG

TL;DR: MAC-Flow is a framework that balances rich multi-agent coordination representation with fast execution by learning flow-based joint behaviors and distilling them into decentralized one-step policies.

Details

Motivation: Existing approaches sacrifice either representation quality (diffusion methods are slow) or efficiency (Gaussian policies are brittle), creating a trade-off between coordination complexity and computational speed.

Method: First learns a flow-based representation of joint behaviors from offline data, then distills it into decentralized one-step policies that preserve coordination while enabling fast execution.

Result: Achieves ~14.5x faster inference than diffusion-based methods while maintaining good performance, with inference speed similar to prior Gaussian policy-based offline MARL methods across 12 environments and 34 datasets.

Conclusion: MAC-Flow successfully addresses the performance-computation trade-off in multi-agent coordination, providing both expressive coordination modeling and efficient real-time execution.

Abstract: This work presents MAC-Flow, a simple yet expressive framework for multi-agent coordination. We argue that requirements of effective coordination are twofold: (i) a rich representation of the diverse joint behaviors present in offline data and (ii) the ability to act efficiently in real time. However, prior approaches often sacrifice one for the other, i.e., denoising diffusion-based solutions capture complex coordination but are computationally slow, while Gaussian policy-based solutions are fast but brittle in handling multi-agent interaction. MAC-Flow addresses this trade-off by first learning a flow-based representation of joint behaviors, and then distilling it into decentralized one-step policies that preserve coordination while enabling fast execution. Across four different benchmarks, including 12 environments and 34 datasets, MAC-Flow alleviates the trade-off between performance and computational cost, specifically achieving about $14.5\times$ faster inference compared to diffusion-based MARL methods, while maintaining good performance. At the same time, its inference speed is similar to that of prior Gaussian policy-based offline multi-agent reinforcement learning (MARL) methods.

[234] Machine Learning Algorithms in Statistical Modelling Bridging Theory and Application

A. Ganapathi Rao, Sathish Krishna Anumula, Aditya Kumar Singh, Renukhadevi M, Y. Jeevan Nagendra Kumar, Tammineni Rama Tulasi

Main category: cs.LG

TL;DR: This paper explores novel integrations between machine learning algorithms and traditional statistical modeling, demonstrating how ML enriches conventional models to improve performance, scalability, flexibility, and robustness.

Details

Motivation: To understand how modern ML algorithms can enhance traditional statistical models and transform data analysis, predictive analytics, and decision-making processes.

Method: Studying connections between ML and statistical models, demonstrating how new algorithms enrich conventional models through hybrid approaches.

Result: Hybrid models show significant improvements in predictive accuracy, robustness, and interpretability compared to traditional statistical models alone.

Conclusion: Integrating ML with traditional statistical modeling creates powerful hybrid approaches that substantially enhance model performance across multiple dimensions including accuracy, robustness, and interpretability.

Abstract: Novel ways of integrating ML algorithms with traditional statistical modelling have changed the way we analyze data, perform predictive analytics, and make decisions in data-related fields. In this paper, we study connections between ML and statistical models to understand how modern ML algorithms help ‘enrich’ conventional models; we demonstrate how new algorithms improve the performance, scalability, flexibility, and robustness of traditional models. The results show that hybrid models offer substantial improvements in predictive accuracy, robustness, and interpretability.

[235] Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen, Soumyadeep Pal, Sijia Liu, Mingyi Hong

Main category: cs.LG

TL;DR: Most existing LLM unlearning methods fail to achieve true forgetting - sensitive information reliably resurfaces under probabilistic decoding despite appearing forgotten under deterministic decoding.

Details

Motivation: Unlearning is critical for regulatory compliance and ethical AI systems to avoid producing private, toxic, illegal, or copyrighted content, but current methods are unreliable.

Method: Introduces leak@k metric to quantify forgotten knowledge reappearing when generating k samples under realistic decoding strategies, and conducts systematic study across TOFU, MUSE, and WMDP benchmarks.

Result: Knowledge leakage persists across methods and tasks - current state-of-the-art unlearning techniques provide only limited forgetting.

Conclusion: Urgent need for more robust approaches to LLM unlearning as existing methods fail to achieve true forgetting in practice.

Abstract: Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that almost all existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these ‘unlearned’ models under deterministic (greedy) decoding often suggest successful knowledge removal using standard benchmarks (as has been done in the literature), we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce leak@$k$, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating $k$ samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using our newly defined leak@$k$ metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art unlearning techniques provide only limited forgetting and highlighting the urgent need for more robust approaches to LLM unlearning.
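
The abstract does not spell out the formula for leak@$k$, so the following is only a sketch of one plausible reading, borrowed from the pass@k estimator used in code-generation evaluation: draw n samples for a forget-set prompt, judge each as leaking or not, and estimate the probability that at least one of k samples leaks. The function name `leak_at_k`, the binary leak judgment, and the estimator itself are assumptions for illustration, not the paper's definition.

```python
from math import comb


def leak_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples leaks).

    n: total samples drawn for a prompt (hypothetical evaluation budget)
    c: how many of those n samples were judged to leak forgotten content
    k: number of samples an attacker or user is assumed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k clean samples exist, so any k-subset contains a leak.
        return 1.0
    # 1 - P(all k samples drawn without replacement are clean).
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under greedy decoding only one output per prompt is checked. Under sampling, even a small per-sample leak rate compounds as k grows, which is the effect the abstract describes.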

[236] OvA-LP: A Simple and Efficient Framework for Federated Learning on Non-IID Data

Dongjin Park, Hasung Yeo, Joon-Woo Lee

Main category: cs.LG

TL;DR: OvA-LP is a minimalist framework that suppresses local drift in federated fine-tuning by combining linear probing on frozen encoders with one-vs-all heads and a two-stage procedure, achieving 95.9% IID accuracy retention under extreme non-IID conditions.

Details

Motivation: Federated fine-tuning remains fragile under heterogeneous client distributions due to local drift, where client-level update divergences cause systematic bias and amplified variance in the global model. Existing methods that correct drift post hoc prove brittle under extreme non-IID conditions.

Method: OvA-LP combines linear probing on a frozen encoder with a one-vs-all head and a simple two-stage procedure. It preserves pretrained feature geometry and decouples logits to prevent mechanisms that amplify drift. Precomputing encoder features makes per-round cost nearly independent of encoder size.

Result: On CIFAR-100 with 100 clients, OvA-LP retains 95.9% of its IID accuracy across various partitions, while state-of-the-art baselines retain only 10.1% (PFPT) and 34.5% (FFT-MoE). It maintains resilience under both symmetric and asymmetric label noise.

Conclusion: OvA-LP provides a principled and efficient basis for robust federated fine-tuning under heterogeneity by suppressing drift at its source within the PEFT-based paradigm.

Abstract: Federated fine-tuning (FFT) adapts foundation models to decentralized data but remains fragile under heterogeneous client distributions due to local drift, i.e., client-level update divergences that induce systematic bias and amplified variance in the global model. Existing aggregation and personalization methods largely correct drift post hoc, which proves brittle under extreme non-IID conditions. We introduce OvA-LP, a minimalist framework that is, to our knowledge, the first explicitly designed to suppress drift at its source within the PEFT-based FFT paradigm. OvA-LP combines linear probing on a frozen encoder with a one-vs-all head and a simple two-stage procedure, preserving pretrained feature geometry and decoupling logits to prevent the mechanisms that amplify drift. On CIFAR-100 with 100 clients, averaged over shard-1, shard-2, and Bernoulli-Dirichlet partitions, OvA-LP retains 95.9% of its IID accuracy, whereas state-of-the-art FFT baselines retain only 10.1% (PFPT) and 34.5% (FFT-MoE) under the same conditions. OvA-LP further maintains resilience under both symmetric and asymmetric label noise. In addition, precomputing encoder features makes per-round cost nearly independent of encoder size. Together, these results demonstrate that OvA-LP provides a principled and efficient basis for robust FFT under heterogeneity.
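
To make the "decoupling logits" point concrete, here is a generic sketch (not the authors' code) comparing per-logit gradients of a softmax cross-entropy head and a one-vs-all sigmoid head. The helper names `softmax_grad` and `ova_grad` are hypothetical, and the two-stage procedure and frozen encoder from the abstract are not modeled.

```python
import math


def softmax_grad(logits, label):
    """Gradient of softmax cross-entropy w.r.t. each logit: p_j - y_j."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total - (1.0 if j == label else 0.0) for j, e in enumerate(exps)]


def ova_grad(logits, label):
    """Gradient of summed per-class binary cross-entropy: sigmoid(z_j) - y_j."""
    return [
        1.0 / (1.0 + math.exp(-z)) - (1.0 if j == label else 0.0)
        for j, z in enumerate(logits)
    ]


# Two logit vectors that differ only in the logit of class 0.
a = [2.0, 0.5, -1.0]
b = [5.0, 0.5, -1.0]

# Softmax: the gradient on class 1 changes because all logits share a normalizer.
softmax_shift = abs(softmax_grad(a, 0)[1] - softmax_grad(b, 0)[1])
# One-vs-all: the gradient on class 1 depends on its own logit only.
ova_shift = abs(ova_grad(a, 0)[1] - ova_grad(b, 0)[1])
```

With the one-vs-all head, a client that never sees class 0 cannot push the class-1 logit around through a shared normalizer, which is one intuition for why such a head may limit drift under label-skewed partitions.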

[237] Structural Properties, Cycloid Trajectories and Non-Asymptotic Guarantees of EM Algorithm for Mixed Linear Regression

Zhankun Luo, Abolfazl Hashemi

Main category: cs.LG

TL;DR: This paper analyzes EM algorithm for 2-component Mixed Linear Regression with unknown parameters, characterizing cycloid trajectories and proving non-asymptotic convergence guarantees across different SNR regimes.

Details

Motivation: Previous studies established EM convergence for 2MLR with known balanced weights, but theoretical behavior with fully unknown parameters remained unclear, especially regarding trajectory patterns and convergence order.

Method: Derived explicit EM update expressions for 2MLR with unknown mixing weights and regression parameters, analyzed structural properties and cycloid trajectories, established recurrence relations for sub-optimality angles, and sharpened statistical error bounds.

Result: Proved that regression parameters trace cycloid trajectories in noiseless case, quantified trajectory deviations in high SNR, revealed linear convergence when estimates are nearly orthogonal to ground truth and quadratic convergence when angles are small, established non-asymptotic convergence with arbitrary initialization.

Conclusion: The work provides a novel trajectory-based framework for analyzing EM in Mixed Linear Regression, establishing comprehensive convergence guarantees and characterizing the algorithm’s geometric behavior across different parameter regimes.

Abstract: This work investigates the structural properties, cycloid trajectories, and non-asymptotic convergence guarantees of the Expectation-Maximization (EM) algorithm for two-component Mixed Linear Regression (2MLR) with unknown mixing weights and regression parameters. Recent studies have established global convergence for 2MLR with known balanced weights and super-linear convergence in noiseless and high signal-to-noise ratio (SNR) regimes. However, the theoretical behavior of EM in the fully unknown setting remains unclear, with its trajectory and convergence order not yet fully characterized. We derive explicit EM update expressions for 2MLR with unknown mixing weights and regression parameters across all SNR regimes and analyze their structural properties and cycloid trajectories. In the noiseless case, we prove that the trajectory of the regression parameters in EM iterations traces a cycloid by establishing a recurrence relation for the sub-optimality angle, while in high SNR regimes we quantify its discrepancy from the cycloid trajectory. The trajectory-based analysis reveals the order of convergence: linear when the EM estimate is nearly orthogonal to the ground truth, and quadratic when the angle between the estimate and ground truth is small at the population level. Our analysis establishes non-asymptotic guarantees by sharpening bounds on statistical errors between finite-sample and population EM updates, relating EM’s statistical accuracy to the sub-optimality angle, and proving convergence with arbitrary initialization at the finite-sample level. This work provides a novel trajectory-based framework for analyzing EM in Mixed Linear Regression.
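
For readers who want to see the iteration being analyzed, below is a minimal NumPy sketch of the textbook EM update for symmetric 2MLR, y_i = z_i <x_i, theta> + noise with z_i in {+1, -1} and P(z_i = +1) = pi. It assumes a known noise level `sigma`, and the names `em_2mlr` and `angle_to_truth` are hypothetical. It is the standard update, not necessarily the exact expressions derived in the paper, and the angle helper is only a stand-in for the sub-optimality angle.

```python
import numpy as np


def em_2mlr(X, y, theta0, pi0=0.5, sigma=0.1, n_iter=100):
    """EM for y_i = z_i * <x_i, theta> + noise, z_i in {+1, -1}, P(z_i = +1) = pi."""
    theta = np.asarray(theta0, dtype=float)
    pi = float(pi0)
    gram = X.T @ X  # does not change across iterations
    for _ in range(n_iter):
        # E-step: posterior mean of the hidden sign z_i under current estimates.
        log_odds = np.log(pi / (1.0 - pi)) + 2.0 * y * (X @ theta) / sigma**2
        e_z = np.tanh(log_odds / 2.0)
        # M-step: least squares on sign-weighted targets, mean responsibility for pi.
        theta = np.linalg.solve(gram, X.T @ (e_z * y))
        pi = float(np.clip(np.mean((1.0 + e_z) / 2.0), 1e-6, 1.0 - 1e-6))
    return theta, pi


def angle_to_truth(theta, theta_star):
    """Angle in radians between theta and the line spanned by theta_star."""
    cos = abs(theta @ theta_star) / (
        np.linalg.norm(theta) * np.linalg.norm(theta_star)
    )
    return float(np.arccos(np.clip(cos, 0.0, 1.0)))
```

Because (theta, pi) and (-theta, 1 - pi) describe the same mixture, the estimate can only be compared with the ground truth up to this sign flip. Logging `angle_to_truth` per iteration is one way to observe the convergence behavior the abstract describes.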

[238] Risk Prediction of Cardiovascular Disease for Diabetic Patients with Machine Learning and Deep Learning Techniques

Esha Chowdhury

Main category: cs.LG

TL;DR: This study proposes machine learning and hybrid deep learning models for cardiovascular disease risk prediction in diabetic patients, achieving high accuracy (0.9050) with XGBoost and LSTM models.

Details

Motivation: Addressing the growing prevalence of diabetes and its strong link to heart disease, aiming to improve CVD risk prediction for diabetic patients to enhance clinical decision-making.

Method: Used BRFSS dataset with preprocessing (removing duplicates, handling missing values, feature identification, PCA). Implemented ML models (DT, RF, KNN, SVM, AdaBoost, XGBoost) and DL models (ANN, DNN, RNN, CNN, LSTM, BiLSTM, GRU) including hybrid CNN combinations.

Result: XGBoost and LSTM achieved highest accuracy of 0.9050. Some models achieved perfect recall (1.00). High accuracy and F1 scores demonstrated strong predictive performance.

Conclusion: ML and DL models are effective for CVD risk prediction in diabetic patients, showing potential to automate and enhance clinical decision-making, improve personalized risk management, and preventive strategies.

Abstract: Accurate prediction of cardiovascular disease (CVD) risk is crucial for healthcare institutions. This study addresses the growing prevalence of diabetes and its strong link to heart disease by proposing an efficient CVD risk prediction model for diabetic patients using machine learning (ML) and hybrid deep learning (DL) approaches. The BRFSS dataset was preprocessed by removing duplicates, handling missing values, identifying categorical and numerical features, and applying Principal Component Analysis (PCA) for feature extraction. Several ML models, including Decision Trees (DT), Random Forest (RF), k-Nearest Neighbors (KNN), Support Vector Machine (SVM), AdaBoost, and XGBoost, were implemented, with XGBoost achieving the highest accuracy of 0.9050. Various DL models, such as Artificial Neural Networks (ANN), Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and Gated Recurrent Unit (GRU), as well as hybrid models combining CNN with LSTM, BiLSTM, and GRU, were also explored. Some of these models achieved perfect recall (1.00), with the LSTM model achieving the highest accuracy of 0.9050. Our research highlights the effectiveness of ML and DL models in predicting CVD risk among diabetic patients, automating and enhancing clinical decision-making. High accuracy and F1 scores demonstrate these models’ potential to improve personalized risk management and preventive strategies.

[239] Less Is More: Generating Time Series with LLaMA-Style Autoregression in Simple Factorized Latent Spaces

Siyuan Li, Yifan Sun, Lei Cheng, Lewen Wang, Yang Liu, Weiqing Liu, Jianlong Li, Jiang Bian, Shikai Fang

Main category: cs.LG

TL;DR: FAR-TS is a fast autoregressive Transformer framework that generates multivariate time series by decomposing them into static basis and temporal coefficients, then modeling discrete tokens with a LLaMA-style Transformer for arbitrary-length generation.

Details

Motivation: Current diffusion-based time series generation methods are slow and limited to fixed-length windows, creating a need for faster, more flexible approaches that can handle variable-length sequences while preserving cross-channel correlations.

Method: Decompose time series into data-adaptive basis (static cross-channel correlations) and temporal coefficients, vector-quantize coefficients into discrete tokens, then use a LLaMA-style autoregressive Transformer to model token sequences for generation.

Result: Achieves orders-of-magnitude faster generation than Diffusion-TS while preserving cross-channel correlations and maintaining an interpretable latent space, enabling high-quality flexible time series synthesis.

Conclusion: FAR-TS provides a simple yet effective framework for fast, controllable generation of arbitrary-length multivariate time series with preserved correlations and interpretable representations.

Abstract: Generative models for multivariate time series are essential for data augmentation, simulation, and privacy preservation, yet current state-of-the-art diffusion-based approaches are slow and limited to fixed-length windows. We propose FAR-TS, a simple yet effective framework that combines disentangled factorization with an autoregressive Transformer over a discrete, quantized latent space to generate time series. Each time series is decomposed into a data-adaptive basis that captures static cross-channel correlations and temporal coefficients that are vector-quantized into discrete tokens. A LLaMA-style autoregressive Transformer then models these token sequences, enabling fast and controllable generation of sequences with arbitrary length. Owing to its streamlined design, FAR-TS achieves orders-of-magnitude faster generation than Diffusion-TS while preserving cross-channel correlations and an interpretable latent space, enabling high-quality and flexible time series synthesis.

[240] DL101 Neural Network Outputs and Loss Functions

Fernando Berzal

Main category: cs.LG

TL;DR: This report analyzes the statistical connection between neural network output layer activation functions and loss functions, showing that loss function choice is equivalent to assuming specific probability distributions for model outputs through Maximum Likelihood Estimation.

Details

Motivation: To provide a strong statistical justification for selecting appropriate loss functions in deep learning by connecting them to activation functions and Maximum Likelihood Estimation principles.

Method: Mathematical analysis of common activation functions (linear, sigmoid, ReLU, softmax) and their connection to loss functions (MSE, MAE, Cross-Entropy) through statistical principles and Generalized Linear Models framework.

Result: Demonstrates that choosing specific loss functions implicitly assumes particular probability distributions for model outputs, with MSE corresponding to Gaussian, MAE to Laplace, and Cross-Entropy to categorical/multinomial distributions.

Conclusion: There is a fundamental statistical justification linking activation functions and loss functions through Maximum Likelihood Estimation, providing guidance for appropriate loss function selection based on output layer design and desired probability distribution assumptions.

Abstract: The loss function used to train a neural network is strongly connected to its output layer from a statistical point of view. This technical report analyzes common activation functions for a neural network output layer, like linear, sigmoid, ReLU, and softmax, detailing their mathematical properties and their appropriate use cases. A strong statistical justification exists for the selection of the suitable loss function for training a deep learning model. This report connects common loss functions such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and various Cross-Entropy losses to the statistical principle of Maximum Likelihood Estimation (MLE). Choosing a specific loss function is equivalent to assuming a specific probability distribution for the model output, highlighting the link between these functions and the Generalized Linear Models (GLMs) that underlie network output layers. Additional scenarios of practical interest are also considered, such as alternative output encodings, constrained outputs, and distributions with heavy tails.
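
The loss-to-distribution correspondence can be checked numerically. The sketch below (function names are illustrative, not from the report) shows that the mean Gaussian negative log-likelihood with fixed variance equals MSE up to a scale and an additive constant, and likewise for the Laplace distribution and MAE, so minimizing one minimizes the other.

```python
import math


def mse(y, mu):
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y)


def mae(y, mu):
    return sum(abs(yi - mi) for yi, mi in zip(y, mu)) / len(y)


def gaussian_nll(y, mu, sigma=1.0):
    """Mean negative log-likelihood of y_i under N(mu_i, sigma^2)."""
    const = 0.5 * math.log(2.0 * math.pi * sigma**2)
    return sum(const + (yi - mi) ** 2 / (2.0 * sigma**2) for yi, mi in zip(y, mu)) / len(y)


def laplace_nll(y, mu, b=1.0):
    """Mean negative log-likelihood of y_i under Laplace(mu_i, b)."""
    const = math.log(2.0 * b)
    return sum(const + abs(yi - mi) / b for yi, mi in zip(y, mu)) / len(y)


targets = [1.0, 2.5, -0.3, 4.0]
preds = [0.8, 2.0, 0.1, 3.5]
sigma, b = 0.7, 1.3

# Gaussian NLL = MSE / (2 sigma^2) + 0.5 * log(2 pi sigma^2)
gauss_gap = gaussian_nll(targets, preds, sigma) - (
    mse(targets, preds) / (2.0 * sigma**2) + 0.5 * math.log(2.0 * math.pi * sigma**2)
)
# Laplace NLL = MAE / b + log(2 b)
laplace_gap = laplace_nll(targets, preds, b) - (mae(targets, preds) / b + math.log(2.0 * b))
```

Both gaps are zero up to floating-point error for any choice of `sigma` and `b`, since those parameters only rescale and shift the loss when they are held fixed.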

[241] Scaling Up ROC-Optimizing Support Vector Machines

Gimun Bae, Seung Jun Shin

Main category: cs.LG

TL;DR: A scalable variant of ROC-SVM that uses incomplete U-statistics and low-rank kernel approximation to cut the O(n²) training cost while maintaining comparable AUC performance.

Details

Motivation: The original ROC-SVM directly maximizes AUC but has high computational cost (O(n²)) that limits its practical use, especially with class imbalance problems.

Method: Developed a scalable ROC-SVM using incomplete U-statistics to reduce computational complexity, and extended to nonlinear classification via low-rank kernel approximation in reproducing kernel Hilbert spaces.

Result: Theoretical analysis provides error bounds justifying the approximation, and empirical results show comparable AUC performance to original ROC-SVM with drastically reduced training time on synthetic and real datasets.

Conclusion: The proposed method successfully overcomes the computational limitations of ROC-SVM while maintaining its AUC maximization benefits, making it practical for real-world applications with class imbalance.

Abstract: The ROC-SVM, originally proposed by Rakotomamonjy, directly maximizes the area under the ROC curve (AUC) and has become an attractive alternative to conventional binary classification in the presence of class imbalance. However, its practical use is limited by high computational cost, as training involves evaluating all $O(n^2)$ pairs. To overcome this limitation, we develop a scalable variant of the ROC-SVM that leverages incomplete U-statistics, thereby substantially reducing computational complexity. We further extend the framework to nonlinear classification through a low-rank kernel approximation, enabling efficient training in reproducing kernel Hilbert spaces. Theoretical analysis establishes an error bound that justifies the proposed approximation, and empirical results on both synthetic and real datasets demonstrate that the proposed method achieves comparable AUC performance to the original ROC-SVM with drastically reduced training time.
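
The idea behind an incomplete U-statistic can be illustrated on the AUC itself: the complete statistic averages a kernel over every positive-negative pair, while the incomplete version averages over a random subset of pairs. The sketch below is a generic illustration with hypothetical function names. The paper applies the idea to the ROC-SVM training objective, which is not reproduced here.

```python
import random


def pair_kernel(p, q):
    """1 if the positive outranks the negative, 0.5 on ties, 0 otherwise."""
    return (p > q) + 0.5 * (p == q)


def auc_complete(pos_scores, neg_scores):
    """Complete U-statistic: average over all len(pos) * len(neg) pairs."""
    total = sum(pair_kernel(p, q) for p in pos_scores for q in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))


def auc_incomplete(pos_scores, neg_scores, n_pairs, seed=0):
    """Incomplete U-statistic: average over n_pairs pairs sampled with replacement."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        total += pair_kernel(rng.choice(pos_scores), rng.choice(neg_scores))
    return total / n_pairs
```

The cost of `auc_incomplete` is governed by `n_pairs` rather than by the product of the class sizes, at the price of extra sampling variance. That is the trade-off the abstract's error bound is meant to control.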

[242] Unlocking the Black Box: A Five-Dimensional Framework for Evaluating Explainable AI in Credit Risk

Rongbin Ye, Jiaqi Chen

Main category: cs.LG

TL;DR: This paper bridges the gap between complex ML models and regulatory explainability requirements in finance by applying SHAP and LIME frameworks, and proposes a 5D evaluation framework for model explainability.

Details

Motivation: The financial industry needs to balance the predictive power of advanced ML models with the explainability required by regulators such as the OCC and CFPB, addressing the "black box" model challenge.

Method: Applied SHAP and LIME explainability frameworks to different ML models, and developed a novel five-dimensional framework evaluating Inherent Interpretability, Global Explanations, Local Explanations, Consistency, and Complexity.

Result: Demonstrated that complex models with better prediction power can achieve the same level of explainability as simpler models using SHAP and LIME techniques.

Conclusion: Sophisticated ML models can be feasibly employed in regulated financial environments using modern explainability techniques, with a structured approach to evaluate performance-interpretability trade-offs.

Abstract: The financial industry faces a significant challenge in modeling risk portfolios: balancing the predictive power of advanced machine learning and neural network models with the explainability required by regulatory entities (such as the Office of the Comptroller of the Currency and the Consumer Financial Protection Bureau). This paper intends to fill the gap in the application between these “black box” models and explainability frameworks, such as LIME and SHAP. The authors elaborate on the application of these frameworks to different models and demonstrate that more complex models with better predictive power can be applied and reach the same level of explainability using SHAP and LIME. Beyond the comparison and discussion of performance, this paper proposes a novel five-dimensional framework evaluating Inherent Interpretability, Global Explanations, Local Explanations, Consistency, and Complexity to offer a nuanced method for assessing and comparing model explainability beyond simple accuracy metrics. This research demonstrates the feasibility of employing sophisticated, high-performing ML models in regulated financial environments by utilizing modern explainability techniques and provides a structured approach to evaluate the crucial trade-offs between model performance and interpretability.

[243] Deep Progressive Training: scaling up depth capacity of zero/one-layer models

Zhiqi Bu

Main category: cs.LG

TL;DR: Progressive training scales model depth during training to reduce computation while maintaining performance. The paper proposes zero/one-layer progressive training for optimal compute-loss tradeoff, achieving 80% compute savings or 5x acceleration on GPT2 with minimal performance loss.

Details

Motivation: Deeper models achieve higher accuracy but require higher computational cost. Progressive training addresses this by scaling up model capacity during training to significantly reduce computation with little performance degradation.

Method: Studies depth expansion through optimization theory and feature learning, providing insights on new layer initialization, hyperparameter transfer, learning rate schedule, and expansion timing. Proposes zero/one-layer progressive training for optimal compute-loss tradeoff.

Result: Zero/one-layer progressive training on GPT2 saves ≈80% compute or accelerates ≈5× while achieving almost the same loss compared to a fully trained 60-layer model with 7B parameters.

Conclusion: Progressive training offers an effective strategy for efficient large model training, providing substantial computational savings with minimal performance impact through optimized depth expansion techniques.

Abstract: Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but require higher computational cost. To efficiently train models at scale, an effective strategy is progressive training, which scales up model capacity during training, hence significantly reducing computation with little to no performance degradation. In this work, we study the depth expansion of large models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training for the optimal tradeoff between computation and loss. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ compute, or equivalently accelerate $\approx 5\times$ while achieving almost the same loss, compared to a fully trained 60-layer model with 7B parameters.
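
The "80% compute savings, or equivalently 5x acceleration" statement follows from speedup = 1 / (1 - savings). The sketch below also includes a deliberately crude cost model, in which compute is proportional to depth times training steps. That model, and the names `speedup_from_savings` and `relative_compute`, are assumptions for illustration and ignore embeddings, expansion overhead, and anything else the paper may account for.

```python
def speedup_from_savings(savings: float) -> float:
    """Convert a fractional compute saving (0 <= savings < 1) into a speedup factor."""
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


def relative_compute(full_depth: int, small_depth: int, small_fraction: float) -> float:
    """Compute relative to training at full depth for every step.

    Assumes cost per step is proportional to the number of layers, and that
    the model trains at small_depth for small_fraction of the steps before
    being expanded to full_depth for the remainder.
    """
    if not 0.0 <= small_fraction <= 1.0:
        raise ValueError("small_fraction must be in [0, 1]")
    return small_fraction * small_depth / full_depth + (1.0 - small_fraction)
```

Under this toy model, the saving is bounded by the fraction of steps spent at the small depth, so a large saving requires expanding late in training. Whether that is how the reported numbers arise is not stated in the abstract.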

[244] Peptide2Mol: A Diffusion Model for Generating Small Molecules as Peptide Mimics for Targeted Protein Binding

Xinheng He, Yijia Zhang, Haowei Lin, Xingang Peng, Xiangzhe Kong, Mingyu Li, Jianzhu Ma

Main category: cs.LG

TL;DR: Peptide2Mol is an E(3)-equivariant graph neural network diffusion model that generates small molecules by referencing peptide binders and protein pocket environments, achieving state-of-the-art performance in generative tasks.

Details

Motivation: Most AI-driven drug design approaches neglect endogenous protein interactions with peptides, leading to suboptimal molecule designs.

Method: E(3)-equivariant graph neural network diffusion model trained on large datasets, using both peptide binders and protein pocket environments for molecule generation.

Result: Achieves state-of-the-art performance in non-autoregressive generative tasks, produces molecules similar to original peptide binders, and enables molecule optimization through partial diffusion.

Conclusion: Peptide2Mol is an effective deep generative model for generating and optimizing bioactive small molecules from protein binding pockets.

Abstract: Structure-based drug design has seen significant advancements with the integration of artificial intelligence (AI), particularly in the generation of hit and lead compounds. However, most AI-driven approaches neglect the importance of endogenous protein interactions with peptides, which may result in suboptimal molecule designs. In this work, we present Peptide2Mol, an E(3)-equivariant graph neural network diffusion model that generates small molecules by referencing both the original peptide binders and their surrounding protein pocket environments. Trained on large datasets and leveraging sophisticated modeling techniques, Peptide2Mol not only achieves state-of-the-art performance in non-autoregressive generative tasks, but also produces molecules with similarity to the original peptide binder. Additionally, the model allows for molecule optimization and peptidomimetic design through a partial diffusion process. Our results highlight Peptide2Mol as an effective deep generative model for generating and optimizing bioactive small molecules from protein binding pockets.

[245] No One-Model-Fits-All: Uncovering Spatio-Temporal Forecasting Trade-offs with Graph Neural Networks and Foundation Models

Ragini Gupta, Naman Raina, Bo Chen, Li Chen, Claudiu Danilov, Josh Eckhardt, Keyshla Bernard, Klara Nahrstedt

Main category: cs.LG

TL;DR: Systematic study of forecasting models under varying spatial sensor density and sampling intervals, showing STGNNs excel with sparse deployments while TSFMs perform best at high frequencies with Moirai leading overall.

Details

Motivation: Existing IoT filtering techniques overlook how sampling frequency and spatial coverage variations affect downstream forecasting model performance, particularly the interplay between these factors and different model architectures.

Method: Benchmarked classical models (VAR), neural networks (GRU, Transformer), spatio-temporal GNNs (STGNNs), and time series foundation models (Chronos, Moirai, TimesFM) using real-world temperature data from wireless sensor networks under varying spatial node density and sampling intervals.

Result: STGNNs are effective with sparse deployments and moderate sampling rates, leveraging spatial correlations via graph structure. TSFMs perform competitively at high frequencies but degrade with reduced spatial coverage. Moirai outperforms all models by learning cross-sensor dependencies natively.

Conclusion: Findings provide actionable insights for building efficient forecasting pipelines in spatio-temporal systems, with model choice depending on deployment density and sampling frequency constraints.

Abstract: Modern IoT deployments for environmental sensing produce high-volume spatiotemporal data to support downstream tasks such as forecasting, typically powered by machine learning models. While existing filtering and strategic deployment techniques optimize collected data volume at the edge, they overlook how variations in sampling frequencies and spatial coverage affect downstream model performance. In many forecasting models, incorporating data from additional sensors denoises predictions by providing broader spatial context. This interplay between sampling frequency, spatial coverage, and different forecasting model architectures remains underexplored. This work presents a systematic study of forecasting models - classical models (VAR), neural networks (GRU, Transformer), spatio-temporal graph neural networks (STGNNs), and time series foundation models (TSFMs: Chronos, Moirai, TimesFM) - under varying spatial sensor node density and sampling intervals using real-world temperature data in a wireless sensor network. Our results show that STGNNs are effective when sensor deployments are sparse and the sampling rate is moderate, leveraging spatial correlations via encoded graph structure to compensate for limited coverage. In contrast, TSFMs perform competitively at high frequencies but degrade when spatial coverage from neighboring sensors is reduced. Crucially, the multivariate TSFM Moirai outperforms all models by natively learning cross-sensor dependencies. These findings offer actionable insights for building efficient forecasting pipelines in spatio-temporal systems. All code for model configurations, training, dataset, and logs is open-sourced for reproducibility: https://github.com/UIUC-MONET-Projects/Benchmarking-Spatiotemporal-Forecast-Models

[246] Carbon Price Forecasting with Structural Breaks: A Comparative Study of Deep Learning Models

Runsheng Ren, Jing Li, Yanxiu Li, Shixun Huang, Jun Shen, Wanqing Li, John Le, Sheng Wang

Main category: cs.LG

TL;DR: Proposes a hybrid framework integrating structural break detection, wavelet denoising, and deep learning models (LSTM, GRU, TCN) for carbon price forecasting, achieving significant error reduction compared to existing methods.

Details

Motivation: Accurate carbon price forecasting is essential for energy markets and decarbonization strategies, but challenging due to structural breaks and noise from policy interventions. Existing methods treat denoising and modeling separately and lack systematic evaluation.

Method: Comprehensive hybrid framework combining structural break detection (Bai-Perron, ICSS, PELT), wavelet signal denoising, and three deep learning models (LSTM, GRU, TCN) using EUA spot prices and exogenous features.

Result: PELT-WT-TCN achieved highest accuracy: 22.35% RMSE and 18.63% MAE reduction vs state-of-the-art baseline, and 70.55% RMSE and 74.42% MAE reduction vs original LSTM without decomposition.

Conclusion: Integrating structural awareness and multiscale decomposition into deep learning enhances accuracy and interpretability for carbon price forecasting and nonstationary financial time series.

Abstract: Accurately forecasting carbon prices is essential for informed energy market decision-making, guiding sustainable energy planning, and supporting effective decarbonization strategies. However, it remains challenging due to structural breaks and high-frequency noise caused by frequent policy interventions and market shocks. Existing studies, including the most recent baseline approaches, have attempted to incorporate breakpoints but often treat denoising and modeling as separate processes and lack systematic evaluation across advanced deep learning architectures, limiting the robustness and the generalization capability. To address these gaps, this paper proposes a comprehensive hybrid framework that integrates structural break detection (Bai-Perron, ICSS, and PELT algorithms), wavelet signal denoising, and three state-of-the-art deep learning models (LSTM, GRU, and TCN). Using European Union Allowance (EUA) spot prices from 2007 to 2024 and exogenous features such as energy prices and policy indicators, the framework constructs univariate and multivariate datasets for comparative evaluation. Experimental results demonstrate that our proposed PELT-WT-TCN achieves the highest prediction accuracy, reducing forecasting errors by 22.35% in RMSE and 18.63% in MAE compared to the state-of-the-art baseline model (Breakpoints with Wavelet and LSTM), and by 70.55% in RMSE and 74.42% in MAE compared to the original LSTM without decomposition from the same baseline study. These findings underscore the value of integrating structural awareness and multiscale decomposition into deep learning architectures to enhance accuracy and interpretability in carbon price forecasting and other nonstationary financial time series.

[247] Using LLMs to Program Board Games and Variations (original title: Usando LLMs para Programar Jogos de Tabuleiro e Variações)

Álvaro Guglielmin Becker, Lana Bertoldo Rossato, Anderson Rocha Tavares

Main category: cs.LG

TL;DR: Testing LLMs’ capability to generate board game code and variants to expedite game development.

Details

Motivation: Creating board game programs is time-consuming, and LLMs offer potential to speed up this process through code generation from simple context.

Method: Proposed method to test three LLMs (Claude, DeepSeek, ChatGPT) on creating code for board games and generating new variants of existing games.

Result: Not specified in the abstract - requires full paper for evaluation results.

Conclusion: Not specified in the abstract - requires full paper for final conclusions.

Abstract: Creating programs to represent board games can be a time-consuming task. Large Language Models (LLMs) arise as appealing tools to expedite this process, given their capacity to efficiently generate code from simple contextual information. In this work, we propose a method to test how capable three LLMs (Claude, DeepSeek and ChatGPT) are at creating code for board games, as well as new variants of existing games.

[248] QuAnTS: Question Answering on Time Series

Felix Divo, Maurice Kraus, Anh Q. Nguyen, Hao Xue, Imran Razzak, Flora D. Salim, Kristian Kersting, Devendra Singh Dhami

Main category: cs.LG

TL;DR: Proposes QuAnTS, a novel time series question-answering dataset focused on human motion data, to bridge the gap in TSQA research and enable better interaction with time series models through text.

Details

Motivation: Text provides intuitive access to information that complements dense numerical time series, but most QA research focuses on vision and text while time series receives minimal attention.

Method: Created a comprehensive TSQA dataset (QuAnTS) with diverse questions and answers about human motion using tracked skeleton trajectories, and evaluated existing and new baselines.

Result: Verified that QuAnTS is well-formed and comprehensive through extensive experiments, and provided human performance benchmarks as reference for practical usability.

Conclusion: The work lays groundwork for deeper TSQA exploration and aims to encourage future research on text-based interaction with time series models for better decision-making and transparent systems.

Abstract: Text offers intuitive access to information. This can, in particular, complement the density of numerical time series, thereby allowing improved interactions with time series models to enhance accessibility and decision-making. While the creation of question-answering datasets and models has recently seen remarkable growth, most research focuses on question answering (QA) on vision and text, with time series receiving minute attention. To bridge this gap, we propose a challenging novel time series QA (TSQA) dataset, QuAnTS, for Question Answering on Time Series data. Specifically, we pose a wide variety of questions and answers about human motion in the form of tracked skeleton trajectories. We verify that the large-scale QuAnTS dataset is well-formed and comprehensive through extensive experiments. Thoroughly evaluating existing and newly proposed baselines then lays the groundwork for a deeper exploration of TSQA using QuAnTS. Additionally, we provide human performances as a key reference for gauging the practical usability of such models. We hope to encourage future research on interacting with time series models through text, enabling better decision-making and more transparent systems.

[249] Consecutive Preferential Bayesian Optimization

Aras Erarslan, Carlos Sevilla Salcedo, Ville Tanskanen, Anni Nisov, Eero Päiväkumpu, Heikki Aisala, Kaisu Honkapää, Arto Klami, Petrus Mikkola

Main category: cs.LG

TL;DR: Consecutive Preferential Bayesian Optimization reduces production costs by reusing previously generated candidates and incorporates perceptual ambiguity modeling to handle indifference feedback.

Details

Motivation: Existing preferential Bayesian optimization methods ignore the costs of generating candidate solutions for evaluation, which can be expensive in real-world applications.

Method: Generalizes preference-based optimization to account for production and evaluation costs by constraining comparisons to involve previously generated candidates. Incorporates a Just-Noticeable Difference threshold into probabilistic preference model to capture indifference to small utility differences.

Result: Empirically demonstrates notable increase in accuracy in setups with high production costs or with indifference feedback.

Conclusion: The proposed method effectively reduces production costs while maintaining optimization performance by leveraging previously generated candidates and accounting for perceptual ambiguity in human feedback.

Abstract: Preferential Bayesian optimization allows optimization of objectives that are either expensive or difficult to measure directly, by relying on a minimal number of comparative evaluations done by a human expert. Generating candidate solutions for evaluation is also often expensive, but this cost is ignored by existing methods. We generalize preference-based optimization to explicitly account for production and evaluation costs with Consecutive Preferential Bayesian Optimization, reducing production cost by constraining comparisons to involve previously generated candidates. We also account for the perceptual ambiguity of the oracle providing the feedback by incorporating a Just-Noticeable Difference threshold into a probabilistic preference model to capture indifference to small utility differences. We adapt an information-theoretic acquisition strategy to this setting, selecting new configurations that are most informative about the unknown optimum under a preference model accounting for the perceptual ambiguity. We empirically demonstrate a notable increase in accuracy in setups with high production costs or with indifference feedback.
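
To make the Just-Noticeable Difference idea concrete, here is a minimal sketch of a preference model with an indifference band. It assumes a probit-style model in which the oracle perceives the utility difference plus Gaussian noise and reports indifference when the perceived difference falls within the threshold. The function names, the noise model, and the parameters `jnd` and `noise_sd` are illustrative assumptions, not the paper's formulation.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def preference_probs(u_a: float, u_b: float, jnd: float, noise_sd: float = 1.0):
    """Return (P(a preferred), P(indifferent), P(b preferred)).

    Assumed model (hypothetical): the oracle perceives
    d = u_a - u_b + eps with eps ~ N(0, noise_sd^2) and
    reports indifference whenever |d| <= jnd.
    """
    diff = u_a - u_b
    p_b = normal_cdf((-jnd - diff) / noise_sd)        # perceived d below -jnd
    p_a = 1.0 - normal_cdf((jnd - diff) / noise_sd)   # perceived d above +jnd
    p_tie = 1.0 - p_a - p_b                           # perceived d inside the band
    return p_a, p_tie, p_b
```

With `jnd = 0` the model reduces to an ordinary two-outcome probit comparison; widening `jnd` moves probability mass from both strict preferences into the indifference outcome.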

[250] An End-to-End Deep Reinforcement Learning Approach for Solving the Traveling Salesman Problem with Drones

Taihelong Zeng, Yun Lin, Yuhe Shi, Yan Li, Zhiqing Wei, Xuanru Ji

Main category: cs.LG

TL;DR: A hierarchical Actor-Critic deep reinforcement learning framework with Transformer encoder and Minimal Gated Unit decoder solves the Traveling Salesman Problem with Drones (TSP-D), achieving competitive solutions faster than existing methods with superior training efficiency.

Details

Motivation: Truck-drone collaborative systems in last-mile logistics face NP-hard combinatorial complexity that conventional optimization cannot handle efficiently, requiring new approaches for synchronized vehicle coordination.

Method: Hierarchical Actor-Critic DRL framework with Transformer-inspired encoder using k-nearest neighbors sparse attention and global node features, plus Minimal Gated Unit decoder for solution sequence generation.

Result: Achieves competitive or superior solutions on TSP-D instances (N=10-100) with shorter computation times than heuristics and existing RL methods, and significantly reduces total training time while improving final performance.

Conclusion: The proposed framework demonstrates notable advantages in both solution quality and training efficiency for solving complex TSP-D problems in logistics optimization.

Abstract: The emergence of truck-drone collaborative systems in last-mile logistics has positioned the Traveling Salesman Problem with Drones (TSP-D) as a pivotal extension of classical routing optimization, where synchronized vehicle coordination promises substantial operational efficiency and reduced environmental impact, yet introduces NP-hard combinatorial complexity beyond the reach of conventional optimization paradigms. Deep reinforcement learning offers a theoretically grounded framework to address TSP-D’s inherent challenges through self-supervised policy learning and adaptive decision-making. This study proposes a hierarchical Actor-Critic deep reinforcement learning framework for solving the TSP-D problem. The architecture consists of two primary components: a Transformer-inspired encoder and an efficient Minimal Gated Unit decoder. The encoder incorporates a novel, optimized k-nearest neighbors sparse attention mechanism specifically for focusing on relevant spatial relationships, further enhanced by the integration of global node features. The Minimal Gated Unit decoder processes these encoded representations to efficiently generate solution sequences. The entire framework operates within an asynchronous advantage actor-critic paradigm. Experimental results show that, on benchmark TSP-D instances of various scales (N=10 to 100), the proposed model can obtain competitive or even superior solutions in shorter average computation times compared to high-performance heuristic algorithms and existing reinforcement learning methods. Moreover, compared to advanced reinforcement learning algorithm benchmarks, the proposed framework significantly reduces the total training time required while achieving superior final performance, highlighting its notable advantage in training efficiency.
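
The k-nearest-neighbors sparse attention mentioned in the abstract can be illustrated with a small masking sketch. It assumes 2D node coordinates and Euclidean distance, and it lets each node attend only to itself and its `k` closest nodes. The helper names are hypothetical, and the paper's optimized mechanism and global node features are not reproduced here.

```python
import math


def knn_attention_mask(coords, k):
    """mask[i][j] is True when j == i or node j is one of the k nearest nodes to i."""
    n = len(coords)
    mask = [[False] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        others = [j for j in range(n) if j != i]
        # Sort the other nodes by Euclidean distance to node i (ties keep index order).
        others.sort(key=lambda j: math.hypot(coords[j][0] - xi, coords[j][1] - yi))
        mask[i][i] = True
        for j in others[:k]:
            mask[i][j] = True
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over positions where mask_row is True; masked positions get weight 0."""
    peak = max(s for s, keep in zip(scores, mask_row) if keep)
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]
```

Each row of the mask has at most `k + 1` active entries, so the attention cost per node no longer grows with the total number of nodes.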

[251] Multimodal Deep Learning for Prediction of Progression-Free Survival in Patients with Neuroendocrine Tumors Undergoing 177Lu-based Peptide Receptor Radionuclide Therapy

Simon Baur, Tristan Ruhwedel, Ekin Böke, Zuzanna Kobus, Gergana Lishkova, Christoph Wetz, Holger Amthauer, Christoph Roderburg, Frank Tacke, Julian M. Rogasch, Wojciech Samek, Henning Jann, Jackie Ma, Johannes Eschrich

Main category: cs.LG

TL;DR: Multimodal deep learning combining SR-PET, CT, and laboratory biomarkers outperforms unimodal approaches for predicting progression-free survival in PRRT-treated neuroendocrine tumor patients.

Details

Motivation: PRRT is effective for metastatic NETs but only provides long-term disease control in a subset of patients. Predicting PFS could enable individualized treatment planning and risk-adapted follow-up strategies.

Method: Retrospective study of 116 patients with metastatic NETs undergoing 177Lu-DOTATOC PRRT. Seven models were trained including unimodal (laboratory, SR-PET, CT) and multimodal fusion approaches using deep learning and Random Forest methods.

Result: Multimodal fusion model combining laboratory values, SR-PET, and CT achieved best performance (AUROC 0.72 ± 0.01, AUPRC 0.80 ± 0.01), outperforming unimodal approaches. Short-PFS patients had higher baseline chromogranin A, elevated gamma-GT, and fewer PRRT cycles.

Conclusion: Multimodal deep learning combining imaging and laboratory data provides superior PFS prediction for PRRT-treated NET patients, potentially supporting personalized treatment strategies after external validation.

Abstract: Peptide receptor radionuclide therapy (PRRT) is an established treatment for metastatic neuroendocrine tumors (NETs), yet long-term disease control occurs only in a subset of patients. Predicting progression-free survival (PFS) could support individualized treatment planning. This study evaluates laboratory, imaging, and multimodal deep learning models for PFS prediction in PRRT-treated patients. In this retrospective, single-center study, 116 patients with metastatic NETs undergoing 177Lu-DOTATOC were included. Clinical characteristics, laboratory values, and pretherapeutic somatostatin receptor positron emission tomography/computed tomography (SR-PET/CT) scans were collected. Seven models were trained to classify low- vs. high-PFS groups, including unimodal (laboratory, SR-PET, or CT) and multimodal fusion approaches. Explainability was evaluated by feature importance analysis and gradient maps. Forty-two patients (36%) had short PFS (< 1 year) and 74 patients had long PFS (> 1 year). Groups were similar in most characteristics, except for higher baseline chromogranin A (p = 0.003), elevated gamma-GT (p = 0.002), and fewer PRRT cycles (p < 0.001) in short-PFS patients. The Random Forest model trained only on laboratory biomarkers reached an AUROC of 0.59 ± 0.02. Unimodal three-dimensional convolutional neural networks using SR-PET or CT performed worse (AUROC 0.42 ± 0.03 and 0.54 ± 0.01, respectively). A multimodal fusion model combining laboratory values, SR-PET, and CT, augmented with a pretrained CT branch, achieved the best results (AUROC 0.72 ± 0.01, AUPRC 0.80 ± 0.01). Multimodal deep learning combining SR-PET, CT, and laboratory biomarkers outperformed unimodal approaches for PFS prediction after PRRT. Upon external validation, such models may support risk-adapted follow-up strategies.

[252] Integrating Score-Based Diffusion Models with Machine Learning-Enhanced Localization for Advanced Data Assimilation in Geological Carbon Storage

Gabriel Serrão Seabra, Nikolaj T. Mücke, Vinicius Luiz Santos Silva, Alexandre A. Emerick, Denis Voskov, Femke Vossepoel

Main category: cs.LG

TL;DR: Machine learning-enhanced localization with diffusion models improves data assimilation for CO2 injection in channelized reservoirs, maintaining ensemble variance while achieving good data matching.

Details

Motivation: Accurate subsurface heterogeneity characterization is crucial for safe geological carbon storage projects, requiring improved data assimilation methods.

Method: Integrates score-based diffusion models with ML-enhanced localization using large ensembles, FLUVSIM geostatistical model, and DARTS simulator for CO2 injection scenarios.

Result: ML-based localization maintains significantly more ensemble variance than no localization while achieving comparable data-matching quality.

Conclusion: The framework improves reliability of uncertainty quantification for risk assessment in geological carbon storage projects.

Abstract: Accurate characterization of subsurface heterogeneity is important for the safe and effective implementation of geological carbon storage (GCS) projects. This paper explores how machine learning methods can enhance data assimilation for GCS with a framework that integrates score-based diffusion models with machine learning-enhanced localization in channelized reservoirs during CO$_2$ injection. We employ a machine learning-enhanced localization framework that uses large ensembles ($N_s = 5000$) with permeabilities generated by the diffusion model and states computed by simple ML algorithms to improve covariance estimation for the Ensemble Smoother with Multiple Data Assimilation (ESMDA). We apply ML algorithms to a prior ensemble of channelized permeability fields, generated with the geostatistical model FLUVSIM. Our approach is applied on a CO$_2$ injection scenario simulated using the Delft Advanced Research Terra Simulator (DARTS). Our ML-based localization maintains significantly more ensemble variance than when localization is not applied, while achieving comparable data-matching quality. This framework has practical implications for GCS projects, helping improve the reliability of uncertainty quantification for risk assessment.
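
As a rough illustration of where localization enters an ES-MDA update, the sketch below performs one assimilation step for a single scalar observation, so the usual matrix inverse reduces to a division. The localization weights `loc` multiply the parameter-data cross-covariance element-wise (a Schur product). All names are hypothetical, and the way the weights are obtained (in the paper, from a large ML-generated ensemble) is not modeled here.

```python
import math
import random


def esmda_step(params, preds, d_obs, obs_var, alpha, loc, rng):
    """One ES-MDA update for a single scalar observation.

    params: list of ensemble members, each a list of model parameters
    preds:  simulated observation for each member, in the same order
    alpha:  inflation coefficient for this step (the 1/alpha values over all steps sum to 1)
    loc:    localization weight in [0, 1] for each parameter
    """
    n_ens, n_par = len(params), len(params[0])
    mean_d = sum(preds) / n_ens
    mean_m = [sum(m[p] for m in params) / n_ens for p in range(n_par)]
    # Ensemble estimates of the data variance and parameter-data cross-covariance.
    c_dd = sum((d - mean_d) ** 2 for d in preds) / (n_ens - 1)
    c_md = [
        sum((m[p] - mean_m[p]) * (d - mean_d) for m, d in zip(params, preds)) / (n_ens - 1)
        for p in range(n_par)
    ]
    # Localized gain: weights damp cross-covariances believed to be spurious.
    gain = [loc[p] * c_md[p] / (c_dd + alpha * obs_var) for p in range(n_par)]
    updated = []
    for m, d in zip(params, preds):
        # Each member assimilates its own perturbed copy of the observation.
        perturbed = d_obs + math.sqrt(alpha * obs_var) * rng.gauss(0.0, 1.0)
        updated.append([m[p] + gain[p] * (perturbed - d) for p in range(n_par)])
    return updated
```

A parameter with localization weight 0 is left untouched by the update, which is how localization preserves ensemble variance for parameters that the observation should not inform.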

[253] Associative Poisoning to Generative Machine Learning

Mathias Lundteigen Mohus, Jingyue Li, Zhirong Yang

Main category: cs.LG

TL;DR: A novel data poisoning technique called associative poisoning that compromises fine-grained features in generative models without requiring control over the training process, by manipulating statistical associations between specific feature pairs while preserving marginal distributions.

Details

Motivation: Existing poisoning attacks either cause broad degradation of generated data or require control over training process, limiting real-world applicability. Generative models like Stable Diffusion and ChatGPT are attractive targets for malicious exploitation through data poisoning.

Method: Associative poisoning perturbs only training data to manipulate statistical associations between specific feature pairs in generated outputs. Provides formal mathematical formulation and proves theoretical feasibility and stealthiness.

Result: Empirical evaluations show associative poisoning effectively induces or suppresses feature associations while preserving marginal distributions of targeted features and maintaining high-quality outputs, evading visual detection.

Conclusion: Generative systems in image synthesis, synthetic dataset generation, and NLP are susceptible to subtle, stealthy manipulations that compromise statistical integrity. Examines limitations of existing defenses and proposes novel countermeasure strategy.

Abstract: The widespread adoption of generative models such as Stable Diffusion and ChatGPT has made them increasingly attractive targets for malicious exploitation, particularly through data poisoning. Existing poisoning attacks compromising synthesised data typically either cause broad degradation of generated data or require control over the training process, limiting their applicability in real-world scenarios. In this paper, we introduce a novel data poisoning technique called associative poisoning, which compromises fine-grained features of the generated data without requiring control of the training process. This attack perturbs only the training data to manipulate statistical associations between specific feature pairs in the generated outputs. We provide a formal mathematical formulation of the attack and prove its theoretical feasibility and stealthiness. Empirical evaluations using two state-of-the-art generative models demonstrate that associative poisoning effectively induces or suppresses feature associations while preserving the marginal distributions of the targeted features and maintaining high-quality outputs, thereby evading visual detection. These results suggest that generative systems used in image synthesis, synthetic dataset generation, and natural language processing are susceptible to subtle, stealthy manipulations that compromise their statistical integrity. To address this risk, we examine the limitations of existing defensive strategies and propose a novel countermeasure strategy.
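
The statistical claim that an association can change while marginal distributions stay fixed can be checked on a 2x2 contingency table. The sketch below is only a toy illustration for two binary features: moving the same number of samples from each discordant cell to each concordant cell leaves both marginals unchanged but raises the covariance. The function names and the count-based setup are assumptions, not the paper's formulation.

```python
def shift_association(counts, delta):
    """Move `delta` samples from cells (0,1) and (1,0) into cells (0,0) and (1,1).

    counts: dict mapping (a, b) in {0,1} x {0,1} to a sample count.
    A positive delta strengthens the association; a negative delta weakens it.
    """
    shifted = dict(counts)
    shifted[(0, 1)] -= delta
    shifted[(1, 0)] -= delta
    shifted[(0, 0)] += delta
    shifted[(1, 1)] += delta
    if min(shifted.values()) < 0:
        raise ValueError("delta is too large for the available counts")
    return shifted


def marginals(counts):
    """Return (P(A=1), P(B=1))."""
    total = sum(counts.values())
    p_a = (counts[(1, 0)] + counts[(1, 1)]) / total
    p_b = (counts[(0, 1)] + counts[(1, 1)]) / total
    return p_a, p_b


def covariance(counts):
    """Cov(A, B) = P(A=1, B=1) - P(A=1) * P(B=1) for binary features."""
    total = sum(counts.values())
    p_a, p_b = marginals(counts)
    return counts[(1, 1)] / total - p_a * p_b
```

Starting from a uniform table of 25 samples per cell, a shift of 10 keeps both marginals at 0.5 while the covariance moves from 0 to 0.10, so a check that only inspects per-feature frequencies would not notice the change.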

[254] Linear Gradient Prediction with Control Variates

Kamil Ciosek, Nicolò Felicioni, Juan Elenter Litwin

Main category: cs.LG

TL;DR: Proposes training neural networks using approximate predicted gradients instead of full gradients to reduce training costs, with a control-variate technique for unbiased updates and a Neural Tangent Kernel-inspired predictor.

Details

Motivation: To reduce the high computational cost of neural network training, particularly the expensive backward pass required for full gradient computation.

Method: Uses approximate predicted gradients instead of full gradients, employs control-variate technique to ensure unbiased gradient estimates, and develops a predictor inspired by Neural Tangent Kernel theory.

Result: Empirically demonstrates effectiveness on a vision transformer classification task.

Conclusion: Predicting gradients approximately and correcting them with a control variate offers a way to reduce training cost while keeping updates unbiased; the abstract reports efficacy on one vision transformer classification task.

Abstract: We propose a new way of training neural networks, with the goal of reducing training cost. Our method uses approximate predicted gradients instead of the full gradients that require an expensive backward pass. We derive a control-variate-based technique that ensures our updates are unbiased estimates of the true gradient. Moreover, we propose a novel way to derive a predictor for the gradient inspired by the theory of the Neural Tangent Kernel. We empirically show the efficacy of the technique on a vision transformer classification task.
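
The unbiasedness argument can be illustrated with a generic control-variate estimator. The sketch below assumes scalar per-example gradients, a cheap predictor evaluated on the whole batch, and the expensive true gradient evaluated only on a random subset. This is a textbook control-variate form chosen for illustration; the paper's actual estimator and its Neural Tangent Kernel-inspired predictor may differ, and all names are hypothetical.

```python
import random


def control_variate_gradient(true_grad, predicted_grad, batch, subset_size, rng):
    """Estimate the mean true gradient over `batch`.

    predicted_grad is evaluated on every example (assumed cheap);
    true_grad is evaluated only on a uniformly sampled subset (assumed expensive).
    """
    subset = rng.sample(batch, subset_size)
    cheap_mean = sum(predicted_grad(x) for x in batch) / len(batch)
    # The correction term has expectation mean(true) - mean(predicted),
    # which cancels the bias of the predictor regardless of its quality.
    correction = sum(true_grad(x) - predicted_grad(x) for x in subset) / subset_size
    return cheap_mean + correction
```

The better the predictor tracks the true gradient, the smaller the variance of the correction term, so fewer expensive evaluations are needed for the same estimate quality.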

[255] ActiTect: A Generalizable Machine Learning Pipeline for REM Sleep Behavior Disorder Screening through Standardized Actigraphy

David Bertram, Anja Ophey, Sinah Röttgen, Konstantin Kuffer, Gereon R. Fink, Elke Kalbe, Clint Hansen, Walter Maetzler, Maximilian Kapsecker, Lara M. Reimer, Stephan Jonas, Andreas T. Damgaard, Natasha B. Bertelsen, Casper Skjaerbaek, Per Borghammer, Karolien Groenewald, Pietro-Luca Ratti, Michele T. Hu, Noémie Moreau, Michael Sommerauer, Katarzyna Bozek

Main category: cs.LG

TL;DR: ActiTect is an automated, open-source machine learning tool that detects REM sleep behavior disorder (RBD) from wrist-worn actigraphy data with strong performance across multiple validation cohorts.

Details

Motivation: iRBD is a major prodromal marker for α-synucleinopathies like Parkinson's disease, and wrist-worn actimeters have potential for large-scale screening but need reliable analysis pipelines.

Method: Developed ActiTect with robust preprocessing, automated sleep-wake detection, and physiologically interpretable motion features. Used machine learning on 78 individuals with nested cross-validation.

Result: Achieved AUROC of 0.95 in development, 0.86 on local test set (n=31), and 0.84-0.94 on two external cohorts (n=113, n=57). Leave-one-dataset-out validation showed consistent performance (AUROC 0.84-0.89).

Conclusion: ActiTect provides a robust, generalizable RBD detection tool that is open-source for widespread adoption, facilitating independent validation and collaborative improvements in wearable-based RBD screening.

Abstract: Isolated rapid eye movement sleep behavior disorder (iRBD) is a major prodromal marker of $\alpha$-synucleinopathies, often preceding the clinical onset of Parkinson’s disease, dementia with Lewy bodies, or multiple system atrophy. While wrist-worn actimeters hold significant potential for detecting RBD in large-scale screening efforts by capturing abnormal nocturnal movements, they become inoperable without a reliable and efficient analysis pipeline. This study presents ActiTect, a fully automated, open-source machine learning tool to identify RBD from actigraphy recordings. To ensure generalizability across heterogeneous acquisition settings, our pipeline includes robust preprocessing and automated sleep-wake detection to harmonize multi-device data and extract physiologically interpretable motion features characterizing activity patterns. Model development was conducted on a cohort of 78 individuals, yielding strong discrimination under nested cross-validation (AUROC = 0.95). Generalization was confirmed on a blinded local test set (n = 31, AUROC = 0.86) and on two independent external cohorts (n = 113, AUROC = 0.84; n = 57, AUROC = 0.94). To assess real-world robustness, leave-one-dataset-out cross-validation across the internal and external cohorts demonstrated consistent performance (AUROC range = 0.84-0.89). A complementary stability analysis showed that key predictive features remained reproducible across datasets, supporting the final pooled multi-center model as a robust pre-trained resource for broader deployment. By being open-source and easy to use, our tool promotes widespread adoption and facilitates independent validation and collaborative improvements, thereby advancing the field toward a unified and generalizable RBD detection model using wearable devices.

[256] The Causal Round Trip: Generating Authentic Counterfactuals by Eliminating Information Loss

Rui Wu, Lizheng Wang, Yongjun Li

Main category: cs.LG

TL;DR: BELM-MDCM is a diffusion-based framework that eliminates Structural Reconstruction Error to enable faithful counterfactual reasoning through causally sound abduction.

Details

Motivation: Standard diffusion models introduce information loss (Structural Reconstruction Error) when used for counterfactual reasoning, preventing faithful abduction required for Pearl's Structural Causal Models.

Method: Introduces BELM-MDCM framework with Causal Information Conservation principle, analytically invertible mechanisms, targeted modeling strategy, and hybrid training objective to eliminate SRE by construction.

Result: Achieves state-of-the-art accuracy and enables high-fidelity, individual-level counterfactuals for deep causal inquiries.

Conclusion: Provides a foundational blueprint that reconciles modern generative models with classical causal theory, establishing a new rigorous standard for causal inference.

Abstract: Judea Pearl’s vision of Structural Causal Models (SCMs) as engines for counterfactual reasoning hinges on faithful abduction: the precise inference of latent exogenous noise. For decades, operationalizing this step for complex, non-linear mechanisms has remained a significant computational challenge. The advent of diffusion models, powerful universal function approximators, offers a promising solution. However, we argue that their standard design, optimized for perceptual generation over logical inference, introduces a fundamental flaw for this classical problem: an inherent information loss we term the Structural Reconstruction Error (SRE). To address this challenge, we formalize the principle of Causal Information Conservation (CIC) as the necessary condition for faithful abduction. We then introduce BELM-MDCM, the first diffusion-based framework engineered to be causally sound by eliminating SRE by construction through an analytically invertible mechanism. To operationalize this framework, a Targeted Modeling strategy provides structural regularization, while a Hybrid Training Objective instills a strong causal inductive bias. Rigorous experiments demonstrate that our Zero-SRE framework not only achieves state-of-the-art accuracy but, more importantly, enables the high-fidelity, individual-level counterfactuals required for deep causal inquiries. Our work provides a foundational blueprint that reconciles the power of modern generative models with the rigor of classical causal theory, establishing a new and more rigorous standard for this emerging field.
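
The role of faithful abduction can be seen in a toy abduction-action-prediction round trip. The sketch assumes a hand-written additive-noise mechanism `y = 2x + u`, which is exactly invertible in `u`, and compares it with a deliberately lossy abduction that rounds the inferred noise. The mechanism, the function names, and the reconstruction check are illustrative stand-ins, not the BELM-MDCM model or the paper's formal SRE definition.

```python
def mechanism(x: float, u: float) -> float:
    """Assumed structural equation: Y := 2 * X + U."""
    return 2.0 * x + u


def abduct_exact(x: float, y: float) -> float:
    """Invert the mechanism analytically to recover the exogenous noise."""
    return y - 2.0 * x


def abduct_lossy(x: float, y: float) -> float:
    """A lossy abduction: the recovered noise is rounded to the nearest integer."""
    return float(round(y - 2.0 * x))


def counterfactual(x_obs, y_obs, x_new, abduct=abduct_exact):
    """Abduction (infer u), action (set X to x_new), prediction (push u through)."""
    u = abduct(x_obs, y_obs)
    return mechanism(x_new, u)


def reconstruction_error(x_obs, y_obs, abduct=abduct_exact):
    """Round-trip error: how far the regenerated factual y is from the observed y."""
    return abs(mechanism(x_obs, abduct(x_obs, y_obs)) - y_obs)
```

For the observation `x = 1.0`, `y = 2.75`, exact abduction reconstructs the factual outcome with zero error, while the lossy variant is off by 0.25 and carries that same error into every counterfactual it produces.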

[257] Embedding-Space Data Augmentation to Prevent Membership Inference Attacks in Clinical Time Series Forecasting

Marius Fracarolli, Michael Staniek, Stefan Riezler

Main category: cs.LG

TL;DR: Data augmentation methods like ZOO-PCA can reduce membership inference attack effectiveness on time series forecasting models while maintaining predictive performance.

Details

Motivation: Need to balance privacy protection against membership inference attacks with maintaining high predictive performance in time series forecasting of electronic health records.

Method: Explored multiple data augmentation strategies: Zeroth-Order Optimization (ZOO), ZOO-PCA (constrained by Principal Component Analysis), and MixUp to generate synthetic samples that resemble original training data but introduce novelty.

Result: ZOO-PCA achieved the best reduction in true-positive to false-positive ratio for membership inference attacks without sacrificing test data performance.

Conclusion: Data augmentation, particularly ZOO-PCA, effectively enhances model resilience against membership inference attacks while preserving forecasting accuracy.

Abstract: Balancing strong privacy guarantees with high predictive performance is critical for time series forecasting (TSF) tasks involving Electronic Health Records (EHR). In this study, we explore how data augmentation can mitigate Membership Inference Attacks (MIA) on TSF models. We show that retraining with synthetic data can substantially reduce the effectiveness of loss-based MIAs by reducing the attacker’s true-positive to false-positive ratio. The key challenge is generating synthetic samples that closely resemble the original training data to confuse the attacker, while also introducing enough novelty to enhance the model’s ability to generalize to unseen data. We examine multiple augmentation strategies - Zeroth-Order Optimization (ZOO), a variant of ZOO constrained by Principal Component Analysis (ZOO-PCA), and MixUp - to strengthen model resilience without sacrificing accuracy. Our experimental results show that ZOO-PCA yields the best reductions in the true-positive rate to false-positive rate (TPR/FPR) ratio for MIAs without sacrificing performance on test data.
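
Two ingredients named in the abstract are easy to sketch: MixUp as a convex combination of two samples, and the true-positive and false-positive rates of a loss-threshold membership inference attack. The code below works on plain feature lists rather than learned embeddings, assumes the attacker flags a sample as a training member when its loss falls below a threshold, and uses hypothetical function names. ZOO and ZOO-PCA are not shown.

```python
import random


def mixup(x_a, x_b, alpha, rng):
    """Return a convex combination of two samples and the mixing weight.

    The weight is drawn from Beta(alpha, alpha), as in standard MixUp.
    """
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    return mixed, lam


def loss_threshold_attack(member_losses, nonmember_losses, threshold):
    """Return (TPR, FPR) for an attacker who predicts 'member' when loss < threshold."""
    tpr = sum(loss < threshold for loss in member_losses) / len(member_losses)
    fpr = sum(loss < threshold for loss in nonmember_losses) / len(nonmember_losses)
    return tpr, fpr
```

A defense is working in the sense described above when the TPR/FPR ratio at a given threshold moves toward 1, meaning low loss no longer separates training members from unseen samples.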

[258] Attention and Compression is all you need for Controllably Efficient Language Models

Jatin Prakash, Aahlad Puli, Rajesh Ranganath

Main category: cs.LG

TL;DR: CAT (Compress & Attend Transformer) is an efficient transformer architecture that uses dense attention and compression to reduce computational costs while maintaining quality, enabling adaptive quality-compute trade-offs at test time without retraining.

Details

Motivation: Existing efficient transformer approaches (sparse attention, sliding windows, convolutions, linear attention) often sacrifice in-context recall performance, require heuristic choices, complex recurrent states, or hybrid architectures, making them suboptimal and complicated to scale.

Method: CAT decodes chunks of tokens by attending to compressed chunks of previous sequences. It uses two simple components: dense attention and compression, allowing training with multiple chunk sizes simultaneously for adaptive quality-compute trade-offs.

Result: A single adaptive CAT model outperforms existing efficient baselines across different compute-memory budgets. A single CAT also matches a dense transformer in language modeling while being 1.4-3x faster and requiring 2-9x lower total memory usage.

Conclusion: CAT provides a simple yet effective solution for efficient transformers that maintains quality while offering significant speed and memory improvements, with the unique advantage of adaptive quality-compute trade-offs at test time.

Abstract: The quadratic cost of attention in transformers motivated the development of efficient approaches: namely sparse and sliding window attention, convolutions and linear attention. Although these approaches result in impressive reductions in compute and memory, they often trade off quality, specifically in-context recall performance. Moreover, fixing this quality-compute tradeoff a priori means being suboptimal from the get-go: some downstream applications require more memory for in-context recall, while others require lower latency and memory. Further, these approaches rely on heuristic choices that artificially restrict attention, or require handcrafted and complex recurrent state update rules, or they must be carefully composed with attention at specific layers to form a hybrid architecture that complicates the design process, especially at scale. To address the above issues, we propose Compress & Attend Transformer (CAT), a conceptually simple architecture employing only two simple ingredients: dense attention and compression. CAT decodes chunks of tokens by attending to compressed chunks of the sequence so far. Compression results in decoding from a reduced sequence length that yields compute and memory savings, while choosing a particular chunk size trades off quality for efficiency. Moreover, CAT can be trained with multiple chunk sizes at once, unlocking control of quality-compute trade-offs directly at test time without any retraining, all in a single adaptive architecture. In exhaustive evaluations on common language modeling tasks, in-context recall, and long-context understanding, a single adaptive CAT model outperforms existing efficient baselines, including hybrid architectures, across different compute-memory budgets. Further, a single CAT matches a dense transformer in language modeling across model scales while being 1.4-3x faster and requiring 2-9x lower total memory usage.
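
The effect of chunk size on decoding cost can be estimated with a back-of-the-envelope count. The sketch below assumes, purely for illustration, that every completed chunk is compressed into a single vector and that a token attends to those vectors plus the tokens already decoded in its own chunk. The abstract does not state the compression ratio, so both the assumption and the function names are hypothetical.

```python
def attended_length(position: int, chunk_size: int) -> int:
    """Number of items a token at `position` (0-indexed) attends to.

    Assumption: each completed earlier chunk contributes one compressed vector,
    and the current chunk contributes its tokens up to and including this one.
    """
    compressed_chunks = position // chunk_size
    tokens_in_current_chunk = position % chunk_size + 1
    return compressed_chunks + tokens_in_current_chunk


def total_attention_cost(seq_len: int, chunk_size: int) -> int:
    """Sum of attended lengths over a sequence, a proxy for attention compute."""
    return sum(attended_length(t, chunk_size) for t in range(seq_len))
```

Under this counting, a chunk size of 1 recovers dense causal attention, and larger chunks shrink the attended length, which is the knob the abstract describes for trading quality against compute and memory.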

[259] Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction

Yiting He, Zhishuai Liu, Weixin Wang, Pan Xu

Main category: cs.LG

TL;DR: This paper studies online off-dynamics RL where training and deployment dynamics differ, formulated as robust MDPs. It addresses the challenging setting of online interaction with training environments, introduces a novel supremal visitation ratio metric, and provides the first efficient algorithm with sublinear regret.

Details

Motivation: Existing off-dynamics RL literature assumes access to generative models or pre-collected datasets with good state coverage, bypassing exploration challenges. This work addresses the more realistic setting where agents are limited to online interaction with training environments.

Method: The paper introduces the supremal visitation ratio to measure training-deployment dynamics mismatch. It proposes a computationally efficient algorithm for online RMDPs with f-divergence based transition uncertainties, achieving sublinear regret.

Result: The algorithm achieves sublinear regret in online RMDPs and establishes matching regret lower bounds, demonstrating optimal dependence on both the supremal visitation ratio and number of interaction episodes. Numerical experiments validate the theoretical results.

Conclusion: The work provides the first efficient algorithm for online off-dynamics RL with theoretical guarantees, addressing the fundamental challenge of exploration in realistic settings where training and deployment dynamics differ.

Abstract: Off-dynamics reinforcement learning (RL), where training and deployment transition dynamics are different, can be formulated as learning in a robust Markov decision process (RMDP) where uncertainties in transition dynamics are imposed. Existing literature mostly assumes access to generative models allowing arbitrary state-action queries or pre-collected datasets with a good state coverage of the deployment environment, bypassing the challenge of exploration. In this work, we study a more realistic and challenging setting where the agent is limited to online interaction with the training environment. To capture the intrinsic difficulty of exploration in online RMDPs, we introduce the supremal visitation ratio, a novel quantity that measures the mismatch between the training dynamics and the deployment dynamics. We show that if this ratio is unbounded, online learning becomes exponentially hard. We propose the first computationally efficient algorithm that achieves sublinear regret in online RMDPs with $f$-divergence based transition uncertainties. We also establish matching regret lower bounds, demonstrating that our algorithm achieves optimal dependence on both the supremal visitation ratio and the number of interaction episodes. Finally, we validate our theoretical results through comprehensive numerical experiments.
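
To make the robust-MDP formulation more concrete, the sketch below evaluates a worst-case expected value over a KL-divergence ball around a nominal transition distribution, using the standard dual form of that problem. KL is only one instance of an $f$-divergence uncertainty set, and this is not the paper's algorithm; the function name, the radius `rho`, and the grid search over the dual variable are illustrative assumptions.

```python
import math

def kl_robust_expectation(values, nominal_probs, rho, num_grid=400):
    """Approximate  inf over P with KL(P || P0) <= rho  of  E_P[V]  through the dual
    sup over lam > 0 of  -lam * log E_P0[exp(-V / lam)] - lam * rho.
    A log-spaced grid over lam yields a lower bound on the supremum."""
    # Smallest value on the support of the nominal distribution.
    v_min = min(v for v, p in zip(values, nominal_probs) if p > 0)
    best = -math.inf
    for i in range(num_grid):
        lam = 10 ** (-3 + 6 * i / (num_grid - 1))  # lam ranges over [1e-3, 1e3]
        # Shift by v_min so every exponent is <= 0 (numerically stable).
        mgf = sum(p * math.exp(-(v - v_min) / lam)
                  for v, p in zip(values, nominal_probs) if p > 0)
        dual = v_min - lam * math.log(mgf) - lam * rho
        best = max(best, dual)
    return best

# Hypothetical next-state values and nominal transition probabilities.
values = [1.0, 2.0, 4.0]
nominal_probs = [0.2, 0.5, 0.3]
nominal_value = sum(v * p for v, p in zip(values, nominal_probs))
robust_small = kl_robust_expectation(values, nominal_probs, rho=0.1)
robust_large = kl_robust_expectation(values, nominal_probs, rho=0.5)
```

A larger radius admits more adversarial dynamics, so the robust value can only decrease as `rho` grows, and it never exceeds the nominal expectation.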

[260] Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval

Janet Jenq, Hongda Shen

Main category: cs.LG

TL;DR: Proposes using typographic attacks positively by rendering product text on images to improve multimodal retrieval in e-commerce, showing consistent gains across datasets and models.

Details

Motivation: Vision-language models like CLIP are vulnerable to typographic attacks where embedded text skews predictions, but this can be leveraged to improve product retrieval by rendering relevant metadata on images.

Method: Reverse typographic attack logic by rendering product titles/descriptions directly onto product images for vision-text compression, strengthening image-text alignment.

Result: Consistent improvements in unimodal and multimodal retrieval accuracy across three e-commerce datasets (sneakers, handbags, trading cards) using six state-of-the-art vision foundation models.

Conclusion: Visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.

Abstract: Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.

[261] Learning Dynamics from Input-Output Data with Hamiltonian Gaussian Processes

Jan-Hendrik Ewering, Robin E. Herrmann, Niklas Wahlström, Thomas B. Schön, Thomas Seel

Main category: cs.LG

TL;DR: This paper presents a Bayesian method for learning Hamiltonian dynamics from input-output data without requiring velocity/momentum measurements, using reduced-rank Gaussian Process approximation for computational efficiency.

Details

Motivation: To construct physically consistent models from limited data by embedding energy conservation laws, addressing the practical limitation that velocity/momentum data is rarely available in real applications.

Method: Uses non-conservative Hamiltonian Gaussian Processes with fully Bayesian scheme for estimating hidden states, GP hyperparameters, and structural parameters like damping coefficients, employing reduced-rank GP approximation for computational efficiency.

Result: The method is evaluated in a nonlinear simulation case study and compared against a state-of-the-art approach that relies on momentum measurements.

Conclusion: The proposed approach enables physically consistent dynamics learning from more realistic input-output data settings without needing velocity/momentum measurements, while maintaining computational efficiency through reduced-rank GP approximation.

Abstract: Embedding non-restrictive prior knowledge, such as energy conservation laws, in learning-based approaches is a key motive to construct physically consistent models from limited data, relevant for, e.g., model-based control. Recent work incorporates Hamiltonian dynamics into Gaussian Process (GP) regression to obtain uncertainty-quantifying models that adhere to the underlying physical principles. However, these works rely on velocity or momentum data, which is rarely available in practice. In this paper, we consider dynamics learning with non-conservative Hamiltonian GPs, and address the more realistic problem setting of learning from input-output data. We provide a fully Bayesian scheme for estimating probability densities of unknown hidden states, of GP hyperparameters, as well as of structural hyperparameters, such as damping coefficients. Considering the computational complexity of GPs, we take advantage of a reduced-rank GP approximation and leverage its properties for computationally efficient prediction and training. The proposed method is evaluated in a nonlinear simulation case study and compared to a state-of-the-art approach that relies on momentum measurements.
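
As a rough illustration of what non-conservative Hamiltonian dynamics with input-output data looks like, the sketch below simulates a damped mass-spring system in Hamiltonian form and records only the position as output. The quadratic Hamiltonian, the parameter values, and the semi-implicit Euler integrator are assumptions chosen for illustration; in the paper the Hamiltonian is modeled with a GP and the hidden states are inferred rather than simulated.

```python
def hamiltonian(q, p, mass=1.0, stiffness=1.0):
    """Total energy H(q, p) = p^2 / (2 m) + k q^2 / 2 (assumed form)."""
    return p * p / (2.0 * mass) + 0.5 * stiffness * q * q

def simulate(q0, p0, inputs, damping=0.5, mass=1.0, stiffness=1.0, dt=0.01):
    """Integrate  dq/dt = dH/dp,  dp/dt = -dH/dq - damping * dH/dp + u
    with semi-implicit Euler. Returns the position outputs and final state."""
    q, p = q0, p0
    outputs = []
    for u in inputs:
        dH_dq = stiffness * q
        dH_dp = p / mass
        p = p + dt * (-dH_dq - damping * dH_dp + u)  # momentum update first
        q = q + dt * (p / mass)                      # then position, using new p
        outputs.append(q)                            # only q is "measured"
    return outputs, (q, p)

# Unforced run: the damping term should dissipate energy over time.
outputs, (q_end, p_end) = simulate(q0=1.0, p0=0.0, inputs=[0.0] * 1000)
energy_start = hamiltonian(1.0, 0.0)
energy_end = hamiltonian(q_end, p_end)
```

The momentum `p` never appears in `outputs`, which mirrors the setting where velocity or momentum measurements are unavailable.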

[262] ProDER: A Continual Learning Approach for Fault Prediction in Evolving Smart Grids

Emad Efatinasab, Nahal Azadi, Davide Dalle Pezze, Gian Antonio Susto, Chuadhry Mujeeb Ahmed, Mirco Rampazzo

Main category: cs.LG

TL;DR: Proposes ProDER, a continual learning framework for smart grid fault prediction that adapts to evolving environments with new fault types and operational zones, achieving minimal accuracy drops.

Details

Motivation: Existing AI-based fault prediction models struggle with reliability in evolving smart grid environments that require adaptation to new fault types and operational zones.

Method: ProDER (Prototype-based Dark Experience Replay) integrates prototype-based feature regularization, logit distillation, and prototype-guided replay memory in a unified replay-based continual learning approach.

Result: ProDER achieved best performance among tested CL techniques with only 0.045 accuracy drop for fault type prediction and 0.015 for fault zone prediction across four realistic evaluation scenarios.

Conclusion: The results demonstrate the practicality of continual learning for scalable, real-world fault prediction in smart grids.

Abstract: As smart grids evolve to meet growing energy demands and modern operational challenges, the ability to accurately predict faults becomes increasingly critical. However, existing AI-based fault prediction models struggle to ensure reliability in evolving environments where they are required to adapt to new fault types and operational zones. In this paper, we propose a continual learning (CL) framework in the smart grid context to evolve the model together with the environment. We design four realistic evaluation scenarios grounded in class-incremental and domain-incremental learning to emulate evolving grid conditions. We further introduce Prototype-based Dark Experience Replay (ProDER), a unified replay-based approach that integrates prototype-based feature regularization, logit distillation, and a prototype-guided replay memory. ProDER achieves the best performance among tested CL techniques, with only a 0.045 accuracy drop for fault type prediction and 0.015 for fault zone prediction. These results demonstrate the practicality of CL for scalable, real-world fault prediction in smart grids.
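
The logit-distillation ingredient of a Dark-Experience-Replay-style objective can be sketched as a cross-entropy term on new data plus a squared-error penalty between the current logits and the logits stored in the replay memory. The sketch covers only that generic ingredient; the weighting `alpha` and the function names are assumptions, and the prototype-based regularization and prototype-guided memory specific to ProDER are not shown.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

def replay_distillation_loss(new_logits, new_label, current_replay_logits,
                             stored_replay_logits, alpha=0.5):
    """Cross-entropy on the new sample plus alpha times the mean squared
    difference between current and stored logits for a replayed sample."""
    ce = cross_entropy(new_logits, new_label)
    mse = sum((c - s) ** 2 for c, s in
              zip(current_replay_logits, stored_replay_logits)) / len(stored_replay_logits)
    return ce + alpha * mse

# If the model still reproduces the stored logits, only cross-entropy remains.
no_drift = replay_distillation_loss([0.0, 0.0], 0, [1.0, -1.0], [1.0, -1.0])
with_drift = replay_distillation_loss([0.0, 0.0], 0, [3.0, -1.0], [1.0, -1.0])
```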

[263] SAD-Flower: Flow Matching for Safe, Admissible, and Dynamically Consistent Planning

Tzu-Yuan Huang, Armin Lederer, Dai-Jie Wu, Xiaobing Dai, Sihua Zhang, Stefan Sosnowski, Shao-Hua Sun, Sandra Hirche

Main category: cs.LG

TL;DR: SAD-Flower is a novel framework that generates Safe, Admissible, and Dynamically consistent trajectories by augmenting flow matching with virtual control inputs, providing formal guarantees without retraining.

Details

Motivation: Current flow matching methods lack formal guarantees for state/action constraints and dynamic consistency, which are crucial for safety and executability of planned trajectories.

Method: Augments flow matching with virtual control inputs and uses nonlinear control theory techniques to derive principled guidance for constraint satisfaction.

Result: Outperforms various generative-model-based baselines in ensuring constraint satisfaction across several tasks, operating without retraining.

Conclusion: SAD-Flower successfully addresses the limitations of existing FM planners by providing formal guarantees for safety, admissibility, and dynamic consistency while maintaining flexibility for unseen constraints.

Abstract: Flow matching (FM) has shown promising results in data-driven planning. However, it inherently lacks formal guarantees for ensuring state and action constraints, whose satisfaction is a fundamental and crucial requirement for the safety and admissibility of planned trajectories on various systems. Moreover, existing FM planners do not ensure dynamical consistency, which can render trajectories inexecutable. We address these shortcomings by proposing SAD-Flower, a novel framework for generating Safe, Admissible, and Dynamically consistent trajectories. Our approach relies on an augmentation of the flow with a virtual control input. Thereby, principled guidance can be derived using techniques from nonlinear control theory, providing formal guarantees for state constraints, action constraints, and dynamic consistency. Crucially, SAD-Flower operates without retraining, enabling test-time satisfaction of unseen constraints. Through extensive experiments across several tasks, we demonstrate that SAD-Flower outperforms various generative-model-based baselines in ensuring constraint satisfaction.

[264] APP: Accelerated Path Patching with Task-Specific Pruning

Frauke Andersen, William Rudman, Ruochen Zhang, Carsten Eickhoff

Main category: cs.LG

TL;DR: Proposes Accelerated Path Patching (APP), a hybrid method combining Contrastive-FLAP pruning with Path Patching to speed up circuit discovery while maintaining circuit quality.

Details

Motivation: Current circuit discovery methods like Path Patching are computationally expensive and limit in-depth analysis for smaller models.

Method: APP uses Contrastive-FLAP pruning to reduce search space by 56% on average, then applies traditional Path Patching on remaining attention heads.

Result: Achieves 59.63%-93.27% speedup compared to Path Patching on dense models, with circuits showing substantial overlap and similar performance to established Path Patching circuits.

Conclusion: APP provides substantial computational savings while maintaining circuit quality, enabling more efficient mechanistic interpretability analysis.

Abstract: Circuit discovery is a key step in many mechanistic interpretability pipelines. Current methods, such as Path Patching, are computationally expensive and have limited in-depth circuit analysis for smaller models. In this study, we propose Accelerated Path Patching (APP), a hybrid approach leveraging our novel contrastive attention head pruning method to drastically reduce the search space of circuit discovery methods. Our Contrastive-FLAP pruning algorithm uses techniques from causal mediation analysis to assign higher pruning scores to task-specific attention heads, leading to higher performing sparse models compared to traditional pruning techniques. Although Contrastive-FLAP is successful at preserving task-specific heads that existing pruning algorithms remove at low sparsity ratios, the circuits found by Contrastive-FLAP alone are too large to satisfy the minimality constraint required in circuit analysis. APP first applies Contrastive-FLAP to reduce the search space required for circuit discovery algorithms by, on average, 56%. Next, APP applies traditional Path Patching on the remaining attention heads, leading to a speedup of 59.63%-93.27% compared to Path Patching applied to the dense model. Despite the substantial computational saving that APP provides, circuits obtained from APP exhibit substantial overlap and similar performance to previously established Path Patching circuits.

[265] Diffusion-Based Electromagnetic Inverse Design of Scattering Structured Media

Mikhail Tsukerman, Konstantin Grotov, Pavel Ginzburg

Main category: cs.LG

TL;DR: A conditional diffusion model for electromagnetic inverse design that generates dielectric sphere structures directly from target scattering profiles, bypassing iterative optimization and reducing design time from hours to seconds.

Details

Motivation: To overcome the computational expense and time-consuming nature of traditional iterative optimization methods in electromagnetic inverse design, which typically require hours of computation.

Method: A 1D U-Net architecture with Feature-wise Linear Modulation trained on 11,000 simulated metasurfaces to map desired angular scattering patterns to 2x2 dielectric sphere structures, naturally handling non-uniqueness by sampling diverse valid designs.

Result: Achieves median MPE below 19% on unseen targets (best: 1.39%), outperforming CMA-ES evolutionary optimization while reducing design time from hours to seconds.

Conclusion: Diffusion models are promising for advancing electromagnetic inverse design research, enabling rapid exploration of complex metasurface architectures and accelerating development of next-generation photonic and wireless communication systems.

Abstract: We present a conditional diffusion model for electromagnetic inverse design that generates structured media geometries directly from target differential scattering cross-section profiles, bypassing expensive iterative optimization. Our 1D U-Net architecture with Feature-wise Linear Modulation learns to map desired angular scattering patterns to 2x2 dielectric sphere structures, naturally handling the non-uniqueness of inverse problems by sampling diverse valid designs. Trained on 11,000 simulated metasurfaces, the model achieves median MPE below 19% on unseen targets (best: 1.39%), outperforming CMA-ES evolutionary optimization while reducing design time from hours to seconds. These results demonstrate that employing diffusion models is promising for advancing electromagnetic inverse design research, potentially enabling rapid exploration of complex metasurface architectures and accelerating the development of next-generation photonic and wireless communication systems. The code is publicly available at https://github.com/mikzuker/inverse_design_metasurface_generation.

[266] Adversarially Robust Multitask Adaptive Control

Kasra Fallah, Leonardo F. Toso, James Anderson

Main category: cs.LG

TL;DR: Adversarially robust multitask adaptive LQR control using clustered approach with resilient aggregation to handle model uncertainty and adversarial corruption.

Details

Motivation: Multiple systems need to collaboratively learn control policies under model uncertainty and adversarial corruption, requiring robust methods that can mitigate corrupted model updates.

Method: Proposed clustered multitask approach that integrates clustering and system identification with resilient aggregation to handle adversarial behavior.

Result: Established non-asymptotic bounds showing regret decreases inversely with number of honest systems per cluster, and this reduction is preserved under bounded fraction of adversarial systems within each cluster.

Conclusion: The clustered multitask approach with resilient aggregation effectively handles adversarial corruption in multitask LQR control while maintaining performance benefits from collaborative learning.

Abstract: We study adversarially robust multitask adaptive linear quadratic control, a setting where multiple systems collaboratively learn control policies under model uncertainty and adversarial corruption. We propose a clustered multitask approach that integrates clustering and system identification with resilient aggregation to mitigate corrupted model updates. Our analysis characterizes how clustering accuracy, intra-cluster heterogeneity, and adversarial behavior affect the expected regret of certainty-equivalent (CE) control across LQR tasks. We establish non-asymptotic bounds demonstrating that the regret decreases inversely with the number of honest systems per cluster and that this reduction is preserved under a bounded fraction of adversarial systems within each cluster.
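
One common way to realize resilient aggregation is a coordinate-wise trimmed mean, which discards the most extreme values in each coordinate before averaging. The abstract does not specify which aggregator the authors use, so the rule below, the function name, and the trimming parameter are assumptions for illustration only.

```python
def trimmed_mean(updates, num_trim):
    """Coordinate-wise trimmed mean of equally sized parameter vectors.
    In each coordinate, drop the num_trim smallest and num_trim largest
    values, then average the rest."""
    if 2 * num_trim >= len(updates):
        raise ValueError("num_trim is too large for the number of updates")
    dim = len(updates[0])
    aggregated = []
    for j in range(dim):
        column = sorted(u[j] for u in updates)
        kept = column[num_trim:len(column) - num_trim]
        aggregated.append(sum(kept) / len(kept))
    return aggregated

# Four honest systems report similar estimates; one corrupted system does not.
honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]]
corrupted = [[100.0, -100.0]]
estimate = trimmed_mean(honest + corrupted, num_trim=1)
plain_mean = [sum(u[j] for u in honest + corrupted) / 5 for j in range(2)]
```

With one value trimmed from each end, the corrupted update is discarded in both coordinates, whereas the plain mean is pulled far from the honest estimates.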

[267] Parameter-Efficient Conditioning for Material Generalization in Graph-Based Simulators

Naveen Raj Manoharan, Hassan Iqbal, Krishna Kumar

Main category: cs.LG

TL;DR: Parameter-efficient conditioning mechanism for Graph Network Simulators (GNS) that enables adaptation to different material properties in granular flows by targeting early message-passing layers, achieving accurate predictions with minimal training data.

Details

Motivation: Existing GNS models are trained for single material types and fail to generalize across different constitutive behaviors, limiting their real-world engineering applications where material properties vary.

Method: Proposed Feature-wise Linear Modulation (FiLM) conditioning mechanism targeting early message-passing layers, based on finding that material sensitivity is concentrated in initial layers. Fine-tunes only first 1-5 of 10 MP layers for efficiency.

Result: Achieves accurate long-term rollouts on unseen, interpolated, and moderately extrapolated material values (up to 2.5° friction angle, 0.25 kPa cohesion) with only 12 short simulation trajectories, representing 5x data reduction compared to baseline multi-task learning.

Conclusion: Enables GNS application in inverse design and closed-loop control tasks where material properties are design variables, successfully demonstrated through inverse problem solving for unknown cohesion parameters.

Abstract: Graph network-based simulators (GNS) have demonstrated strong potential for learning particle-based physics (such as fluids, deformable solids, and granular flows) while generalizing to unseen geometries due to their inherent inductive biases. However, existing models are typically trained for a single material type and fail to generalize across distinct constitutive behaviors, limiting their applicability in real-world engineering settings. Using granular flows as a running example, we propose a parameter-efficient conditioning mechanism that makes the GNS model adaptive to material parameters. We identify that sensitivity to material properties is concentrated in the early message-passing (MP) layers, a finding we link to the local nature of constitutive models (e.g., Mohr-Coulomb) and their effects on information propagation. We empirically validate this by showing that fine-tuning only the first few (1-5) of 10 MP layers of a pretrained model achieves test performance comparable to fine-tuning the entire network. Building on this insight, we propose a parameter-efficient Feature-wise Linear Modulation (FiLM) conditioning mechanism designed to specifically target these early layers. This approach produces accurate long-term rollouts on unseen, interpolated, or moderately extrapolated values (e.g., up to 2.5 degrees for friction angle and 0.25 kPa for cohesion) when trained exclusively on as few as 12 short simulation trajectories from new materials, representing a 5-fold data reduction compared to a baseline multi-task learning method. Finally, we validate the model’s utility by applying it to an inverse problem, successfully identifying unknown cohesion parameters from trajectory data. This approach enables the use of GNS in inverse design and closed-loop control tasks where material properties are treated as design variables.
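
Feature-wise Linear Modulation scales and shifts each feature channel using values predicted from a conditioning vector, here the material parameters. The sketch below shows that operation on a single feature vector; the linear generators for the scale and shift, their weights, and the example material values are hypothetical and not taken from the paper.

```python
def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(features, condition, gamma_w, gamma_b, beta_w, beta_b):
    """FiLM: out_i = gamma_i(condition) * features_i + beta_i(condition)."""
    gamma = linear(gamma_w, gamma_b, condition)
    beta = linear(beta_w, beta_b, condition)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]

# Hypothetical conditioning vector: (friction angle in degrees, cohesion in kPa).
material = [30.0, 0.5]
features = [1.0, -2.0]
gamma_w, gamma_b = [[0.0, 0.0], [0.1, 0.0]], [1.0, 0.0]   # gamma = [1.0, 3.0]
beta_w, beta_b = [[0.0, 2.0], [0.0, 0.0]], [0.0, 0.5]     # beta  = [1.0, 0.5]
modulated = film(features, material, gamma_w, gamma_b, beta_w, beta_b)
```

Because only the small generators for the scale and shift depend on the material, conditioning of this kind adds few parameters relative to the layers it modulates.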

[268] Synapse: Adaptive Arbitration of Complementary Expertise in Time Series Foundational Models

Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Yiwen Song, Long T. Le, Lesly Miculicich, Jinsung Yoon, Rui Zhang, Hamid Palangi, Tomas Pfister

Main category: cs.LG

TL;DR: Synapse is a novel arbitration framework that dynamically combines multiple pre-trained Time Series Foundational Models (TSFMs) by adaptively weighting their outputs based on context-dependent performance, achieving superior forecasting results compared to individual models and traditional ensembling methods.

Details

Motivation: Different TSFMs exhibit highly variable performance across forecasting tasks due to divergent training protocols and data sources. Leveraging their complementary expertise through arbitration remains unexplored but presents a compelling strategy for improved forecasting.

Method: Proposed Synapse framework: dynamically leverages a pool of TSFMs, assigns and adjusts predictive weights based on relative context-dependent performance, and constructs robust forecast distribution by adaptively sampling from constituent models’ output quantiles.

Result: Experimental results show Synapse consistently outperforms other popular ensembling techniques and individual TSFMs, demonstrating its efficacy in time series forecasting across various settings.

Conclusion: Synapse effectively leverages the specialized performance profiles of different TSFMs through dynamic arbitration, providing a robust solution that adapts to varying forecasting contexts and outperforms existing methods.

Abstract: Pre-trained Time Series Foundational Models (TSFMs) represent a significant advance, capable of forecasting diverse time series with complex characteristics, including varied seasonalities, trends, and long-range dependencies. Despite their primary goal of universal time series forecasting, their efficacy is far from uniform; divergent training protocols and data sources cause individual TSFMs to exhibit highly variable performance across different forecasting tasks, domains, and horizons. Leveraging this complementary expertise by arbitrating existing TSFM outputs presents a compelling strategy, yet this remains a largely unexplored area of research. In this paper, we conduct a thorough examination of how different TSFMs exhibit specialized performance profiles across various forecasting settings, and how we can effectively leverage this behavior in arbitration between different time series models. We specifically analyze how factors such as model selection and forecast horizon distribution can influence the efficacy of arbitration strategies. Based on this analysis, we propose Synapse, a novel arbitration framework for TSFMs. Synapse is designed to dynamically leverage a pool of TSFMs, assign and adjust predictive weights based on their relative, context-dependent performance, and construct a robust forecast distribution by adaptively sampling from the output quantiles of constituent models. Experimental results demonstrate that Synapse consistently outperforms other popular ensembling techniques as well as individual TSFMs, demonstrating Synapse’s efficacy in time series forecasting.
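
The arbitration idea can be sketched as follows: weight each model by its recent error on the current context, then build a forecast sample by repeatedly picking a model in proportion to its weight and drawing one of that model's output quantiles. The softmax weighting, the temperature, and all names below are assumptions made for illustration; the abstract does not give Synapse's actual weighting or sampling rules.

```python
import math
import random

def performance_weights(recent_errors, temperature=1.0):
    """Softmax over negative recent errors: lower error gives higher weight."""
    scores = [-e / temperature for e in recent_errors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def sample_forecast(model_quantiles, weights, num_samples, seed=0):
    """Draw samples by choosing a model according to its weight and then
    one of that model's quantile forecasts uniformly at random."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        quantiles = rng.choices(model_quantiles, weights=weights, k=1)[0]
        samples.append(rng.choice(quantiles))
    return samples

# Three hypothetical models with their recent errors and quantile forecasts.
weights = performance_weights([0.2, 1.0, 3.0])
quantiles = [[9.0, 10.0, 11.0], [7.0, 10.5, 14.0], [0.0, 5.0, 20.0]]
samples = sample_forecast(quantiles, weights, num_samples=200)
```

Recomputing `weights` as new observations arrive is what would make the combination context-dependent rather than a fixed ensemble.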

[269] On Flow Matching KL Divergence

Maojiang Su, Jerry Yao-Chieh Hu, Sophia Pi, Han Liu

Main category: cs.LG

TL;DR: The paper provides a deterministic, non-asymptotic upper bound on the KL divergence for flow-matching distribution approximation, showing that if the L2 flow-matching loss is bounded by ε², then the KL divergence is bounded by A₁ε + A₂ε², leading to statistical convergence rates and near minimax-optimal efficiency.

Details

Motivation: To establish theoretical guarantees for flow-matching methods by deriving explicit bounds on distribution approximation error and comparing their statistical efficiency with diffusion models.

Method: Derived deterministic, non-asymptotic upper bounds on KL divergence using flow-matching loss bounds, analyzed statistical convergence rates under Total Variation distance, and conducted numerical studies on synthetic and learned velocities.

Result: Proved that flow-matching achieves nearly minimax-optimal efficiency in estimating smooth distributions, with KL divergence bounded by A₁ε + A₂ε² when L2 flow-matching loss is bounded by ε².

Conclusion: Flow matching demonstrates comparable statistical efficiency to diffusion models under TV distance, with theoretical bounds supported by numerical validation.

Abstract: We derive a deterministic, non-asymptotic upper bound on the Kullback-Leibler (KL) divergence of the flow-matching distribution approximation. In particular, if the $L_2$ flow-matching loss is bounded by $\epsilon^2 > 0$, then the KL divergence between the true data distribution and the estimated distribution is bounded by $A_1 \epsilon + A_2 \epsilon^2$. Here, the constants $A_1$ and $A_2$ depend only on the regularities of the data and velocity fields. Consequently, this bound implies statistical convergence rates of Flow Matching Transformers under the Total Variation (TV) distance. We show that flow matching achieves nearly minimax-optimal efficiency in estimating smooth distributions. Our results make the statistical efficiency of flow matching comparable to that of diffusion models under the TV distance. Numerical studies on synthetic and learned velocities corroborate our theory.
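
The step from the KL bound to a Total Variation bound is presumably an application of Pinsker's inequality. The abstract does not spell this out, so the following is a sketch under that assumption, writing $p$ for the true data distribution and $\hat{p}$ for the estimated one (notation introduced here) and taking TV as the supremum over events of the difference in probability.

```latex
\mathrm{TV}(p, \hat{p})
  \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p \,\|\, \hat{p})}
  \;\le\; \sqrt{\tfrac{1}{2}\left(A_1 \epsilon + A_2 \epsilon^2\right)} .
```

For small $\epsilon$ the linear term dominates, so under this reading the TV error scales like $\sqrt{\epsilon}$ up to constants.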

[270] SiamMM: A Mixture Model Perspective on Deep Unsupervised Learning

Xiaodong Wang, Jing Huang, Kevin J Liang

Main category: cs.LG

TL;DR: Connects unsupervised clustering methods to classical mixture models, developing SiamMM which achieves state-of-the-art performance in self-supervised learning and reveals potential mislabeling in datasets.

Details

Motivation: Clustering-based approaches for self-supervised learning are effective but often applied heuristically without clear optimal methodology.

Method: Establishes connections between unsupervised clustering methods and classical mixture models, leading to the development of SiamMM model.

Result: Achieves state-of-the-art performance across various self-supervised learning benchmarks. Learned clusters strongly resemble unseen ground truth labels and uncover potential mislabeling instances.

Conclusion: The framework connecting clustering methods to mixture models enables significant enhancements and reveals insights about dataset quality through cluster analysis.

Abstract: Recent studies have demonstrated the effectiveness of clustering-based approaches for self-supervised and unsupervised learning. However, the application of clustering is often heuristic, and the optimal methodology remains unclear. In this work, we establish connections between these unsupervised clustering methods and classical mixture models from statistics. Through this framework, we demonstrate significant enhancements to these clustering methods, leading to the development of a novel model named SiamMM. Our method attains state-of-the-art performance across various self-supervised learning benchmarks. Inspection of the learned clusters reveals a strong resemblance to unseen ground truth labels, uncovering potential instances of mislabeling.

[271] DGTN: Graph-Enhanced Transformer with Diffusive Attention Gating Mechanism for Enzyme DDG Prediction

Abigail Lin

Main category: cs.LG

TL;DR: DGTN is a novel architecture that co-learns GNN weights and transformer attention through bidirectional diffusion, achieving state-of-the-art performance in predicting mutation effects on enzyme stability.

Details

Motivation: Existing deep learning approaches process sequence and structure information independently, failing to capture the intricate coupling between local structural geometry and global sequential patterns in protein stability prediction.

Method: DGTN uses a bidirectional diffusion process where GNN-derived structural embeddings guide transformer attention via learnable diffusion kernels, and transformer representations refine GNN message passing through attention-modulated graph updates.

Result: On ProTherm and SKEMPI benchmarks, DGTN achieves state-of-the-art performance (Pearson Rho = 0.87, RMSE = 1.21 kcal/mol) with 6.2% improvement over best baselines. The diffusion mechanism contributes 4.8 points to correlation.

Conclusion: This work establishes a principled framework for integrating heterogeneous protein representations through learnable diffusion, with proven convergence to optimal structure-sequence coupling.

Abstract: Predicting the effect of amino acid mutations on enzyme thermodynamic stability (DDG) is fundamental to protein engineering and drug design. While recent deep learning approaches have shown promise, they often process sequence and structure information independently, failing to capture the intricate coupling between local structural geometry and global sequential patterns. We present DGTN (Diffused Graph-Transformer Network), a novel architecture that co-learns graph neural network (GNN) weights for structural priors and transformer attention through a diffusion mechanism. Our key innovation is a bidirectional diffusion process where: (1) GNN-derived structural embeddings guide transformer attention via learnable diffusion kernels, and (2) transformer representations refine GNN message passing through attention-modulated graph updates. We provide rigorous mathematical analysis showing this co-learning scheme achieves provably better approximation bounds than independent processing. On ProTherm and SKEMPI benchmarks, DGTN achieves state-of-the-art performance (Pearson Rho = 0.87, RMSE = 1.21 kcal/mol), a 6.2% improvement over the best baselines. Ablation studies confirm the diffusion mechanism contributes 4.8 points to correlation. Our theoretical analysis proves the diffused attention converges to optimal structure-sequence coupling, with convergence rate $O(1/\sqrt{T})$, where $T$ is the number of diffusion steps. This work establishes a principled framework for integrating heterogeneous protein representations through learnable diffusion.

[272] Precipitation nowcasting of satellite data using physically conditioned neural networks

Antônio Catão, Melvin Poveda, Leonardo Voltarelli, Paulo Orenstein

Main category: cs.LG

TL;DR: TUPANN is a satellite-only precipitation nowcasting model that uses physics-aligned deep learning to forecast rainfall up to 3 hours ahead, achieving state-of-the-art performance across diverse climates.

Details

Motivation: Current short-term precipitation forecasts rely heavily on dense weather-radar networks, which limits operational value in regions most vulnerable to climate extremes where radar coverage is sparse.

Method: TUPANN decomposes forecasts into physically meaningful components: variational encoder-decoder infers motion/intensity fields under optical-flow supervision, lead-time-conditioned MaxViT evolves latent state, and differentiable advection reconstructs future frames.

Result: TUPANN achieves best or second-best performance in most settings across four distinct climates (Rio, Manaus, Miami, La Paz) at 10-180min lead times, with pronounced gains at higher rainfall thresholds. Training on multiple cities improves performance, and cross-city experiments show modest degradation.

Conclusion: Physically aligned learning can provide skillful, transferable and global precipitation nowcasts using only satellite data, enabling near real-time forecasting in radar-sparse regions vulnerable to climate extremes.

Abstract: Accurate short-term precipitation forecasts predominantly rely on dense weather-radar networks, limiting operational value in places most exposed to climate extremes. We present TUPANN (Transferable and Universal Physics-Aligned Nowcasting Network), a satellite-only model trained on GOES-16 RRQPE. Unlike most deep learning models for nowcasting, TUPANN decomposes the forecast into physically meaningful components: a variational encoder-decoder infers motion and intensity fields from recent imagery under optical-flow supervision, a lead-time-conditioned MaxViT evolves the latent state, and a differentiable advection operator reconstructs future frames. We evaluate TUPANN on both GOES-16 and IMERG data, in up to four distinct climates (Rio de Janeiro, Manaus, Miami, La Paz) at 10-180min lead times using the CSI and HSS metrics over 4-64 mm/h thresholds. Comparisons against optical-flow, deep learning and hybrid baselines show that TUPANN achieves the best or second-best skill in most settings, with pronounced gains at higher thresholds. Training on multiple cities further improves performance, while cross-city experiments show modest degradation and occasional gains for rare heavy-rain regimes. The model produces smooth, interpretable motion fields aligned with numerical optical flow and runs in near real time due to the low latency of GOES-16. These results indicate that physically aligned learning can provide nowcasts that are skillful, transferable and global.
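
The CSI and HSS scores mentioned in the abstract are standard categorical verification metrics computed from a 2x2 contingency table at a given rain-rate threshold. The sketch below shows the textbook definitions on toy numbers; the function names, the `>=` event rule, and the sample values are illustrative assumptions, not the paper's evaluation code.

```python
def contingency(forecast, observed, threshold):
    """Count hits, false alarms, misses, and correct negatives at one threshold."""
    hits = false_alarms = misses = correct_negatives = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f >= threshold, o >= threshold  # assumed event rule
        if f_event and o_event:
            hits += 1
        elif f_event:
            false_alarms += 1
        elif o_event:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def csi(hits, false_alarms, misses, correct_negatives):
    """Critical Success Index: hits / (hits + false alarms + misses)."""
    denom = hits + false_alarms + misses
    return hits / denom if denom else float("nan")


def hss(hits, false_alarms, misses, correct_negatives):
    """Heidke Skill Score: accuracy relative to random chance."""
    denom = ((hits + misses) * (misses + correct_negatives)
             + (hits + false_alarms) * (false_alarms + correct_negatives))
    num = 2.0 * (hits * correct_negatives - false_alarms * misses)
    return num / denom if denom else float("nan")


# Toy rain rates in mm/h (made up for illustration).
counts = contingency([0, 5, 10, 20, 3, 40], [1, 6, 2, 25, 8, 35], threshold=4)
print(counts, csi(*counts), hss(*counts))  # (3, 1, 1, 1) 0.6 0.25
```

CSI ignores correct negatives, which makes it sensitive at high thresholds where rain events are rare; HSS additionally corrects for agreement expected by chance.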

[273] SoilX: Calibration-Free Comprehensive Soil Sensing Through Contrastive Cross-Component Learning

Kang Yang, Yuanlin Yang, Yuning Chen, Sikai Yang, Xinyu Zhang, Wan Du

Main category: cs.LG

TL;DR: SoilX is a calibration-free wireless soil sensing system that jointly measures six key soil components (moisture, nitrogen, phosphorus, potassium, carbon, aluminum) without requiring recalibration for different soil textures.

Details

Motivation: Current wireless soil sensing solutions require recalibration to handle variations in soil texture (aluminosilicates and organic carbon), limiting their practical deployment in precision agriculture.

Method: SoilX uses Contrastive Cross-Component Learning (3CL) with Orthogonality Regularizer and Separation Loss to disentangle cross-component interference, and a novel tetrahedral antenna array with antenna-switching mechanism for robust dielectric permittivity measurement.

Result: SoilX reduces estimation errors by 23.8% to 31.5% compared to baselines and generalizes well to unseen fields without recalibration.

Conclusion: SoilX provides a practical, calibration-free solution for continuous soil monitoring that can handle soil texture variations, making it suitable for real-world precision agriculture applications.

Abstract: Precision agriculture demands continuous and accurate monitoring of soil moisture (M) and key macronutrients, including nitrogen (N), phosphorus (P), and potassium (K), to optimize yields and conserve resources. Wireless soil sensing has been explored to measure these four components; however, current solutions require recalibration (i.e., retraining the data processing model) to handle variations in soil texture, characterized by aluminosilicates (Al) and organic carbon (C), limiting their practicality. To address this, we introduce SoilX, a calibration-free soil sensing system that jointly measures six key components: {M, N, P, K, C, Al}. By explicitly modeling C and Al, SoilX eliminates texture- and carbon-dependent recalibration. SoilX incorporates Contrastive Cross-Component Learning (3CL), with two customized terms: the Orthogonality Regularizer and the Separation Loss, to effectively disentangle cross-component interference. Additionally, we design a novel tetrahedral antenna array with an antenna-switching mechanism, which can robustly measure soil dielectric permittivity independent of device placement. Extensive experiments demonstrate that SoilX reduces estimation errors by 23.8% to 31.5% over baselines and generalizes well to unseen fields.

[274] RNN(p) for Power Consumption Forecasting

Roberto Baviera, Pietro Manzoni

Main category: cs.LG

TL;DR: RNN(p) models generalize linear autoregressive models for multi-scale seasonal forecasting, with efficient training and high interpretability.

Details

Motivation: To develop powerful forecasting tools for variables with seasonal patterns across multiple time scales, commonly found in energy, economic, and financial time series.

Method: RNN(p) architecture with structured feedbacks across time lags, enabling efficient training strategies and comparative study of learning algorithms.

Result: RNN(p) models achieve excellent forecasting accuracy in power consumption forecasting while maintaining high interpretability.

Conclusion: RNN(p) models are well-suited for decision-making in energy markets and fintech applications where reliable predictions have significant economic impact.

Abstract: An elementary Recurrent Neural Network that operates on p time lags, called an RNN(p), is the natural generalisation of a linear autoregressive model ARX(p). It is a powerful forecasting tool for variables displaying inherent seasonal patterns across multiple time scales, as is often observed in energy, economic, and financial time series. The architecture of RNN(p) models, characterised by structured feedbacks across time lags, enables the design of efficient training strategies. We conduct a comparative study of learning algorithms for these models, providing a rigorous analysis of their computational complexity and training performance. We present two applications of RNN(p) models in power consumption forecasting, a key domain within the energy sector where accurate forecasts inform both operational and financial decisions. Experimental results show that RNN(p) models achieve excellent forecasting accuracy while maintaining a high degree of interpretability. These features make them well-suited for decision-making in energy markets and other fintech applications where reliable predictions play a significant economic role.
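
To make the relation to ARX(p) concrete, here is a minimal sketch of a scalar recursion with feedback over p lags. The abstract does not give the exact RNN(p) parameterisation, so the update rule, the function name `run_lagged_recursion`, and all parameter names are assumptions chosen for illustration; with the identity activation the recursion reduces to a linear autoregression with an exogenous input.

```python
import math


def run_lagged_recursion(inputs, feedback, input_weight, bias, activation=math.tanh):
    """Hypothetical update: h_t = activation(bias + input_weight * x_t + sum_i feedback[i] * h_{t-1-i}).

    `feedback` has length p; missing history at the start is treated as zero.
    """
    p = len(feedback)
    history = [0.0] * p  # h_{t-1}, ..., h_{t-p}, most recent first
    states = []
    for x in inputs:
        pre = bias + input_weight * x + sum(w * h for w, h in zip(feedback, history))
        h = activation(pre)
        states.append(h)
        history = ([h] + history)[:p]  # shift the lag window
    return states


# Identity activation: a linear AR(2) recursion driven by an impulse input.
linear = run_lagged_recursion([1.0, 0.0, 0.0], [0.5, 0.25], 1.0, 0.0, activation=lambda z: z)
print(linear)  # [1.0, 0.5, 0.5]

# Same weights with tanh: a nonlinear variant whose states stay in (-1, 1).
nonlinear = run_lagged_recursion([1.0, 0.0, 0.0], [0.5, 0.25], 1.0, 0.0)
```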

[275] A Multi-Stage Automated Online Network Data Stream Analytics Framework for IIoT Systems

Li Yang, Abdallah Shami

Main category: cs.LG

TL;DR: Proposes MSANA framework for concept drift adaptation in IIoT systems, featuring automated data preprocessing, dynamic feature selection, model learning, and ensemble methods to improve network data stream analytics in Industry 5.0 environments.

Details

Motivation: Address concept drift issues in IIoT network data stream analytics that cause performance degradation and automation difficulties in dynamic industrial environments, supporting Industry 5.0's human-machine collaboration goals.

Method: Multi-stage framework with dynamic data preprocessing, Drift-based Dynamic Feature Selection (DD-FS), dynamic model learning & selection, and Window-based Performance Weighted Probability Averaging Ensemble (W-PWPAE) model.

Result: Experimental results on two public IoT datasets show the framework outperforms state-of-the-art methods for IIoT data stream analytics.

Conclusion: MSANA provides a complete automated data stream analytics framework that enables automatic, effective, and efficient data analytics for IIoT systems in Industry 5.0.

Abstract: Industry 5.0 aims at maximizing the collaboration between humans and machines. Machines are capable of automating repetitive jobs, while humans handle creative tasks. As a critical component of Industrial Internet of Things (IIoT) systems for service delivery, network data stream analytics often encounter concept drift issues due to dynamic IIoT environments, causing performance degradation and automation difficulties. In this paper, we propose a novel Multi-Stage Automated Network Analytics (MSANA) framework for concept drift adaptation in IIoT systems, consisting of dynamic data pre-processing, the proposed Drift-based Dynamic Feature Selection (DD-FS) method, dynamic model learning & selection, and the proposed Window-based Performance Weighted Probability Averaging Ensemble (W-PWPAE) model. It is a complete automated data stream analytics framework that enables automatic, effective, and efficient data analytics for IIoT systems in Industry 5.0. Experimental results on two public IoT datasets demonstrate that the proposed framework outperforms state-of-the-art methods for IIoT data stream analytics.

[276] Non-stationary Delayed Online Convex Optimization: From Full-information to Bandit Setting

Yuanyu Wan, Chang Yao, Yitao Ma, Mingli Song, Lijun Zhang

Main category: cs.LG

TL;DR: This paper proposes Mild-OGD, an algorithm for delayed online convex optimization in non-stationary environments, achieving optimal dynamic regret bounds with both full-information and bandit feedback.

Details

Motivation: Previous OCO studies focused on stationary environments with static regret, but real-world applications often involve non-stationary environments and delayed feedback.

Method: Maintains multiple experts with different learning rates for delayed gradients, uses meta-algorithm to track best expert based on delayed performance. Also develops bandit variant for delayed loss values only.

Result: Achieves dynamic regret bounds of O(√(d̄T(P_T+1))) under in-order delays and O(√(dT(P_T+1))) in worst case, with matching lower bound proving optimality. Bandit variant performs comparably to non-delayed algorithms under large delays.

Conclusion: The proposed Mild-OGD algorithm effectively handles delayed OCO in non-stationary environments with optimal dynamic regret bounds, and the bandit variant shows strong performance even with significant delays.

Abstract: Although online convex optimization (OCO) under arbitrary delays has received increasing attention recently, previous studies focus on stationary environments with the goal of minimizing static regret. In this paper, we investigate the delayed OCO in non-stationary environments, and choose dynamic regret with respect to any sequence of comparators as the performance metric. To this end, we first propose an algorithm called Mild-OGD for the full-information case, where delayed gradients are available. The basic idea is to maintain multiple experts in parallel, each performing a gradient descent step with different learning rates for every delayed gradient according to their arrival order, and utilize a meta-algorithm to track the best one based on their delayed performance. Despite the simplicity of this idea, our novel analysis shows that the dynamic regret of Mild-OGD can be automatically bounded by $O(\sqrt{\bar{d}T(P_T+1)})$ under the in-order assumption and $O(\sqrt{dT(P_T+1)})$ in the worst case, where $\bar{d}$ and $d$ denote the average and maximum delay respectively, $T$ is the time horizon, and $P_T$ is the path-length of comparators. Moreover, we demonstrate that the result in the worst case is optimal by deriving a matching lower bound. Finally, we develop a bandit variant of Mild-OGD for a more challenging case with only delayed loss values. Interestingly, we prove that under a relatively large amount of delay, our bandit algorithm even enjoys the best dynamic regret bound of existing non-delayed bandit algorithms.
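
The expert-plus-meta idea described above can be sketched in one dimension as follows. This is a simplified illustration, not the authors' Mild-OGD: the domain [-1, 1], the exponential-weights meta update on the linearised loss, the arrival convention (feedback for round t arrives at the end of round t + delays[t] - 1), and every name and constant are assumptions.

```python
import math


def clip(x, lo=-1.0, hi=1.0):
    """Project onto the assumed decision interval [lo, hi]."""
    return max(lo, min(hi, x))


def delayed_expert_ogd(grad_fns, delays, learning_rates, meta_rate=0.5):
    """Run several gradient-descent experts on delayed gradients and mix them.

    grad_fns[t](x) returns the gradient of round t's loss at x; delays[t] >= 1.
    """
    experts = [0.0] * len(learning_rates)
    weights = [1.0 / len(learning_rates)] * len(learning_rates)
    pending = {}  # arrival round -> gradients queried earlier
    decisions = []
    for t in range(len(grad_fns)):
        x = sum(w * e for w, e in zip(weights, experts))  # play the weighted average
        decisions.append(x)
        pending.setdefault(t + delays[t] - 1, []).append(grad_fns[t](x))
        for g in pending.pop(t, []):  # process gradients in arrival order
            # Meta update: exponential weights on the linearised loss g * expert.
            weights = [w * math.exp(-meta_rate * g * e) for w, e in zip(weights, experts)]
            total = sum(weights)
            weights = [w / total for w in weights]
            # Each expert takes one projected gradient step with its own rate.
            experts = [clip(e - lr * g) for e, lr in zip(experts, learning_rates)]
    return decisions


# Toy run: fixed loss (x - 0.5)^2, every gradient delayed by two rounds.
T = 200
plays = delayed_expert_ogd([lambda x: 2.0 * (x - 0.5)] * T, [2] * T, [0.01, 0.05, 0.2])
```

Running experts with a grid of learning rates avoids having to know the comparator path-length in advance; the meta step shifts weight toward whichever rate has been performing best on the feedback received so far.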

[277] Learning for Interval Prediction of Electricity Demand: A Cluster-based Bootstrapping Approach

Rohit Dube, Natarajan Gautam, Amarnath Banerjee, Harsha Nagarajan

Main category: cs.LG

TL;DR: A residual bootstrap algorithm for day-ahead electricity demand interval estimation in Microgrids, using ML point estimates and clustering of similar demand patterns.

Details

Motivation: Accurate electricity demand predictions are crucial for Microgrid operations, but low aggregation makes demands highly stochastic, requiring interval estimates to quantify uncertainty around point predictions.

Method: Uses ML for point estimates, stores residuals in memory partitioned by clusters of similar demand days identified via unsupervised learning, then bootstraps residuals from the closest cluster for test days.

Result: Evaluated on real EULR electricity demand data and compared to other bootstrapping methods across varying confidence intervals.

Conclusion: The proposed residual bootstrap algorithm with clustering provides effective interval estimation for day-ahead electricity demand in stochastic Microgrid settings.

Abstract: Accurate predictions of electricity demand are necessary for managing operations in a small-aggregation load setting such as a Microgrid. Due to low aggregation, the electricity demand can be highly stochastic, and point estimates would lead to inflated errors. Interval estimation in this scenario provides a range of values within which future values might lie and helps quantify the errors around the point estimates. This paper introduces a residual bootstrap algorithm to generate interval estimates of day-ahead electricity demand. A machine learning algorithm is used to obtain the point estimates of electricity demand and the respective residuals on the training set. The obtained residuals are stored in memory, and the memory is further partitioned. Days with similar demand patterns are grouped into clusters using an unsupervised learning algorithm, and these clusters are used to partition the memory. The point estimates for a test day are used to find the closest cluster of similar days, and the residuals are bootstrapped from the chosen cluster. This algorithm is evaluated on real electricity demand data from EULR (End Use Load Research) and is compared to other bootstrapping methods for varying confidence intervals.
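
The cluster-then-bootstrap procedure can be sketched as follows. This is a minimal illustration under stated assumptions: cluster centroids and per-cluster residual pools are taken as given (the paper's clustering and forecasting models are not reproduced), residuals are pooled across hours, distance is squared Euclidean, and all names and numbers are hypothetical.

```python
import random


def nearest_cluster(point_forecast, centroids):
    """Index of the centroid closest to the test day's forecast profile."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sqdist(point_forecast, centroids[k]))


def bootstrap_interval(point_forecast, centroids, residual_memory,
                       level=0.9, draws=1000, seed=0):
    """Per-step interval from residuals resampled out of the closest cluster's pool."""
    rng = random.Random(seed)
    pool = residual_memory[nearest_cluster(point_forecast, centroids)]
    lo_q = (1.0 - level) / 2.0
    hi_q = 1.0 - lo_q
    lower, upper = [], []
    for yhat in point_forecast:
        # Simulate plausible outcomes as forecast + resampled training residual.
        sims = sorted(yhat + rng.choice(pool) for _ in range(draws))
        lower.append(sims[int(lo_q * (draws - 1))])
        upper.append(sims[int(hi_q * (draws - 1))])
    return lower, upper


# Toy setup: two clusters of "similar days" with their stored residuals.
centroids = [[1.0, 1.0], [10.0, 10.0]]
residual_memory = [[-0.1, 0.1], [-2.0, 0.0, 2.0]]
lower, upper = bootstrap_interval([9.0, 11.0], centroids, residual_memory)
```

Drawing residuals only from the matched cluster lets the interval width adapt to the day type: a forecast resembling the low-demand cluster would get the narrow pool, while this one gets the wider pool.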

[278] Characterizing the Training Dynamics of Private Fine-tuning with Langevin diffusion

Shuqi Ke, Charlie Hou, Sewoong Oh, Giulia Fanti

Main category: cs.LG

TL;DR: Differentially private full fine-tuning (DP-FFT) distorts pre-trained features due to misalignment between backbone and head. Sequential fine-tuning (DP-LP-FFT) mitigates this distortion. Theoretical analysis provides bounds and reveals privacy budget allocation trade-offs.

Details

Motivation: To understand and address the feature distortion problem in differentially private full fine-tuning of pre-trained models, which occurs due to misalignment between pre-trained backbone and randomly initialized linear head.

Method: Proposed sequential fine-tuning strategy: first-linear-probing-then-fine-tuning (DP-LP-FFT). Developed theoretical analysis using 2-layer neural networks with ReLU activation, deriving approximate bounds on training loss. Also analyzed 2-layer linear networks without approximation.

Result: Theoretical and empirical results show DP-LP-FFT mitigates feature distortion compared to DP-FFT. Experiments on real-world datasets confirm theoretical insights. Derived new upper bounds for 2-layer linear networks.

Conclusion: Sequential fine-tuning (DP-LP-FFT) effectively addresses feature distortion in differentially private fine-tuning. Theoretical analysis provides insights into privacy budget allocation trade-offs in multi-phase fine-tuning methods.

Abstract: We show that differentially private full fine-tuning (DP-FFT) can distort pre-trained backbone features based on both theoretical and empirical results. We identify the cause of the distortion as the misalignment between the pre-trained backbone and the randomly initialized linear head. We prove that a sequential fine-tuning strategy can mitigate the feature distortion: first-linear-probing-then-fine-tuning (DP-LP-FFT). A new approximation scheme allows us to derive approximate upper and lower bounds on the training loss of DP-LP and DP-FFT, in a simple but canonical setting of 2-layer neural networks with ReLU activation. Experiments on real-world datasets and architectures are consistent with our theoretical insights. We also derive new upper bounds for 2-layer linear networks without the approximation. Moreover, our theory suggests a trade-off of privacy budget allocation in multi-phase fine-tuning methods like DP-LP-FFT.

[279] Tactical Decision Making for Autonomous Trucks by Deep Reinforcement Learning with Total Cost of Operation Based Reward

Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani

Main category: cs.LG

TL;DR: Deep reinforcement learning framework for autonomous truck tactical decision making in highway scenarios, separating high-level decisions from low-level control.

Details

Motivation: To improve tactical decision making for autonomous trucks in highway scenarios, specifically for Adaptive Cruise Control and lane change maneuvers.

Method: Separates high-level decision-making from low-level control actions between RL agent and physical model controllers. Uses realistic multi-objective reward function based on Total Cost of Operation with different optimization approaches.

Result: Demonstrates benefits of separating decision-making processes from control actions. Studies optimization with weighted rewards, normalized rewards, and curriculum learning techniques.

Conclusion: The separation between high-level reinforcement learning decision making and low-level physical model controllers is beneficial for autonomous truck tactical operations.

Abstract: We develop a deep reinforcement learning framework for tactical decision making in an autonomous truck, specifically for Adaptive Cruise Control (ACC) and lane change maneuvers in a highway scenario. Our results demonstrate that it is beneficial to separate high-level decision-making processes and low-level control actions between the reinforcement learning agent and the low-level controllers based on physical models. We then study how to optimize performance with a realistic, multi-objective reward function based on the Total Cost of Operation (TCOP) of the truck, using three approaches: adding weights to the reward components, normalizing the reward components, and using curriculum learning techniques.

[280] A Closer Look at Deep Learning Methods on Tabular Datasets

Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, De-Chuan Zhan

Main category: cs.LG

TL;DR: Extensive evaluation of tabular prediction methods using TALENT benchmark (300+ datasets) shows tree ensembles remain strong but pretrained models are catching up, with dataset heterogeneity determining method performance.

Details

Motivation: Need for systematic evaluation of deep tabular prediction methods and understanding their behavior across diverse datasets as pretrained foundation models advance.

Method: Used TALENT benchmark with 300+ datasets spanning various sizes, feature compositions, domains, and output types. Evaluated tree-based and neural approaches, including ensembling. Analyzed dataset heterogeneity through meta-features and early training dynamics.

Result: Ensembling benefits both tree-based and neural methods. Gradient-boosted trees remain strong baselines, but pretrained tabular models now match or surpass them on many tasks. Top performance concentrates in a small subset of models. Dataset heterogeneity largely determines which method family performs best.

Conclusion: Provides actionable insights for method selection and future directions in deep tabular learning, with TALENT-tiny core for rapid evaluation and TALENT-extension for stress testing high-dimensional/large-scale settings.

Abstract: Tabular data is prevalent across diverse domains in machine learning. With the rapid progress of deep tabular prediction methods, especially pretrained (foundation) models, there is a growing need to evaluate these methods systematically and to understand their behavior. We present an extensive study on TALENT, a collection of 300+ datasets spanning broad ranges of size, feature composition (numerical/categorical mixes), domains, and output types (binary, multi-class, regression). Our evaluation shows that ensembling benefits both tree-based and neural approaches. Traditional gradient-boosted trees remain very strong baselines, yet recent pretrained tabular models now match or surpass them on many tasks, narrowing, but not eliminating, the historical advantage of tree ensembles. Despite architectural diversity, top performance concentrates within a small subset of models, providing practical guidance for method selection. To explain these outcomes, we quantify dataset heterogeneity by learning from meta-features and early training dynamics to predict later validation behavior. This dynamics-aware analysis indicates that heterogeneity, such as the interplay of categorical and numerical attributes, largely determines which family of methods is favored. Finally, we introduce a two-level design beyond the 300 common-size datasets: a compact TALENT-tiny core (45 datasets) for rapid, reproducible evaluation, and a TALENT-extension suite targeting high-dimensional, many-class, and very large-scale settings for stress testing. In summary, these results offer actionable insights into the strengths, limitations, and future directions for improving deep tabular learning.

[281] ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models

Duy M. H. Nguyen, Nghiem T. Diep, Trung Q. Nguyen, Hoang-Bao Le, Tai Nguyen, Tien Nguyen, TrungTin Nguyen, Nhat Ho, Pengtao Xie, Roger Wattenhofer, James Zou, Daniel Sonntag, Mathias Niepert

Main category: cs.LG

TL;DR: ExGra-Med introduces a multi-graph alignment framework for medical multi-modal LLMs that improves vision-language alignment using only 10% of pre-training data while matching or outperforming state-of-the-art models.

Details

Motivation: Current med-MLLMs suffer from weak vision-language alignment due to reliance on autoregressive objectives and large-scale training, making them overly dependent on costly instruction-following data.

Method: A novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in latent space, with an efficient end-to-end training scheme using black-box gradient estimation for large LLMs.

Result: Achieves LLaVA-Med’s performance with only 10% pre-training data, gains 20.13% on VQA-RAD, and outperforms BioMedGPT and RadFM on visual chatbot and zero-shot classification tasks.

Conclusion: ExGra-Med demonstrates efficient, high-quality vision-language integration in medical AI through improved semantic grounding and cross-modal coherence with significantly reduced data requirements.

Abstract: State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLaVA-Med and BioMedGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce ExGra-Med, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMA-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization. Empirically, ExGra-Med matches LLaVA-Med’s performance using just 10% of the pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance. It also outperforms strong baselines like BioMedGPT and RadFM on visual chatbot and zero-shot classification tasks, demonstrating its promise for efficient, high-quality vision-language integration in medical AI.

Fuying Wang, Feng Wu, Yihan Tang, Lequan Yu

Main category: cs.LG

TL;DR: CTPD framework discovers cross-modal temporal patterns from multimodal EHR data (numerical time series + clinical notes) to improve clinical outcome predictions by aligning temporal patterns across modalities.

Details

Motivation: Existing methods focus on temporal interactions within individual samples and multimodal fusion, but overlook critical temporal patterns across patients that can indicate deteriorating health or critical events.

Method: Cross-Modal Temporal Pattern Discovery (CTPD) framework with shared initial temporal pattern representations refined using slot attention, plus contrastive-based TPNCE loss for cross-modal alignment and reconstruction losses to retain modality-specific information.

Result: Superior performance on 48-hour in-hospital mortality and 24-hour phenotype classification tasks using MIMIC-III database compared to existing approaches.

Conclusion: The CTPD framework effectively discovers meaningful cross-modal temporal patterns from multimodal EHR data, improving clinical outcome prediction accuracy by capturing critical temporal patterns across patients and modalities.

Abstract: Integrating multimodal Electronic Health Records (EHR) data, such as numerical time series and free-text clinical reports, has great potential in predicting clinical outcomes. However, prior work has primarily focused on capturing temporal interactions within individual samples and fusing multimodal information, overlooking critical temporal patterns across patients. These patterns, such as trends in vital signs like abnormal heart rate or blood pressure, can indicate deteriorating health or an impending critical event. Similarly, clinical notes often contain textual descriptions that reflect these patterns. Identifying corresponding temporal patterns across different modalities is crucial for improving the accuracy of clinical outcome predictions, yet it remains a challenging task. To address this gap, we introduce a Cross-Modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations which are refined using slot attention to generate temporal semantic embeddings. To ensure rich cross-modal temporal semantics in the learned patterns, we introduce a contrastive-based TPNCE loss for cross-modal alignment, along with two reconstruction losses to retain core information of each modality. Evaluations on two clinically critical tasks, 48-hour in-hospital mortality and 24-hour phenotype classification, using the MIMIC-III database demonstrate the superiority of our method over existing approaches.

[283] P1-KAN: an effective Kolmogorov-Arnold network with application to hydraulic valley optimization

Xavier Warin

Main category: cs.LG

TL;DR: A new Kolmogorov-Arnold network (KAN) is proposed for approximating irregular functions in high dimensions, with error bounds for smooth functions and universal approximation theorems for continuous functions. It outperforms multilayer perceptrons in accuracy and convergence speed, and shows superior performance for irregular functions compared to other KAN networks.

Details

Motivation: To develop a neural network architecture that can effectively approximate irregular functions in high-dimensional spaces, addressing limitations of traditional multilayer perceptrons and improving upon existing Kolmogorov-Arnold network variants.

Method: Proposes a new Kolmogorov-Arnold network (KAN) architecture that leverages the Kolmogorov-Arnold representation theorem, with specific focus on handling irregular functions through appropriate expansion functions and providing theoretical guarantees.

Result: The proposed KAN outperforms multilayer perceptrons in accuracy and convergence speed. For irregular functions, it surpasses all other KAN networks, while achieving similar accuracy to the original spline-based KAN for smooth functions. Successful application in optimizing a French hydraulic valley demonstrates practical utility.

Conclusion: The new KAN network provides an effective framework for approximating irregular functions in high dimensions, offering theoretical guarantees and practical performance improvements over existing methods, with demonstrated success in real-world optimization problems.

Abstract: A new Kolmogorov-Arnold network (KAN) is proposed to approximate potentially irregular functions in high dimensions. We provide error bounds for this approximation, assuming that the Kolmogorov-Arnold expansion functions are sufficiently smooth. When the function is only continuous, we also provide universal approximation theorems. We show that it outperforms multilayer perceptrons in terms of accuracy and convergence speed. We also compare it with several proposed KAN networks: it outperforms all networks for irregular functions and achieves similar accuracy to the original spline-based KAN network for smooth functions. Finally, we compare some of the KAN networks in optimizing a French hydraulic valley.

[284] TOBUGraph: Knowledge Graph-Based Retrieval for Enhanced LLM Performance Beyond RAG

Savini Kashmira, Jayanaka L. Dantanarayana, Joshua Brodsky, Ashish Mahendra, Yiping Kang, Krisztian Flautner, Lingjia Tang, Jason Mars

Main category: cs.LG

TL;DR: TOBUGraph is a graph-based retrieval framework that outperforms traditional RAG by constructing knowledge graphs from unstructured data using LLMs, enabling more accurate retrieval through graph traversal instead of text similarity.

Details

Motivation: RAG faces limitations in commercial use due to reliance on text-to-text similarity, inability to capture deep semantic relationships, sensitivity to chunking strategies, and hallucination issues.

Method: Constructs knowledge graphs from unstructured data dynamically using LLMs to extract structured knowledge and diverse relationships, then performs retrieval through graph traversal.

Result: Outperforms multiple RAG implementations in precision and recall, eliminates chunking configuration needs, reduces hallucinations, and improves user experience in real-world applications.

Conclusion: Graph-based retrieval with structured knowledge extraction provides superior performance over traditional RAG methods for commercial applications.

Abstract: Retrieval-Augmented Generation (RAG) is one of the leading and most widely used techniques for enhancing LLM retrieval capabilities, but it still faces significant limitations in commercial use cases. RAG primarily relies on the query-chunk text-to-text similarity in the embedding space for retrieval and can fail to capture deeper semantic relationships across chunks, is highly sensitive to chunking strategies, and is prone to hallucinations. To address these challenges, we propose TOBUGraph, a graph-based retrieval framework that first constructs the knowledge graph from unstructured data dynamically and automatically. Using LLMs, TOBUGraph extracts structured knowledge and diverse relationships among data, going beyond RAG’s text-to-text similarity. Retrieval is achieved through graph traversal, leveraging the extracted relationships and structures to enhance retrieval accuracy, eliminating the need for chunking configurations while reducing hallucination. We demonstrate TOBUGraph’s effectiveness in TOBU, a real-world application in production for personal memory organization and retrieval. Our evaluation using real user data demonstrates that TOBUGraph outperforms multiple RAG implementations in both precision and recall, significantly improving user experience through improved retrieval accuracy.

[285] Learning to Learn with Contrastive Meta-Objective

Shiguang Wu, Yaqing Wang, Yatao Bian, Quanming Yao

Main category: cs.LG

TL;DR: ConML enhances meta-learning by using task identity as additional supervision through contrastive learning of model representations, improving performance across various meta-learners with minimal implementation cost.

Details

Motivation: To improve meta-learning generalizability by exploiting task identity as additional supervision, inspired by human fast learning capabilities involving alignment and discrimination.

Method: Proposes ConML framework that adds contrastive meta-objective to existing meta-training, contrasting model representations to leverage task identity information.

Result: ConML integrates seamlessly with existing meta-learners and in-context learning models, bringing significant performance improvements with small implementation cost.

Conclusion: Task identity can be effectively used as additional supervision in meta-training through contrastive learning, enhancing meta-learning performance across various approaches.

Abstract: Meta-learning enables learning systems to adapt quickly to new tasks, similar to humans. Different meta-learning approaches all work under the mini-batch episodic training framework. Such a framework naturally gives the information about task identity, which can serve as additional supervision for meta-training to improve generalizability. We propose to exploit task identity as additional supervision in meta-training, inspired by the alignment and discrimination ability which is intrinsic in humans' fast learning. This is achieved by contrasting what meta-learners learn, i.e., model representations. The proposed ConML evaluates and optimizes the contrastive meta-objective under a problem- and learner-agnostic meta-training framework. We demonstrate that ConML integrates seamlessly with existing meta-learners, as well as in-context learning models, and brings a significant boost in performance with small implementation cost.
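
One way to picture "contrasting model representations" is the sketch below. It assumes each task yields several representation vectors (for example, from models adapted on different subsets of that task) and scores them by mean within-task distance minus mean between-task distance. The function names, the Euclidean distance, and the exact form of the objective are assumptions for illustration; the abstract does not specify them.

```python
import math

def distance(a, b):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_meta_objective(reps_by_task):
    """Mean within-task distance minus mean between-task distance (lower is better).

    reps_by_task maps a task id to a list of representation vectors.
    Assumes at least two tasks and at least two representations per task.
    """
    inner, outer = [], []
    tasks = list(reps_by_task)
    for i, task in enumerate(tasks):
        reps = reps_by_task[task]
        # Alignment term: representations of the same task should be close.
        for a in range(len(reps)):
            for b in range(a + 1, len(reps)):
                inner.append(distance(reps[a], reps[b]))
        # Discrimination term: representations of different tasks should be far apart.
        for other in tasks[i + 1:]:
            for ra in reps:
                for rb in reps_by_task[other]:
                    outer.append(distance(ra, rb))
    return sum(inner) / len(inner) - sum(outer) / len(outer)
```

In a training loop this quantity would be added to the usual episodic loss with some weight, so that task identity acts as an extra supervision signal.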

[286] Cognitive Edge Computing: A Comprehensive Survey on Optimizing Large Models and AI Agents for Pervasive Deployment

Xubin Wang, Qing Li, Weijia Jia

Main category: cs.LG

TL;DR: Cognitive Edge Computing enables deployment of reasoning-capable LLMs and AI agents on resource-constrained edge devices through model optimization, system architecture, and adaptive intelligence techniques.

Details

Motivation: To enable practical deployment of advanced AI capabilities like multi-step reasoning and autonomous agents on edge devices with limited computational resources, memory, and energy constraints.

Method: A unified framework combining model optimization (quantization, sparsity, LoRA, distillation), system architecture (on-device inference, elastic offloading, cloud-edge collaboration), and adaptive intelligence (context compression, dynamic routing, federated personalization).

Result: Synthesizes advances in efficient Transformer design, multimodal integration, hardware-aware compilation, privacy-preserving learning, and agentic tool use, mapping them to edge-specific operating envelopes with standardized evaluation protocols.

Conclusion: Cross-layer co-design of algorithms, runtime, and hardware is essential for delivering reliable, efficient, and privacy-preserving cognitive capabilities on edge devices, with remaining challenges in benchmarks, energy reporting, safety evaluation, and multi-agent testbeds.

Abstract: This article surveys Cognitive Edge Computing as a practical and methodical pathway for deploying reasoning-capable Large Language Models (LLMs) and autonomous AI agents on resource-constrained devices at the network edge. We present a unified, cognition-preserving framework spanning: (1) model optimization (quantization, sparsity, low-rank adaptation, distillation) aimed at retaining multi-step reasoning under tight memory/compute budgets; (2) system architecture (on-device inference, elastic offloading, cloud-edge collaboration) that trades off latency, energy, privacy, and capacity; and (3) adaptive intelligence (context compression, dynamic routing, federated personalization) that tailors computation to task difficulty and device constraints. We synthesize advances in efficient Transformer design, multimodal integration, hardware-aware compilation, privacy-preserving learning, and agentic tool use, and map them to edge-specific operating envelopes. We further outline a standardized evaluation protocol covering latency, throughput, energy per token, accuracy, robustness, privacy, and sustainability, with explicit measurement assumptions to enhance comparability. Remaining challenges include modality-aware reasoning benchmarks, transparent and reproducible energy reporting, edge-oriented safety/alignment evaluation, and multi-agent testbeds. We conclude with practitioner guidelines for cross-layer co-design of algorithms, runtime, and hardware to deliver reliable, efficient, and privacy-preserving cognitive capabilities on edge devices.

[287] LoKO: Low-Rank Kalman Optimizer for Online Fine-Tuning of Large Models

Hossein Abdi, Mingfei Sun, Andi Zhang, Samuel Kaski, Wei Pan

Main category: cs.LG

TL;DR: LoKO is a new optimizer that treats PEFT as an optimal filtering problem, using a Kalman filter to estimate trainable parameters online with reduced computational complexity.

Details

Motivation: Training large models from scratch is computationally expensive, and current PEFT methods like LoRA still rely on gradient-based optimizers.

Method: Casts PEFT as an optimal filtering problem, uses the low-rank decomposition from LoRA to reduce matrix sizes in Kalman iterations, and applies a diagonal approximation of the covariance matrix to reduce complexity from quadratic to linear in the number of trainable parameters.

Result: LoKO converges with fewer iterations and yields better performance than commonly used optimizers with LoRA in both image classification and language tasks.

Conclusion: The Kalman filter can be leveraged as an effective optimizer for online fine-tuning of large models, opening new possibilities for PEFT methods.

Abstract: Training large models with millions or even billions of parameters from scratch incurs substantial computational costs. Parameter Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), address this challenge by adapting only a reduced number of parameters to specific tasks with gradient-based optimizers. In this paper, we cast PEFT as an optimal filtering/state estimation problem and present Low-Rank Kalman Optimizer (LoKO) to estimate the optimal trainable parameters in an online manner. We leverage the low-rank decomposition in LoRA to significantly reduce matrix sizes in Kalman iterations and further capitalize on a diagonal approximation of the covariance matrix to effectively decrease computational complexity from quadratic to linear in the number of trainable parameters. Moreover, we discovered that the initialization of the covariance matrix within the Kalman algorithm and the accurate estimation of the observation noise covariance are the keys in this formulation, and we propose robust approaches that work well across a vast range of well-established computer vision and language models. Our results show that LoKO converges with fewer iterations and yields better-performing models compared to commonly used optimizers with LoRA in both image classification and language tasks. Our study opens up the possibility of leveraging the Kalman filter as an effective optimizer for the online fine-tuning of large models.
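
The role of the diagonal covariance approximation can be sketched on a toy problem. The code below is a generic diagonal Kalman update that treats the weights of a scalar linear regression as the state to estimate; it is not LoKO itself, it involves no LoRA factors, and the function names, noise levels, and initial covariance are assumptions chosen for illustration.

```python
import random

def diag_kalman_step(theta, p_diag, x, y, obs_noise):
    """One Kalman update for the observation y ~ x . theta, keeping only diag(P)."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    # Innovation variance S = H P H^T + R, with H = x and P diagonal.
    s = sum(p * xi * xi for p, xi in zip(p_diag, x)) + obs_noise
    gain = [p * xi / s for p, xi in zip(p_diag, x)]
    new_theta = [t + k * (y - pred) for t, k in zip(theta, gain)]
    # Diagonal of (I - K H) P; off-diagonal terms are discarded.
    new_p = [p * (1.0 - k * xi) for p, k, xi in zip(p_diag, gain, x)]
    return new_theta, new_p

def fit(true_theta, steps=2000, seed=0):
    """Stream noisy samples one at a time and filter the weights online."""
    rng = random.Random(seed)
    theta = [0.0] * len(true_theta)
    p_diag = [1.0] * len(true_theta)  # assumed initial covariance
    for _ in range(steps):
        x = [rng.gauss(0, 1) for _ in true_theta]
        y = sum(t * xi for t, xi in zip(true_theta, x)) + rng.gauss(0, 0.1)
        theta, p_diag = diag_kalman_step(theta, p_diag, x, y, obs_noise=0.01)
    return theta
```

Each step costs time and memory linear in the number of parameters, because only one variance per parameter is stored instead of a full covariance matrix. The initial `p_diag` and `obs_noise` play the role of the two quantities the abstract singles out as critical.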

[288] AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science

Qiuhai Zeng, Claire Jin, Xinyue Wang, Yuhan Zheng, Qunhua Li

Main category: cs.LG

TL;DR: AIRepr is an automated framework that evaluates and improves the reproducibility of LLM-generated data analysis workflows using statistical principles and novel prompting strategies.

Details

Motivation: LLMs are increasingly used for automated data analysis, but multiple valid solutions exist for data science tasks, making it critical to understand the reasoning behind analyses. Manual review is labor-intensive, so scalable workflow evaluation is needed.

Method: Developed AIRepr framework with two novel reproducibility-enhancing prompting strategies, benchmarked against standard prompting across 15 analyst-inspector LLM pairs and 1,032 tasks from three public benchmarks.

Result: Workflows with higher reproducibility also yield more accurate analyses, and reproducibility-enhancing prompts substantially improve both reproducibility and accuracy metrics.

Conclusion: AIRepr provides a foundation for transparent, reliable, and efficient human-AI collaboration in data science by enabling automated assessment of LLM-generated workflow reproducibility.

Abstract: Large language models (LLMs) are increasingly used to automate data analysis through executable code generation. Yet, data science tasks often admit multiple statistically valid solutions, e.g., different modeling strategies, making it critical to understand the reasoning behind analyses, not just their outcomes. While manual review of LLM-generated code can help ensure statistical soundness, it is labor-intensive and requires expertise. A more scalable approach is to evaluate the underlying workflows, that is, the logical plans guiding code generation. However, it remains unclear how to assess whether an LLM-generated workflow supports reproducible implementations. To address this, we present AIRepr, an Analyst-Inspector framework for automatically evaluating and improving the reproducibility of LLM-generated data analysis workflows. Our framework is grounded in statistical principles and supports scalable, automated assessment. We introduce two novel reproducibility-enhancing prompting strategies and benchmark them against standard prompting across 15 analyst-inspector LLM pairs and 1,032 tasks from three public benchmarks. Our findings show that workflows with higher reproducibility also yield more accurate analyses, and that reproducibility-enhancing prompts substantially improve both metrics. This work provides a foundation for transparent, reliable, and efficient human-AI collaboration in data science. Our code is publicly available.

[289] ProFL: Performative Robust Optimal Federated Learning

Xue Zheng, Tian Xie, Xuwei Tan, Aylin Yener, Xueru Zhang

Main category: cs.LG

TL;DR: Proposes the ProFL algorithm for finding performative optimal points in federated learning from noisy data, overcoming limitations of prior methods that required convex objectives and noiseless data.

Details

Motivation: Address model-induced distribution shifts in federated learning where deployed models affect data generation, causing deviations from original distributions. Prior methods only achieved performative stable points (not optimal) and required unrealistic assumptions.

Method: Develops the Performative Robust Optimal Federated Learning (ProFL) algorithm, which handles noisy and contaminated data, with convergence analysis under the Polyak-Lojasiewicz condition for non-convex objectives.

Result: Extensive experiments on multiple datasets demonstrate ProFL’s advantage over state-of-the-art methods in achieving performative optimal points.

Conclusion: ProFL successfully finds performative optimal points in federated learning under realistic conditions with noisy data and non-convex objectives, overcoming previous limitations.

Abstract: Performative prediction is a framework that captures distribution shifts that occur during the training of machine learning models due to their deployment. As the trained model is used, data generation causes the model to evolve, leading to deviations from the original data distribution. The impact of such model-induced distribution shifts in federated learning is increasingly likely to transpire in real-life use cases. A recently proposed approach extends performative prediction to federated learning with the resulting model converging to a performative stable point, which may be far from the performative optimal point. Earlier research in centralized settings has shown that the performative optimal point can be achieved under model-induced distribution shifts, but these approaches require the performative risk to be convex and the training data to be noiseless, assumptions often violated in realistic federated learning systems. This paper overcomes all of these shortcomings and proposes Performative Robust Optimal Federated Learning, an algorithm that finds performative optimal points in federated learning from noisy and contaminated data. We present the convergence analysis under the Polyak-Lojasiewicz condition, which applies to non-convex objectives. Extensive experiments on multiple datasets demonstrate the advantage of Performative Robust Optimal Federated Learning over the state-of-the-art.

[290] Stochastic Approximation with Unbounded Markovian Noise: A General-Purpose Theorem

Shaan Ul Haque, Siva Theja Maguluri

Main category: cs.LG

TL;DR: Establishes finite-time bounds for TD learning with linear function approximation in unbounded state spaces, achieving optimal O(1/ε²) sample complexity, and provides a general SA theorem applicable to Q-learning and distributed optimization.

Details

Motivation: Addresses engineering applications like resource allocation and inventory systems requiring reinforcement learning with unbounded state spaces and reward functions.

Method: Develops a general theorem for non-linear Stochastic Approximation with Lyapunov functions and drift conditions, applied to TD learning, Q-learning, and distributed optimization with cyclic block coordinate descent.

Result: Achieves optimal sample complexity for TD learning, improves Q-learning bounds, and establishes first finite-time bounds for distributed stochastic optimization with cyclic block coordinate descent.

Conclusion: The general SA theorem provides a powerful black-box tool for extending sample guarantees from i.i.d. noise to unbounded Markovian noise, enabling broad applications in reinforcement learning and optimization.

Abstract: Motivated by engineering applications such as resource allocation in networks and inventory systems, we consider average-reward Reinforcement Learning with unbounded state space and reward function. Recent works studied this problem in the actor-critic framework and established finite sample bounds assuming access to a critic with certain error guarantees. We complement their work by studying Temporal Difference (TD) learning with linear function approximation and establishing finite-time bounds with the optimal $\mathcal{O}\left(1/\epsilon^2\right)$ sample complexity. These results are obtained using the following general-purpose theorem for non-linear Stochastic Approximation (SA). Suppose that one constructs a Lyapunov function for a non-linear SA with a certain drift condition. Then, our theorem establishes finite-time bounds when this SA is driven by unbounded Markovian noise under suitable conditions. It serves as a black-box tool to generalize sample guarantees on SA from the i.i.d. or martingale difference case to potentially unbounded Markovian noise. The generality and the mild assumptions of the setup enable broad applicability of our theorem. We illustrate its power by studying two more systems: (i) We improve upon the finite-time bounds of $Q$-learning by tightening the error bounds and also allowing for a larger class of behavior policies. (ii) We establish the first-ever finite-time bounds for distributed stochastic optimization of a high-dimensional smooth strongly convex function using cyclic block coordinate descent.

[291] Large language models as uncertainty-calibrated optimizers for experimental discovery

Bojana Ranković, Ryan-Rhys Griffiths, Philippe Schwaller

Main category: cs.LG

TL;DR: Training language models with uncertainty-aware objectives enables reliable optimization using natural language, transforming LLM overconfidence into precise calibration and nearly doubling discovery rates in chemical synthesis.

Details

Motivation: Current optimization methods force a choice between domain knowledge (from LLMs) and reliability (from traditional methods), with no principled approach that provides both.

Method: Training language models through uncertainty-aware objectives of traditional optimization methods, teaching LLMs from experimental outcomes under uncertainty.

Result: Nearly doubled discovery rate of high-yielding reaction conditions from 24% to 43% in Buchwald-Hartwig reactions, and ranked first on average across 19 diverse optimization problems spanning multiple scientific domains.

Conclusion: Ensuring reliability through principled uncertainty quantification is critical for realizing AI-guided experimentation’s full potential, establishing a new paradigm for reliable optimization with LLMs.

Abstract: Scientific discovery increasingly depends on efficient experimental optimization to navigate vast design spaces under time and resource constraints. Traditional approaches often require extensive domain expertise and feature engineering. While large language models, with their vast scientific knowledge, circumvent the feature engineering limitations, they lack the calibrated uncertainty estimates required for high-stakes decision making. Hence, current optimization methods force a choice between domain knowledge and reliability, with no principled approach that affords both. In this work, we show that training language models through the uncertainty-aware objectives of traditional optimization methods enables their use as reliable optimizers guided by natural language. By teaching LLMs from experimental outcomes under uncertainty, we transform their overconfidence from a fundamental limitation into a precise calibration mechanism. Applied to Buchwald-Hartwig reactions, a cornerstone of pharmaceutical synthesis, our method nearly doubles the discovery rate of high-yielding reaction conditions, from 24% to 43% in 50 experimental iterations starting from 10 unsuccessful conditions. Across 19 diverse optimization problems spanning organic synthesis, materials science and catalysis, process chemistry, and molecular design, our approach ranks first on average, establishing a new paradigm for reliable, uncertainty-guided optimization with LLMs. Our approach can accelerate discovery by lowering the barrier to using powerful optimization methods, replacing the need for domain-specific feature engineering with more accessible natural language interfaces. These findings highlight that ensuring reliability through principled uncertainty quantification is critical for realizing the full potential of AI-guided experimentation.

[292] Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Main category: cs.LG

TL;DR: AnytimeReasoner is a framework that optimizes anytime reasoning performance in LLMs by introducing verifiable dense rewards and decoupled policy optimization, improving token efficiency and flexibility under varying budget constraints.

Details

Motivation: Existing RL approaches for scaling test-time compute only optimize final performance under fixed token budgets, which limits training and deployment efficiency. The paper aims to improve token efficiency and reasoning flexibility under varying budget constraints.

Method: The framework truncates complete thinking processes to fit sampled token budgets, forcing models to summarize optimal answers for verification. It uses verifiable dense rewards for better credit assignment, decouples thinking and summary policy optimization, and introduces Budget Relative Policy Optimization (BRPO) for variance reduction.

Result: Empirical results in mathematical reasoning tasks show the method consistently outperforms GRPO across all thinking budgets under various prior distributions, improving both training and token efficiency.

Conclusion: AnytimeReasoner successfully enhances reasoning capabilities by optimizing anytime performance, providing better token efficiency and flexibility compared to existing approaches that only focus on final performance under fixed budgets.

Abstract: Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within sampled token budgets from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking trace for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
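
The truncate-then-verify idea can be sketched as a prior-weighted sum of rewards over token budgets. In the code below, `summarize_and_verify` stands in for the summary policy plus answer checker and is a hypothetical callable; the budgets, the prior, and the toy trace are likewise assumptions, and BRPO's variance reduction is not modeled.

```python
def anytime_objective(thinking_tokens, budgets, prior, summarize_and_verify):
    """Prior-weighted sum of verifiable rewards over truncated thinking traces."""
    total = 0.0
    for budget, weight in zip(budgets, prior):
        truncated = thinking_tokens[:budget]  # cut the trace at this token budget
        total += weight * summarize_and_verify(truncated)
    return total

def toy_verifier(truncated_tokens):
    """Hypothetical checker: reward 1.0 once the correct answer appears in the trace."""
    return 1.0 if "answer=42" in truncated_tokens else 0.0

# A toy 10-token trace in which the answer first appears at position 6.
trace = ["step"] * 5 + ["answer=42"] + ["check"] * 4
budgets = [4, 8, 10]
uniform_prior = [1 / 3, 1 / 3, 1 / 3]
score = anytime_objective(trace, budgets, uniform_prior, toy_verifier)
```

A final-only reward would score this trace identically no matter when the answer appears. The budget-weighted score is higher for traces that reach a correct answer earlier, which is the dense signal the abstract describes.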

[293] Multiplayer Federated Learning: Reaching Equilibrium with Less Communication

TaeHo Yoon, Sayantan Choudhury, Nicolas Loizou

Main category: cs.LG

TL;DR: Introduces Multiplayer Federated Learning (MpFL) framework modeling clients as game-theoretic players with individual objectives, and proposes PEARL-SGD algorithm that achieves equilibrium with reduced communication.

Details

Motivation: Traditional FL assumes collaborative clients with aligned objectives, but real-world clients are rational players with individual goals and strategic behaviors that existing FL frameworks cannot address.

Method: Proposes PEARL-SGD algorithm where each player performs local stochastic gradient descent independently and periodically communicates with other players in a game-theoretic context.

Result: Theoretical analysis proves PEARL-SGD reaches a neighborhood of equilibrium with less communication compared to non-local methods, and numerical experiments verify these findings.

Conclusion: MpFL framework and PEARL-SGD algorithm effectively address the strategic behavior of rational clients in FL, achieving equilibrium with improved communication efficiency.

Abstract: Traditional Federated Learning (FL) approaches assume collaborative clients with aligned objectives working towards a shared global model. However, in many real-world scenarios, clients act as rational players with individual objectives and strategic behaviors, a concept that existing FL frameworks are not equipped to adequately address. To bridge this gap, we introduce Multiplayer Federated Learning (MpFL), a novel framework that models the clients in the FL environment as players in a game-theoretic context, aiming to reach an equilibrium. In this scenario, each player tries to optimize their own utility function, which may not align with the collective goal. Within MpFL, we propose Per-Player Local Stochastic Gradient Descent (PEARL-SGD), an algorithm in which each player/client performs local updates independently and periodically communicates with other players. We theoretically analyze PEARL-SGD and prove that it reaches a neighborhood of equilibrium with less communication in the stochastic setup compared to its non-local counterpart. Finally, we verify our theoretical findings through numerical experiments.
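
The local-update-then-communicate pattern can be sketched on a toy two-player game. The quadratic objectives below are an assumption made for illustration (their unique equilibrium is x = (1.2, -0.4)), and the function name and hyperparameters are hypothetical; the code follows the pattern described in the abstract rather than reproducing the authors' implementation.

```python
import random

def pearl_sgd_sketch(rounds=200, local_steps=5, lr=0.1, noise=0.0, seed=0):
    """Each player runs local SGD on its own variable between synchronizations."""
    rng = random.Random(seed)
    x = [0.0, 0.0]  # x[i] is player i's own strategy
    # Gradient of each player's objective with respect to its own variable:
    #   player 0 minimizes 0.5 * (x0 - 1)^2 + 0.5 * x0 * x1
    #   player 1 minimizes 0.5 * (x1 + 1)^2 - 0.5 * x0 * x1
    grads = [
        lambda own, other: own - 1.0 + 0.5 * other,
        lambda own, other: own + 1.0 - 0.5 * other,
    ]
    for _round in range(rounds):
        snapshot = list(x)  # strategies exchanged at the last communication
        for i in (0, 1):
            own, other = snapshot[i], snapshot[1 - i]
            for _step in range(local_steps):
                # The opponent's strategy stays frozen during local steps.
                g = grads[i](own, other) + noise * rng.gauss(0, 1)
                own -= lr * g
            x[i] = own
    return x
```

With `noise=0.0` the iterates converge to the equilibrium. With `noise > 0` they hover in a neighborhood of it, which matches the kind of guarantee the abstract states. Increasing `local_steps` reduces how often players must exchange strategies.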

[294] Generating Computational Cognitive Models using Large Language Models

Milena Rmus, Akshay K. Jagadish, Marvin Mathony, Tobias Ludwig, Eric Schulz

Main category: cs.LG

TL;DR: GeCCo pipeline uses LLMs to automatically generate computational cognitive models that match or outperform handcrafted models across multiple cognitive domains.

Details

Motivation: Traditional cognitive model development requires extensive domain expertise and manual effort. LLMs offer potential to automate this process through their pattern recognition and code generation capabilities.

Method: Given task instructions, participant data, and a template function, the GeCCo pipeline prompts an LLM to propose candidate models, fits them to held-out data, and iteratively refines them based on predictive performance feedback.

Result: LLM-generated models consistently matched or outperformed best domain-specific models from cognitive science literature across four cognitive domains (decision making, learning, planning, memory) using three different LLMs.

Conclusion: LLMs can generate cognitive models with conceptually plausible theories that rival or surpass the best literature models across diverse task domains, demonstrating the potential of automated cognitive model generation.

Abstract: Computational cognitive models, which formalize theories of cognition, enable researchers to quantify cognitive processes and arbitrate between competing theories by fitting models to behavioral data. Traditionally, these models are handcrafted, which requires significant domain knowledge, coding expertise, and time investment. However, recent advances in machine learning offer solutions to these challenges. In particular, Large Language Models (LLMs) have demonstrated remarkable capabilities for in-context pattern recognition, leveraging knowledge from diverse domains to solve complex problems, and generating executable code that can be used to facilitate the generation of cognitive models. Building on this potential, we introduce a pipeline for Guided generation of Computational Cognitive Models (GeCCo). Given task instructions, participant data, and a template function, GeCCo prompts an LLM to propose candidate models, fits proposals to held-out data, and iteratively refines them based on feedback constructed from their predictive performance. We benchmark this approach across four different cognitive domains – decision making, learning, planning, and memory – using three open-source LLMs, spanning different model sizes, capacities, and families. On four human behavioral data sets, the LLM generated models that consistently matched or outperformed the best domain-specific models from the cognitive science literature. Taken together, our results suggest that LLMs can generate cognitive models with conceptually plausible theories that rival – or even surpass – the best models from the literature across diverse task domains.

[295] Inference-Time Hyper-Scaling with KV Cache Compression

Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti

Main category: cs.LG

TL;DR: Inference-time hyper-scaling improves reasoning accuracy by compressing the KV cache to generate more tokens within the same compute budget, using Dynamic Memory Sparsification for efficient compression.

Details

Motivation: Transformer LLMs are bottlenecked by KV cache size rather than token count, so compressing the KV cache enables generating more tokens within the same compute budget for improved accuracy.

Method: Dynamic Memory Sparsification (DMS) - a novel KV cache sparsification method that delays token eviction, implicitly merges representations, and requires only 1K training steps for 8x compression.

Result: DMS maintains better accuracy than training-free sparse attention, boosts Qwen-R1 32B by 12.0 points on AIME 24, 8.6 on GPQA, and 9.7 on LiveCodeBench for equivalent memory reads.

Conclusion: Inference-time hyper-scaling with DMS effectively improves reasoning accuracy while maintaining comparable inference latency and memory load, making scaled inference more practical.

Abstract: Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference latency and memory load. For instance, we enhance Qwen-R1 32B by 12.0 points on AIME 24, 8.6 on GPQA, and 9.7 on LiveCodeBench on average for an equivalent number of memory reads.
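
The budget argument can be made concrete with back-of-the-envelope arithmetic. The model shape below (64 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical configuration, not taken from the paper, and the calculation assumes the compression ratio applies uniformly to every layer and head.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Uncompressed KV cache footprint of one token."""
    # Factor 2: one key vector and one value vector per head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def tokens_within_budget(budget_bytes, per_token_bytes, compression_ratio=1):
    """Tokens whose cache fits in the budget at a given compression ratio."""
    return budget_bytes * compression_ratio // per_token_bytes

per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)
budget = 1 << 30  # 1 GiB reserved for the KV cache
baseline_tokens = tokens_within_budget(budget, per_token)
compressed_tokens = tokens_within_budget(budget, per_token, compression_ratio=8)
```

Under these assumptions the same memory budget holds eight times as many tokens at 8x compression, which can be spent on longer or more parallel reasoning traces.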

[296] Rethinking Approximate Gaussian Inference in Classification

Bálint Mucsányi, Nathaël Da Costa, Philipp Hennig

Main category: cs.LG

TL;DR: The paper proposes replacing softmax with normCDF or sigmoid activations to enable sampling-free approximation of predictive distributions for uncertainty quantification, eliminating the computational overhead of Monte Carlo methods.

Details

Motivation: Softmax functions only capture aleatoric uncertainty, and existing methods for epistemic uncertainty require costly Monte Carlo approximations that are noisy and computationally expensive.

Method: Replace softmax with element-wise normCDF or sigmoid activations, enabling accurate sampling-free approximation of predictives and approximating Gaussian pushforwards by Dirichlet distributions using moment matching.

Result: The approach eliminates runtime and memory overhead of MC sampling, and when combined with Gaussian inference methods (Laplace, HET, SNGP), shows improved uncertainty quantification on ImageNet, CIFAR-100, and CIFAR-10 datasets.

Conclusion: Using normCDF or sigmoid instead of softmax enables efficient, sampling-free uncertainty quantification that outperforms softmax-based Monte Carlo methods while being computationally cheaper.

Abstract: In classification tasks, softmax functions are ubiquitously used as output activations to produce predictive probabilities. Such outputs only capture aleatoric uncertainty. To capture epistemic uncertainty, approximate Gaussian inference methods have been proposed. We develop a common formalism to describe such methods, which we view as outputting Gaussian distributions over the logit space. Predictives are then obtained as the expectations of the Gaussian distributions pushed forward through the softmax. However, such softmax Gaussian integrals cannot be solved analytically, and Monte Carlo (MC) approximations can be costly and noisy. We propose to replace the softmax activation by element-wise normCDF or sigmoid, which allows for the accurate sampling-free approximation of predictives. This also enables the approximation of the Gaussian pushforwards by Dirichlet distributions with moment matching. This approach entirely eliminates the runtime and memory overhead associated with MC sampling. We evaluate it combined with several approximate Gaussian inference methods (Laplace, HET, SNGP) on large- and small-scale datasets (ImageNet, CIFAR-100, CIFAR-10), demonstrating improved uncertainty quantification capabilities compared to softmax MC sampling. Our code is available at https://github.com/bmucsanyi/probit.
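
The reason a normCDF activation admits a sampling-free predictive can be seen from a standard Gaussian identity: for a scalar logit z ~ N(mu, sigma^2), E[Phi(z)] = Phi(mu / sqrt(1 + sigma^2)). The sketch below checks this scalar identity against Monte Carlo sampling. The function names are illustrative, and the code does not reproduce the paper's multi-class construction or the Dirichlet moment matching.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit_gaussian_mean(mu, sigma):
    """Closed form of E[Phi(z)] for z ~ N(mu, sigma^2)."""
    return norm_cdf(mu / math.sqrt(1.0 + sigma ** 2))

def monte_carlo_mean(mu, sigma, samples=100_000, seed=0):
    """Sampling-based estimate of the same expectation, for comparison."""
    rng = random.Random(seed)
    return sum(norm_cdf(rng.gauss(mu, sigma)) for _ in range(samples)) / samples
```

The closed form needs one function evaluation per logit, whereas the Monte Carlo estimate needs many samples and still carries sampling noise. No comparable closed form exists for a Gaussian pushed through the softmax, as the abstract notes.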

[297] RCCDA: Adaptive Model Updates in the Presence of Concept Drift under a Constrained Resource Budget

Adam Piaseczny, Md Kamran Chowdhury Shisher, Shiqiang Wang, Christopher G. Brinton

Main category: cs.LG

TL;DR: RCCDA is a dynamic model update policy that optimizes ML training under concept drift while ensuring strict resource constraints, using only past loss information and a tunable drift threshold.

Details

Motivation: Existing solutions for concept drift adaptation have high computational overhead and lack strict resource guarantees, making them unsuitable for resource-constrained environments.

Method: Uses Lyapunov drift-plus-penalty framework to create a lightweight greedy-optimal policy that analytically characterizes model loss evolution under concept drift and limits update frequency/cost.

Result: Outperforms baseline methods in inference accuracy while adhering to strict resource constraints across four domain generalization datasets under various concept drift schedules.

Conclusion: RCCDA provides a provably resource-efficient solution for real-time ML deployments facing concept drift, with theoretical performance assurances and practical effectiveness.

Abstract: Machine learning (ML) algorithms deployed in real-world environments often face the challenge of adapting models to concept drift, where the task data distributions shift over time. The problem becomes even more difficult when model performance must be maintained while adhering to strict resource constraints. Existing solutions often depend on drift-detection methods that incur high computational overhead in resource-constrained environments and fail to provide strict guarantees on resource usage or theoretical performance assurances. To address these shortcomings, we propose RCCDA: a dynamic model update policy that optimizes ML training dynamics while ensuring compliance with predefined resource constraints, utilizing only past loss information and a tunable drift threshold. In developing our policy, we analytically characterize the evolution of model loss under concept drift with arbitrary training update decisions. Integrating these results into a Lyapunov drift-plus-penalty framework produces a lightweight greedy-optimal policy that provably limits update frequency and cost. Experimental results on four domain generalization datasets demonstrate that our policy outperforms baseline methods in inference accuracy while adhering to strict resource constraints under several schedules of concept drift, making our solution uniquely suited for real-time ML deployments.
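
A generic drift-plus-penalty rule of the kind the abstract alludes to can be sketched with a virtual queue that tracks how far spending has run ahead of the budget. This is not the RCCDA policy itself: the decision rule, the parameter names (`v`, `drift_threshold`, `budget_per_step`) and the fixed loss trace are all assumptions made for illustration, and in this toy version the losses do not react to updates.

```python
def run_policy(losses, budget_per_step, update_cost=1.0, drift_threshold=0.5, v=1.0):
    """Decide at each step whether to retrain, using only the observed loss.

    A virtual queue grows when an update is paid for and drains by the
    per-step budget; a long queue makes further updates less attractive.
    """
    queue = 0.0
    decisions = []
    for loss in losses:
        # Weighted benefit of updating versus the queue-weighted cost.
        update = v * (loss - drift_threshold) > queue * update_cost
        decisions.append(update)
        spent = update_cost if update else 0.0
        queue = max(queue + spent - budget_per_step, 0.0)
    return decisions


# Hypothetical trace: loss stays high, budget allows one update every four steps.
decisions = run_policy([1.0] * 100, budget_per_step=0.25)
print(sum(decisions), "updates in", len(decisions), "steps")
```

The queue is what enforces the budget over time: after each paid update it must drain before the loss-based benefit can trigger another one.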

[298] Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering

Sai Prasanna Teja Reddy Bogireddy, Abrar Majeedi, Viswanatha Reddy Gajjala, Zhuoyan Xu, Siddhant Rai, Vaishnav Potlapalli

Main category: cs.LG

TL;DR: Neural placed second in the BioNLP 2025 ArchEHR-QA shared task by decoupling clinical QA into evidence identification and answer synthesis, using DSPy’s MIPROv2 optimizer for prompt optimization and achieving an overall score of 51.5.

Details

Motivation: To bridge information gaps in clinical settings through automated QA over EHRs that requires precise evidence retrieval and faithful answer generation with limited supervision.

Method: Decouples task into sentence-level evidence identification and answer synthesis with citations; uses DSPy’s MIPROv2 optimizer for automatic prompt space exploration and self-consistency voting for improved evidence recall.

Result: Achieved overall score of 51.5 on hidden test set, placing second and outperforming standard zero-shot and few-shot prompting by over 20 and 10 points respectively.

Conclusion: Data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing AI reliability in healthcare.

Abstract: Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
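
The self-consistency voting step can be pictured as a vote over the sentence IDs returned by several independent runs of the evidence-identification prompt. The sketch below is a minimal illustration under that assumption; the vote threshold, the function name and the example runs are hypothetical and not taken from the paper.

```python
from collections import Counter


def vote_evidence(runs, min_votes):
    """Keep sentence IDs selected by at least `min_votes` of the runs."""
    # Each run counts a sentence at most once, hence the set().
    counts = Counter(sid for run in runs for sid in set(run))
    return sorted(sid for sid, c in counts.items() if c >= min_votes)


# Hypothetical outputs of five sampled runs over the same clinical note.
runs = [[1, 3, 5], [1, 3], [3, 4], [1, 3, 5], [2, 3]]
selected = vote_evidence(runs, min_votes=3)
print(selected)
```

Lowering `min_votes` favors recall, while raising it favors precision, which is the trade-off the voting scheme is meant to balance.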

[299] Two-stage hybrid models for enhancing forecasting accuracy on heterogeneous time series

Junru Ren, Shaomin Wu

Main category: cs.LG

TL;DR: Proposes a two-stage framework combining global and local time series models to handle heterogeneous datasets, achieving superior performance over state-of-the-art methods.

Details

Motivation: Global time series models (tsGMs) can improve forecasting accuracy but may fail on heterogeneous datasets, while increasing model complexity risks overfitting. The definition of data homogeneity remains ambiguous in literature.

Method: Two-stage modeling framework: Stage 1 uses a tsGM to identify homogeneous patterns; Stage 2 uses local models (tsLMs like ARIMA) or sub-tsGMs tailored to different groups to capture heterogeneity.

Result: Numerical experiments on four open datasets show the proposed approach significantly outperforms six state-of-the-art models.

Conclusion: The framework effectively unlocks the full potential of global forecasting models for heterogeneous datasets by addressing data heterogeneity challenges.

Abstract: A time series forecasting model, which is typically built on a single time series, is known as a local time series model (tsLM). In contrast, a forecasting model trained on multiple time series is referred to as a global time series model (tsGM). tsGMs can enhance forecasting accuracy and improve generalisation by learning cross-series information. As such, developing tsGMs has become a prominent research focus within the time series forecasting community. However, the benefits of tsGMs may not always be realised if the given set of time series is heterogeneous. While increasing model complexity can help tsGMs adapt to such a set of data, it can also increase the risk of overfitting and forecasting error. Additionally, the definition of homogeneity remains ambiguous in the literature. To address these challenges, this paper explores how to define data heterogeneity and proposes a two-stage modelling framework: At stage one, a tsGM is learnt to identify homogeneous patterns; and at stage two, tsLMs (e.g., ARIMA) or sub-tsGMs tailored to different groups are learnt to capture the heterogeneity. Numerical experiments on four open datasets demonstrate that the proposed approach significantly outperforms six state-of-the-art models. These results highlight its effectiveness in unlocking the full potential of global forecasting models for heterogeneous datasets.

[300] Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme

Mikhail Persiianov, Jiawei Chen, Petr Mokrov, Alexander Tyurin, Evgeny Burnaev, Alexander Korotin

Main category: cs.LG

TL;DR: iJKOnet combines JKO scheme with inverse optimization to learn population dynamics without restrictive architectural requirements, using end-to-end adversarial training.

Details

Motivation: To improve learning of population dynamics from evolutionary snapshots by addressing limitations of previous JKO-based methods that require restrictive neural network architectures.

Method: Combines JKO framework with inverse optimization techniques using end-to-end adversarial training, avoiding the need for input-convex neural networks.

Result: Establishes theoretical guarantees and demonstrates improved performance over prior JKO-based methods.

Conclusion: iJKOnet provides an effective approach for learning population dynamics with better performance and fewer architectural constraints compared to existing methods.

Abstract: Learning population dynamics involves recovering the underlying process that governs particle evolution, given evolutionary snapshots of samples at discrete time points. Recent methods frame this as an energy minimization problem in probability space and leverage the celebrated JKO scheme for efficient time discretization. In this work, we introduce iJKOnet, an approach that combines the JKO framework with inverse optimization techniques to learn population dynamics. Our method relies on a conventional end-to-end adversarial training procedure and does not require restrictive architectural choices, e.g., input-convex neural networks. We establish theoretical guarantees for our methodology and demonstrate improved performance over prior JKO-based methods.
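
For reference, the JKO scheme is usually written as the time-discretized minimization step below. The notation is the standard one and is not taken from the abstract: $\mathcal{F}$ is the energy functional, $\tau$ the step size, $\rho_k$ the population distribution at step $k$, and $W_2$ the 2-Wasserstein distance.

$$
\rho_{k+1} = \operatorname*{arg\,min}_{\rho} \left\{ \mathcal{F}(\rho) + \frac{1}{2\tau} W_2^2(\rho, \rho_k) \right\}
$$

Read this way, learning population dynamics amounts to recovering an $\mathcal{F}$ whose JKO steps reproduce the observed snapshots, which is where an inverse optimization view becomes natural.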

[301] When Are Concepts Erased From Diffusion Models?

Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen

Main category: cs.LG

TL;DR: Proposes two conceptual models for concept erasure in diffusion models and introduces a comprehensive suite of independent probing techniques to assess whether concepts have been truly erased.

Details

Motivation: Despite rapid development of concept erasure methods, it remains unclear how thoroughly these approaches remove target concepts from models, highlighting the need for better evaluation methods.

Method: Proposes two conceptual models for erasure mechanisms: (i) interfering with model’s internal guidance, and (ii) reducing unconditional likelihood of target concept. Introduces comprehensive probing techniques including visual context, trajectory modification, classifier guidance, and alternative generation analysis.

Result: Results demonstrate the value of exploring concept erasure robustness beyond adversarial text inputs and emphasize the importance of comprehensive evaluations for diffusion model erasure.

Conclusion: Comprehensive evaluation methods are crucial for properly assessing concept erasure in diffusion models, as current approaches may not fully remove target concepts.

Abstract: In concept erasure, a model is modified to selectively prevent it from generating a target concept. Despite the rapid development of new methods, it remains unclear how thoroughly these approaches remove the target concept from the model. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) interfering with the model’s internal guidance processes, and (ii) reducing the unconditional likelihood of generating the target concept, potentially removing it entirely. To assess whether a concept has been truly erased from the model, we introduce a comprehensive suite of independent probing techniques: supplying visual context, modifying the diffusion trajectory, applying classifier guidance, and analyzing the model’s alternative generations that emerge in place of the erased concept. Our results shed light on the value of exploring concept erasure robustness outside of adversarial text inputs, and emphasize the importance of comprehensive evaluations for erasure in diffusion models.

[302] Neural Attention: A Novel Mechanism for Enhanced Expressive Power in Transformer Models

Andrew DiGiugno, Ausif Mahmood

Main category: cs.LG

TL;DR: Proposes Neural Attention to replace dot products with feed-forward networks in transformers, enabling more expressive token relationships while maintaining compatibility with existing architectures.

Details

Motivation: Dot products in standard attention have limitations in capturing nonlinear relationships between embedding vectors, which restricts the representational capacity of transformers.

Method: Replace dot-product attention with feed-forward networks for attention matrix calculation, preserving matrix dimensions for easy integration into existing transformer architectures.

Result: NLP experiments on WikiText-103 show >2% perplexity reduction; image classification on CIFAR-10/100 shows >4 percentage point accuracy improvements. Higher computational demands are mitigated with optimization techniques.

Conclusion: Neural Attention effectively enhances transformer predictive capabilities across applications while maintaining practical usability through computational optimizations.

Abstract: Transformer models typically calculate attention matrices using dot products, which have limitations when capturing nonlinear relationships between embedding vectors. We propose Neural Attention, a technique that replaces dot products with feed-forward networks, enabling a more expressive representation of relationships between tokens. This approach modifies only the attention matrix calculation while preserving the matrix dimensions, making it easily adaptable to existing transformer-based architectures. We provide a detailed mathematical justification for why Neural Attention increases representational capacity and conduct controlled experiments to validate this claim. When comparing Neural Attention and Dot-Product Attention, NLP experiments on WikiText-103 show a reduction in perplexity of over 2 percent. Similarly, experiments on CIFAR-10 and CIFAR-100 show improvements in accuracy of more than 4 percentage points for image classification tasks. While Neural Attention introduces higher computational demands, we develop techniques to mitigate these challenges, ensuring practical usability without sacrificing the increased expressivity it provides. This work establishes Neural Attention as an effective means of enhancing the predictive capabilities of transformer models across a variety of applications. The code for all experiments is available at https://github.com/awayfromzel/neural-attention-research.
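
The core substitution can be sketched as scoring every query-key pair with a small feed-forward network instead of a dot product, while keeping the attention matrix the same shape. The layer sizes, the concatenation of query and key, and the ReLU nonlinearity below are assumptions made for illustration; the paper's actual architecture may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def neural_attention_scores(q, k, w1, b1, w2, b2):
    """Score each (query, key) pair with a one-hidden-layer network.

    q: (n, d), k: (m, d); returns an (n, m) score matrix, the same
    shape that q @ k.T would produce in dot-product attention.
    """
    n, m = q.shape[0], k.shape[0]
    pairs = np.concatenate(
        [np.repeat(q[:, None, :], m, axis=1), np.repeat(k[None, :, :], n, axis=0)],
        axis=-1,
    )  # (n, m, 2d)
    hidden = np.maximum(pairs @ w1 + b1, 0.0)  # (n, m, h)
    return (hidden @ w2 + b2).squeeze(-1)  # (n, m)


rng = np.random.default_rng(0)
n, d, h = 4, 8, 16  # hypothetical sequence length, head dim, hidden width
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
w1, b1 = rng.normal(size=(2 * d, h)), np.zeros(h)
w2, b2 = rng.normal(size=(h, 1)), 0.0
weights = softmax(neural_attention_scores(q, k, w1, b1, w2, b2))
print(weights.shape)
```

The intermediate `pairs` tensor grows with the square of the sequence length times the feature width, which hints at why the abstract mentions higher computational demands.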

[303] Conformal Prediction Adaptive to Unknown Subpopulation Shifts

Nien-Shao Wang, Duygu Nur Yaldiz, Yavuz Faruk Bakman, Sai Praneeth Karimireddy

Main category: cs.LG

TL;DR: New conformal prediction methods that adapt to unknown subpopulation shifts without requiring explicit group labels, maintaining coverage guarantees when standard methods fail.

Details

Motivation: Standard conformal prediction fails under subpopulation shifts where test data has different subpopulation mixes than calibration data, especially when subpopulation labels are unknown and must be inferred.

Method: Proposed methods that provably adapt conformal prediction to unknown subpopulation shifts by inferring subpopulation labels rather than requiring perfect labels, with algorithms that scale to high-dimensional settings.

Result: Extensive experiments on vision (vision transformers) and language (large language models) benchmarks show the methods reliably maintain coverage and effectively control risks where standard conformal prediction fails.

Conclusion: The framework provides formal coverage guarantees under unknown subpopulation shifts without explicit knowledge of subpopulation structure, making conformal prediction practical for realistic machine learning tasks with distribution shifts.

Abstract: Conformal prediction is widely used to equip black-box machine learning models with uncertainty quantification, offering formal coverage guarantees under exchangeable data. However, these guarantees fail when faced with subpopulation shifts, where the test environment contains a different mix of subpopulations than the calibration data. In this work, we focus on unknown subpopulation shifts where we are not given group information, i.e., the subpopulation labels of datapoints have to be inferred. We propose new methods that provably adapt conformal prediction to such shifts, ensuring valid coverage without explicit knowledge of subpopulation structure. While existing methods in similar setups assume perfect subpopulation labels, our framework explicitly relaxes this requirement and characterizes conditions where formal coverage guarantees remain feasible. Further, our algorithms scale to high-dimensional settings and remain practical in realistic machine learning tasks. Extensive experiments on vision (with vision transformers) and language (with large language models) benchmarks demonstrate that our methods reliably maintain coverage and effectively control risks in scenarios where standard conformal prediction fails.
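
As background, the standard split conformal threshold is the ceil((n + 1)(1 - alpha))-th smallest calibration score. The sketch below computes it globally and per group, where the group labels stand in for inferred subpopulations. It shows only this baseline calibration step, not the adaptive methods the paper proposes; the function names and example data are hypothetical.

```python
import math

import numpy as np


def conformal_threshold(cal_scores, alpha):
    """Split conformal quantile of nonconformity scores at level 1 - alpha."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return float(np.sort(cal_scores)[rank - 1])


def groupwise_thresholds(cal_scores, groups, alpha):
    """One threshold per (inferred) group label; labels may be noisy."""
    cal_scores, groups = np.asarray(cal_scores), np.asarray(groups)
    return {
        int(g): conformal_threshold(cal_scores[groups == g], alpha)
        for g in np.unique(groups)
    }


# Hypothetical calibration scores and inferred subpopulation labels.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
threshold = conformal_threshold(scores, alpha=0.25)
per_group = groupwise_thresholds(scores, [0, 0, 0, 0, 1, 1, 1], alpha=0.25)
print(threshold, per_group)
```

When the test mix of groups differs from the calibration mix, the single global threshold can under-cover the harder group, which is the failure mode the abstract describes.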

[304] LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, Kehan Li, Linjun Zhou, Qing Li, Shaohua Fan, Xiaoyu Lin, Xinyan Han, Xuanyue Li, Yan Lu, Yuan Xue, Yuanyuan Jiang, Zimu Wang, Zhenlei Wang, Peng Cui

Main category: cs.LG

TL;DR: LimiX-16M and LimiX-2M are large structured-data models that handle diverse tabular tasks through query-based conditional prediction, achieving state-of-the-art performance across 11 benchmarks without task-specific training.

Details

Motivation: Progress toward general intelligence requires foundation models that can handle structured data alongside language and physical world data, addressing the gap in comprehensive tabular data modeling.

Method: Pretrained using masked joint-distribution modeling with episodic, context-conditional objective, treating structured data as joint distribution over variables and missingness, enabling training-free adaptation at inference.

Result: LimiX-16M consistently surpasses strong baselines across classification, regression, missing value imputation, and data generation tasks, often by substantial margins. LimiX-2M delivers strong performance under tight compute constraints.

Conclusion: The models demonstrate the viability of foundation models for structured data, provide the first scaling law study for LDMs, and offer publicly accessible solutions for tabular data tasks under Apache 2.0 license.

Abstract: We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX-16M and LimiX-2M, two instantiations of our large structured-data models (LDMs). Both models treat structured data as a joint distribution over variables and missingness, and are thus capable of addressing a wide range of tabular tasks through query-based conditional prediction via a single model. They are pretrained using masked joint-distribution modeling with an episodic, context-conditional objective, supporting rapid, training-free adaptation at inference. We evaluate LimiX models across 11 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios. LimiX-16M consistently surpasses strong baselines, as shown in Figure 1 and Figure 2. The superiority holds across a wide range of tasks, such as classification, regression, missing value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures or bespoke training per task. Notably, LimiX-2M delivers strong results under tight compute and memory budgets. We also present the first scaling law study for LDMs, revealing how data and model scaling jointly influence downstream performance and offering quantitative guidance for tabular foundation modeling. All LimiX models are publicly accessible under Apache 2.0.

[305] Greedy Algorithm for Structured Bandits: A Sharp Characterization of Asymptotic Success / Failure

Aleksandrs Slivkins, Yunzong Xu, Shiliang Zuo

Main category: cs.LG

TL;DR: The paper analyzes when greedy algorithms succeed or fail in bandit problems with known reward structures, identifying partial identifiability as the key condition for asymptotic success.

Details

Motivation: Prior work on greedy algorithms in bandits focused on limited reward structures, so this research aims to extend analysis to arbitrary finite reward structures and provide a complete characterization of when greedy algorithms work.

Method: The authors study the greedy (exploitation-only) algorithm in bandit problems with known reward structures, analyzing arbitrary finite reward structures and characterizing asymptotic success/failure through partial identifiability properties.

Result: The paper fully characterizes when greedy algorithms asymptotically succeed (sublinear regret) or fail (linear regret), identifying partial identifiability as the necessary and sufficient condition for success. The characterization extends to contextual bandits and interactive decision-making.

Conclusion: Partial identifiability is the key property determining greedy algorithm success in bandits. When this property holds, the problem becomes easy and any non-degenerate algorithm will succeed. The results generalize to various bandit settings and infinite reward structures.

Abstract: We study the greedy (exploitation-only) algorithm in bandit problems with a known reward structure. We allow arbitrary finite reward structures, while prior work focused on a few specific ones. We fully characterize when the greedy algorithm asymptotically succeeds or fails, in the sense of sublinear vs. linear regret as a function of time. Our characterization identifies a partial identifiability property of the problem instance as the necessary and sufficient condition for the asymptotic success. Notably, once this property holds, the problem becomes easy – any algorithm will succeed (in the same sense as above), provided it satisfies a mild non-degeneracy condition. Our characterization extends to contextual bandits and interactive decision-making with arbitrary feedback. Examples demonstrating broad applicability and extensions to infinite reward structures are provided.
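
A toy version of the greedy algorithm on a finite reward structure might look like the sketch below: fit the candidate mean vector with the smallest squared error on the data so far, then play that candidate's best arm. The least-squares fit, the Gaussian noise, the tie-breaking by index and the two-candidate example are all assumptions for illustration and do not reproduce the paper's formal setup.

```python
import random


def greedy_structured_bandit(candidates, true_means, horizon, seed=0):
    """Exploitation-only play over a known finite set of mean-reward vectors."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    sq_err = [0.0] * len(candidates)  # cumulative squared error per candidate
    chosen = []
    for _ in range(horizon):
        # Best-fitting candidate so far (ties go to the lowest index).
        best_model = min(range(len(candidates)), key=lambda i: sq_err[i])
        means = candidates[best_model]
        arm = max(range(n_arms), key=lambda a: means[a])
        reward = rng.gauss(true_means[arm], 1.0)  # unit-variance noise, assumed
        for i, cand in enumerate(candidates):
            sq_err[i] += (reward - cand[arm]) ** 2
        chosen.append(arm)
    return chosen


# Both candidates agree on arm 0, so pulling arm 0 never tells them apart.
candidates = [[0.5, 0.0], [0.5, 1.0]]
chosen = greedy_structured_bandit(candidates, true_means=candidates[1], horizon=200)
print(set(chosen))
```

In this example greedy stays on arm 0 forever even though arm 1 is better under the true model, giving linear regret; it is meant as an intuition for why some form of identifiability matters.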

[306] NVIDIA Nemotron Nano V2 VL

NVIDIA: Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Guo Chen, Karan Sapra, Zhiding Yu, Adi Renduchintala, Charles Wang, Peter Jin, Arushi Goel, Mike Ranzinger, Lukas Voegtle, Philipp Fischer, Timo Roman, Wei Ping, Boxin Wang, Zhuolin Yang, Nayeon Lee, Shaokun Zhang, Fuxiao Liu, Zhiqi Li, Di Zhang, Greg Heinrich, Hongxu Yin, Song Han, Pavlo Molchanov, Parth Mannan, Yao Xu, Jane Polak Scowcroft, Tom Balough, Subhashree Radhakrishnan, Paris Zhang, Sean Cha, Ratnesh Kumar, Zaid Pervaiz Bhat, Jian Zhang, Darragh Hanley, Pritam Biswas, Jesse Oliver, Kevin Vasques, Roger Waleffe, Duncan Riach, Oluwatobi Olabiyi, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Pritam Gundecha, Khanh Nguyen, Alexandre Milesi, Eugene Khvedchenia, Ran Zilberstein, Ofri Masad, Natan Bagrov, Nave Assaf, Tomer Asida, Daniel Afrimi, Amit Zuker, Netanel Haber, Zhiyu Cheng, Jingyu Xin, Di Wu, Nik Spirin, Maryam Moosaei, Roman Ageev, Vanshil Atul Shah, Yuting Wu, Daniel Korzekwa, Unnikrishnan Kizhakkemadam Sreekumar, Wanli Jiang, Padmavathy Subramanian, Alejandra Rico, Sandip Bhaskar, Saeid Motiian, Kedi Wu, Annie Surla, Chia-Chih Chen, Hayden Wolff, Matthew Feinberg, Melissa Corpuz, Marek Wawrzos, Eileen Long, Aastha Jhunjhunwala, Paul Hendricks, Farzan Memarian, Benika Hall, Xin-Yu Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Krzysztof Pawelec, Michael Evans, Katherine Luna, Jie Lou, Erick Galinkin, Akshay Hazare, Kaustubh Purandare, Ann Guan, Anna Warno, Chen Cui, Yoshi Suhara, Shibani Likhite, Seph Mard, Meredith Price, Laya Sleiman, Saori Kaji, Udi Karpas, Kari Briski, Joey Conway, Michael Lightstone, Jan Kautz, Mohammad Shoeybi, Mostofa Patwary, Jonathen Cohen, Oleksii Kuchaiev, Andrew Tao, Bryan Catanzaro

Main category: cs.LG

TL;DR: Nemotron Nano V2 VL is an enhanced vision-language model that improves document understanding, video comprehension, and reasoning through architectural upgrades and token reduction techniques.

Details

Motivation: To create a more efficient vision-language model that excels in real-world document understanding, long video comprehension, and reasoning tasks while achieving higher inference throughput.

Method: Builds on Nemotron Nano V2 (hybrid Mamba-Transformer LLM) with innovative token reduction techniques, enhanced model architecture, datasets, and training recipes.

Result: Significant improvements over previous model (Llama-3.1-Nemotron-Nano-VL-8B) across all vision and text domains with higher inference throughput for long documents and videos.

Conclusion: The model represents a major advancement in vision-language capabilities and is being released with checkpoints (BF16, FP8, FP4) and open-sourcing datasets, recipes, and training code.

Abstract: We introduce Nemotron Nano V2 VL, the latest model of the Nemotron vision-language series designed for strong real-world document understanding, long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers significant improvements over our previous model, Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major enhancements in model architecture, datasets, and training recipes. Nemotron Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and innovative token reduction techniques to achieve higher inference throughput in long document and video scenarios. We are releasing model checkpoints in BF16, FP8, and FP4 formats and sharing large parts of our datasets, recipes and training code.

[307] Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou

Main category: cs.LG

TL;DR: RLVR training suffers from exploration collapse as policy entropy drops. The paper identifies that valuable low-probability “reasoning sparks” get eliminated during training. Lp-Reg regularizes policy towards a filtered proxy distribution that amplifies these sparks, enabling stable scaling and SOTA performance on math benchmarks.

Details

Motivation: RLVR training plateaus due to loss of exploration as policy entropy collapses. Previous methods focus on maintaining high entropy but risk amplifying irrelevant tokens. The key issue is the systematic elimination of valuable low-probability exploratory tokens (reasoning sparks) during training.

Method: Introduces Low-probability Regularization (Lp-Reg) which regularizes the policy towards a heuristic proxy distribution. The proxy is constructed by filtering out noise tokens and re-normalizing over remaining candidates, amplifying reasoning sparks. This serves as a soft regularization target via KL divergence to protect valuable tokens.

Result: Lp-Reg enables stable on-policy RL scaling across 3,000 training steps and 81,204 GPU-hours, where baseline entropy-control methods collapse. Achieves 60.17% average accuracy on five math benchmarks, improving by 2.66% over prior methods.

Conclusion: Lp-Reg effectively addresses exploration collapse in RLVR by protecting valuable low-probability reasoning sparks through targeted regularization, enabling sustained scaling and state-of-the-art performance on complex reasoning tasks.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term “reasoning sparks”. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy RL, sustaining continuous scaling across 3,000 training steps and 81,204 GPU-hours, where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
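
The proxy construction described above (filter presumed noise tokens, re-normalize, regularize with KL) can be sketched on a single next-token distribution. The fixed probability cutoff, the KL direction and the example probabilities are assumptions for illustration; the released code should be consulted for the actual filtering rule.

```python
import numpy as np


def filtered_proxy(probs, min_prob):
    """Zero out tokens below `min_prob` and re-normalize the rest."""
    kept = np.where(probs >= min_prob, probs, 0.0)
    return kept / kept.sum()


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), summed over tokens where p is non-zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))


# Hypothetical policy distribution: one dominant token, two low-probability
# candidates, and two tokens treated as noise.
policy = np.array([0.90, 0.06, 0.03, 0.005, 0.005])
proxy = filtered_proxy(policy, min_prob=0.01)
penalty = kl_divergence(proxy, policy)
print(proxy.round(4), round(penalty, 5))
```

After re-normalization the surviving low-probability tokens carry slightly more mass in the proxy than in the policy, so a KL term toward the proxy pushes back against their elimination.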

[308] Better Neural Network Expressivity: Subdividing the Simplex

Egor Bakaev, Florestan Brunck, Christoph Hertrich, Jack Stade, Amir Yehudayoff

Main category: cs.LG

TL;DR: This paper disproves a conjecture about ReLU neural network depth requirements, showing that fewer hidden layers than previously thought are sufficient to compute all continuous piecewise linear functions on R^n.

Details

Motivation: To challenge the optimality of the known depth bound for ReLU networks computing CPWL functions, specifically disproving the conjecture that ⌈log₂(n+1)⌉ hidden layers are necessary.

Method: The authors demonstrate that ReLU networks with two hidden layers can exactly represent the maximum function of five inputs, and generalize this to show that ⌈log₃(n-1)⌉+1 layers are sufficient for all CPWL functions on R^n.

Result: The paper proves that ⌈log₃(n-1)⌉+1 hidden layers are sufficient to compute all CPWL functions on R^n, which is fewer than the previously conjectured optimal bound of ⌈log₂(n+1)⌉.

Conclusion: The conjecture about the optimal depth of ReLU networks for CPWL functions is disproven, and the new depth bound nearly matches known lower bounds, with geometric interpretations via polyhedral subdivisions.

Abstract: This work studies the expressivity of ReLU neural networks with a focus on their depth. A sequence of previous works showed that $\lceil \log_2(n+1) \rceil$ hidden layers are sufficient to compute all continuous piecewise linear (CPWL) functions on $\mathbb{R}^n$. Hertrich, Basu, Di Summa, and Skutella (NeurIPS'21 / SIDMA'23) conjectured that this result is optimal in the sense that there are CPWL functions on $\mathbb{R}^n$, like the maximum function, that require this depth. We disprove the conjecture and show that $\lceil\log_3(n-1)\rceil+1$ hidden layers are sufficient to compute all CPWL functions on $\mathbb{R}^n$. A key step in the proof is that ReLU neural networks with two hidden layers can exactly represent the maximum function of five inputs. More generally, we show that $\lceil\log_3(n-2)\rceil+1$ hidden layers are sufficient to compute the maximum of $n\geq 4$ numbers. Our constructions almost match the $\lceil\log_3(n)\rceil$ lower bound of Averkov, Hojny, and Merkert (ICLR'25) in the special case of ReLU networks with weights that are decimal fractions. The constructions have a geometric interpretation via polyhedral subdivisions of the simplex into “easier” polytopes.
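
To make the gap between the two bounds concrete, the short sketch below evaluates both formulas from the abstract for a few input dimensions. The helper uses integer arithmetic to avoid floating-point rounding in the logarithms; the function names and the chosen values of n (starting at 4, where the bounds first differ) are assumptions for illustration.

```python
def ceil_log(base: int, x: int) -> int:
    """Smallest integer k >= 0 with base**k >= x, for integer x >= 1."""
    k, power = 0, 1
    while power < x:
        power *= base
        k += 1
    return k


def previous_depth(n: int) -> int:
    """Earlier sufficient depth: ceil(log2(n + 1)) hidden layers."""
    return ceil_log(2, n + 1)


def new_depth(n: int) -> int:
    """Depth shown sufficient in the paper: ceil(log3(n - 1)) + 1 hidden layers."""
    return ceil_log(3, n - 1) + 1


for n in (4, 10, 28, 100):
    print(n, previous_depth(n), new_depth(n))
```

For n = 4 the formulas give 3 and 2 hidden layers respectively, which matches the abstract's statement that two hidden layers suffice for the maximum of five inputs.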

[309] Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities

Tara Akhound-Sadegh, Jungyoon Lee, Avishek Joey Bose, Valentin De Bortoli, Arnaud Doucet, Michael M. Bronstein, Dominique Beaini, Siamak Ravanbakhsh, Kirill Neklyudov, Alexander Tong

Main category: cs.LG

TL;DR: PITA is a novel diffusion-based sampling framework that combines temperature annealing with diffusion smoothing to enable efficient equilibrium sampling of complex molecular systems with dramatically fewer energy evaluations.

Details

Motivation: Existing diffusion-based samplers cannot handle distributions at the scale of molecular systems, creating a need for more efficient sampling methods for scientific applications.

Method: PITA trains a sequence of diffusion models from high to low temperatures, combining temperature annealing of the Boltzmann distribution with diffusion smoothing. Samples from each trained model are annealed to a lower temperature at inference time, using a Feynman-Kac PDE combined with Sequential Monte Carlo, and serve as training data for the next model.

Result: PITA enables first-time equilibrium sampling of N-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with significantly reduced energy function evaluations.

Conclusion: PITA makes equilibrium sampling of N-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates feasible for diffusion-based samplers for the first time, while requiring far fewer energy function evaluations.

Abstract: Sampling efficiently from a target unnormalized probability density remains a core challenge, with relevance across countless high-impact scientific applications. A promising approach towards this challenge is the design of amortized samplers that borrow key ideas, such as probability path design, from state-of-the-art generative diffusion models. However, all existing diffusion-based samplers remain unable to draw samples from distributions at the scale of even simple molecular systems. In this paper, we propose Progressive Inference-Time Annealing (PITA), a novel framework to learn diffusion-based samplers that combines two complementary interpolation techniques: I.) Annealing of the Boltzmann distribution and II.) Diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperatures by sequentially training each model at progressively higher temperatures, leveraging engineered easy access to samples of the temperature-annealed target density. In the subsequent step, PITA enables simulating the trained diffusion model to procure training samples at a lower temperature for the next diffusion model through inference-time annealing using a novel Feynman-Kac PDE combined with Sequential Monte Carlo. Empirically, PITA enables, for the first time, equilibrium sampling of N-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with dramatically lower energy function evaluations. Code available at: https://github.com/taraak/pita
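The temperature-annealing idea can be illustrated with the textbook reweighting identity between two Boltzmann densities: samples from exp(-E(x)/T_high) can be reweighted toward exp(-E(x)/T_low) with log-weights -(1/T_low - 1/T_high)·E(x). The sketch below shows only this identity, not PITA's Feynman-Kac PDE or its Sequential Monte Carlo procedure; the function names are hypothetical.

```python
import math


def anneal_log_weights(energies, t_high, t_low):
    """Log importance weights for cooling samples from t_high to t_low."""
    if not 0 < t_low <= t_high:
        raise ValueError("expected 0 < t_low <= t_high")
    delta_beta = 1.0 / t_low - 1.0 / t_high  # increase in inverse temperature
    return [-delta_beta * e for e in energies]


def normalize(log_weights):
    """Turn log-weights into probabilities, subtracting the max for stability."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [wi / total for wi in w]


# Toy energies of four samples drawn at the higher temperature (made-up values).
energies = [0.0, 0.5, 1.0, 3.0]
weights = normalize(anneal_log_weights(energies, t_high=2.0, t_low=1.0))
print(weights)  # lower-energy samples receive more weight after cooling
```

In a Sequential Monte Carlo setting, weights of this kind are typically used to resample particles before further propagation.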

[310] Efficient and Unbiased Sampling from Boltzmann Distributions via Variance-Tuned Diffusion Models

Fengzhe Zhang, Laurence I. Midgley, José Miguel Hernández-Lobato

Main category: cs.LG

TL;DR: VT-DIS is a lightweight post-training method that adapts pretrained score-based diffusion models to enable efficient importance sampling by minimizing α-divergence between forward and reverse trajectories, achieving high effective sample sizes with minimal computational overhead.

Details

Motivation: Score-based diffusion models suffer from biased Monte Carlo estimates due to imperfect score estimates, and traditional importance sampling correction requires solving expensive probability-flow ODEs that scale poorly with dimensionality.

Method: Variance-Tuned Diffusion Importance Sampling (VT-DIS) adapts per-step noise covariance of pretrained SBDMs by minimizing α-divergence (α=2) between forward diffusion and reverse denoising trajectories, assigning trajectory-wise importance weights.

Result: On DW-4, LJ-13, and alanine-dipeptide benchmarks, VT-DIS achieves effective sample sizes of ~80%, 35%, and 3.5% respectively, using only a fraction of computational budget compared to vanilla diffusion + IS or PF-ODE-based IS.

Conclusion: VT-DIS provides an efficient post-training method for unbiased expectation estimates with negligible overhead compared to standard sampling, overcoming computational limitations of traditional importance sampling approaches.

Abstract: Score-based diffusion models (SBDMs) are powerful amortized samplers for Boltzmann distributions; however, imperfect score estimates bias downstream Monte Carlo estimates. Classical importance sampling (IS) can correct this bias, but computing exact likelihoods requires solving the probability-flow ordinary differential equation (PF-ODE), a procedure that is prohibitively costly and scales poorly with dimensionality. We introduce Variance-Tuned Diffusion Importance Sampling (VT-DIS), a lightweight post-training method that adapts the per-step noise covariance of a pretrained SBDM by minimizing the $\alpha$-divergence ($\alpha=2$) between its forward diffusion and reverse denoising trajectories. VT-DIS assigns a single trajectory-wise importance weight to the joint forward-reverse process, yielding unbiased expectation estimates at test time with negligible overhead compared to standard sampling. On the DW-4, LJ-13, and alanine-dipeptide benchmarks, VT-DIS achieves effective sample sizes of approximately 80%, 35%, and 3.5%, respectively, while using only a fraction of the computational budget required by vanilla diffusion + IS or PF-ODE-based IS.
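The effective sample sizes quoted above are percentages of the number of drawn samples. A common way to compute such a figure from importance weights is the Kish estimator, (Σw)² / Σw². The sketch below assumes that definition, which the abstract does not spell out, and uses hypothetical function names.

```python
import math


def effective_sample_size(log_weights):
    """Kish effective sample size (sum w)^2 / sum w^2 from log-weights."""
    m = max(log_weights)  # subtract the max so exp() cannot overflow
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(wi * wi for wi in w)


def ess_fraction(log_weights):
    """Effective sample size as a fraction of the number of samples."""
    return effective_sample_size(log_weights) / len(log_weights)


uniform = [0.0, 0.0, 0.0, 0.0]          # every trajectory equally weighted
degenerate = [0.0, -50.0, -50.0, -50.0]  # one trajectory dominates
print(ess_fraction(uniform), ess_fraction(degenerate))
```

Equal weights give a fraction of 1, while a single dominant weight drives the fraction toward 1/N, which is why a higher percentage indicates a proposal that better matches the target.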

[311] Less Greedy Equivalence Search

Adiba Ejaz, Elias Bareinboim

Main category: cs.LG

TL;DR: LGES is an improved version of GES that provides faster computation and better accuracy by avoiding unnecessary edge insertions based on conditional independence scores, while maintaining theoretical guarantees.

Details

Motivation: GES faces practical challenges in computational cost and finite-sample accuracy despite its theoretical soundness.

Method: LGES modifies GES by avoiding edge insertions between variables where score implies conditional independence, uses prior knowledge, and can handle interventional data.

Result: LGES achieves up to 10-fold speed-up, substantial reduction in structural error, and outperforms GES and other baselines in speed, accuracy, and robustness.

Conclusion: LGES successfully addresses GES limitations while preserving theoretical guarantees, making it a practical and robust causal discovery algorithm.

Abstract: Greedy Equivalence Search (GES) is a classic score-based algorithm for causal discovery from observational data. In the sample limit, it recovers the Markov equivalence class of graphs that describe the data. Still, it faces two challenges in practice: computational cost and finite-sample accuracy. In this paper, we develop Less Greedy Equivalence Search (LGES), a variant of GES that retains its theoretical guarantees while partially addressing these limitations. LGES modifies the greedy step; rather than always applying the highest-scoring insertion, it avoids edge insertions between variables for which the score implies some conditional independence. This more targeted search yields up to a 10-fold speed-up and a substantial reduction in structural error relative to GES. Moreover, LGES can guide the search using prior knowledge, and can correct this knowledge when contradicted by data. Finally, LGES can use interventional data to refine the learned observational equivalence class. We prove that LGES recovers the true equivalence class in the sample limit, even with misspecified knowledge. Experiments demonstrate that LGES outperforms GES and other baselines in speed, accuracy, and robustness to misspecified knowledge. Our code is available at https://github.com/CausalAILab/lges.
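As a rough illustration of what it means for a score to "imply" an independence, the sketch below computes the change in a linear-Gaussian BIC score when an edge X → Y is added to an empty two-variable graph. A non-positive change means the score prefers to treat X and Y as independent. This is a simplified stand-in with an empty conditioning set and hypothetical function names; it is not the LGES procedure itself.

```python
import math


def bic_insertion_gain(x, y):
    """BIC change from adding the edge X -> Y to an empty graph (linear-Gaussian).

    The log-likelihood improves by -(n/2) * log(1 - r^2), and the extra
    regression coefficient costs (1/2) * log(n).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables need non-zero variance")
    r2 = sxy * sxy / (sxx * syy)  # squared sample correlation
    if r2 >= 1.0:
        return math.inf
    return -0.5 * n * math.log(1.0 - r2) - 0.5 * math.log(n)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dependent = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # roughly 2 * x (made-up data)
unrelated = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # alternating signs (made-up data)
print(bic_insertion_gain(x, dependent) > 0)  # insertion improves the score
print(bic_insertion_gain(x, unrelated) > 0)  # score prefers independence
```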

[312] FedFACT: A Provable Framework for Controllable Group-Fairness Calibration in Federated Learning

Li Zhang, Zhongxuan Han, Xiaohua Feng, Jiaming Zhang, Yuyuan Li, Chaochao Chen

Main category: cs.LG

TL;DR: FedFACT is a federated learning framework that addresses global and local fairness challenges by formulating fair FL as personalized cost-sensitive learning and bi-level optimization, achieving optimal accuracy-fairness trade-offs.

Details

Motivation: Current FL research lacks methods to harmonize global and local fairness in multi-class settings and enable controllable accuracy-fairness trade-offs due to the non-decomposable, non-differentiable nature of fairness criteria.

Method: FedFACT identifies Bayes-optimal classifiers under fairness constraints, reformulates fair FL as personalized cost-sensitive learning (in-processing) and bi-level optimization (post-processing).

Result: The framework provides convergence and generalization guarantees, and experiments show it outperforms baselines in balancing accuracy and global-local fairness across various data heterogeneity scenarios.

Conclusion: FedFACT effectively addresses fundamental fairness challenges in FL by providing a controllable framework that harmonizes global and local fairness while maintaining optimal accuracy.

Abstract: With the emerging application of Federated Learning (FL) in decision-making scenarios, it is imperative to regulate model fairness to prevent disparities across sensitive groups (e.g., female, male). Current research predominantly focuses on two concepts of group fairness within FL: Global Fairness (overall model disparity across all clients) and Local Fairness (the disparity within each client). However, the non-decomposable, non-differentiable nature of fairness criteria poses two fundamental, unresolved challenges for fair FL: (i) harmonizing global and local fairness, especially in multi-class settings; (ii) enabling a controllable, optimal accuracy-fairness trade-off. To tackle these challenges, we propose a novel controllable federated group-fairness calibration framework, named FedFACT. FedFACT identifies the Bayes-optimal classifiers under both global and local fairness constraints, yielding models with minimal performance decline while guaranteeing fairness. Building on the characterization of the optimal fair classifiers, we reformulate fair federated learning as a personalized cost-sensitive learning problem for in-processing and a bi-level optimization for post-processing. Theoretically, we provide convergence and generalization guarantees for FedFACT to approach the near-optimal accuracy under given fairness levels. Extensive experiments on multiple datasets across various levels of data heterogeneity demonstrate that FedFACT consistently outperforms baselines in balancing accuracy and global-local fairness.
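The distinction between local and global fairness can be made concrete with a small example. The sketch below assumes demographic parity (equal positive-prediction rates across two groups) as the fairness criterion, which is one common choice and not necessarily the one used in the paper; the data and function names are made up for illustration.

```python
def positive_rate(records, group):
    """Share of positive predictions among records belonging to `group`."""
    preds = [pred for g, pred in records if g == group]
    return sum(preds) / len(preds)


def demographic_parity_gap(records):
    """Absolute difference in positive-prediction rate between groups 0 and 1."""
    return abs(positive_rate(records, 0) - positive_rate(records, 1))


# Each record is (sensitive_group, binary_prediction); all values are hypothetical.
client_a = [(0, 1)] * 4 + [(0, 0)] * 4 + [(1, 1)] * 1 + [(1, 0)] * 1
client_b = [(0, 1)] * 2 + [(1, 1)] * 8

local_gaps = [demographic_parity_gap(c) for c in (client_a, client_b)]
global_gap = demographic_parity_gap(client_a + client_b)
print(local_gaps, global_gap)
```

Both clients are perfectly fair locally, yet the pooled predictions show a gap of about 0.3 because the groups are distributed unevenly across clients. This is why the two notions have to be harmonized rather than optimized separately.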

[313] Conformal Information Pursuit for Interactively Guiding Large Language Models

Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, René Vidal

Main category: cs.LG

TL;DR: Conformal Information Pursuit (C-IP) improves sequential querying in LLMs by using conformal prediction sets instead of traditional information gain measures, leading to better performance and shorter query chains.

Details

Motivation: Traditional Information Pursuit (IP) suffers from inaccurate uncertainty estimation due to LLMs' over- or under-confident probabilities, resulting in suboptimal query selection and predictive performance in interactive question-answering tasks.

Method: Proposes Conformal Information Pursuit (C-IP) that leverages conformal prediction sets to estimate uncertainty, using the average size of these sets as a distribution-free and robust alternative to conditional entropy for measuring uncertainty at each iteration.

Result: C-IP achieves better predictive performance and shorter query-answer chains compared to traditional IP and uncertainty-based chain-of-thought methods on 20 Questions, and competitive performance with direct single-turn prediction on MediQ medical dataset while offering greater interpretability.

Conclusion: Conformal prediction sets provide a more reliable method for uncertainty estimation in sequential querying tasks, enabling LLMs to make more efficient and interpretable interactive predictions.

Abstract: A significant use case of instruction-finetuned Large Language Models (LLMs) is to solve question-answering tasks interactively. In this setting, an LLM agent is tasked with making a prediction by sequentially querying relevant information from the user, as opposed to a single-turn conversation. This paper explores sequential querying strategies that aim to minimize the expected number of queries. One such strategy is Information Pursuit (IP), a greedy algorithm that at each iteration selects the query that maximizes information gain or equivalently minimizes uncertainty. However, obtaining accurate estimates of mutual information or conditional entropy for LLMs is very difficult in practice due to over- or under-confident LLM probabilities, which leads to suboptimal query selection and predictive performance. To better estimate the uncertainty at each iteration, we propose Conformal Information Pursuit (C-IP), an alternative approach to sequential information gain based on conformal prediction sets. More specifically, C-IP leverages a relationship between prediction sets and conditional entropy at each iteration to estimate uncertainty based on the average size of conformal prediction sets. In contrast to conditional entropy, we find that conformal prediction sets are a distribution-free and robust method of measuring uncertainty. Experiments with 20 Questions show that C-IP obtains better predictive performance and shorter query-answer chains compared to previous approaches to IP and uncertainty-based chain-of-thought methods. Furthermore, extending to an interactive medical setting between a doctor and a patient on the MediQ dataset, C-IP achieves competitive performance with direct single-turn prediction while offering greater interpretability.
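The idea of using prediction-set size as an uncertainty signal can be sketched with standard split conformal prediction. The example assumes the nonconformity score 1 - p(label) and a held-out calibration set; C-IP's actual construction may differ, and the function names and numbers are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: include every label
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score 1 - p does not exceed the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= threshold]


def average_set_size(batch_of_probs, threshold):
    """Mean prediction-set size, used here as the uncertainty estimate."""
    sizes = [len(prediction_set(p, threshold)) for p in batch_of_probs]
    return sum(sizes) / len(sizes)


# Calibration scores 1 - p(true label) on held-out examples (made-up values).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q = conformal_threshold(cal_scores, alpha=0.25)

confident = {"a": 0.90, "b": 0.06, "c": 0.04}
unsure = {"a": 0.40, "b": 0.35, "c": 0.25}
print(prediction_set(confident, q), prediction_set(unsure, q))
print(average_set_size([confident, unsure], q))
```

A query whose answers leave small prediction sets on average would be treated as more informative than one that leaves large sets.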

[314] Auto-Compressing Networks

Vaggelis Dorovatas, Georgios Paraskevopoulos, Alexandros Potamianos

Main category: cs.LG

TL;DR: Auto-Compressing Networks (ACNs) replace short residual connections with long feedforward connections from each layer to output, enabling automatic information compression during training that reduces redundancy and improves efficiency.

Details

Motivation: Deep neural networks with residual connections often suffer from computational redundancy as depth increases, without corresponding improvements in representation quality. The goal is to create architectures that automatically adapt their computational footprint to task complexity.

Method: Replace traditional short residual connections with additive long feedforward connections from each layer directly to the output. This architectural modification induces auto-compression dynamics where information is pushed into early layers during training.

Result: ACNs achieve up to 18% reduction in catastrophic forgetting and 30-80% architectural compression while maintaining accuracy. They also demonstrate enhanced noise robustness, superior low-data performance, improved transfer learning, and better generalization with fewer parameters across vision transformers, MLP-mixers, and BERT architectures.

Conclusion: ACNs provide a practical approach for developing efficient neural architectures that automatically adapt computational footprint to task complexity, learn robust representations suitable for noisy real-world tasks, and mitigate catastrophic forgetting in continual learning scenarios.

Abstract: Deep neural networks with short residual connections have demonstrated remarkable success across domains, but increasing depth often introduces computational redundancy without corresponding improvements in representation quality. We introduce Auto-Compressing Networks (ACNs), an architectural variant where additive long feedforward connections from each layer to the output replace traditional short residual connections. By analyzing the distinct dynamics induced by this modification, we reveal a unique property we term auto-compression: the ability of a network to organically compress information during training with gradient descent, through architectural design alone. Through auto-compression, information is dynamically “pushed” into early layers during training, enhancing their representational quality and revealing potential redundancy in deeper ones. We theoretically show that this property emerges from layer-wise training patterns present in ACNs, where layers are dynamically utilized during training based on task requirements. We also find that ACNs exhibit enhanced noise robustness compared to residual networks, superior performance in low-data settings, improved transfer learning capabilities, and reduced catastrophic forgetting, suggesting that they learn representations that generalize better despite using fewer parameters. Our results demonstrate up to 18% reduction in catastrophic forgetting and 30-80% architectural compression while maintaining accuracy across vision transformers, MLP-mixers, and BERT architectures. These findings establish ACNs as a practical approach to developing efficient neural architectures that automatically adapt their computational footprint to task complexity, while learning robust representations suitable for noisy real-world tasks and continual learning scenarios.
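The architectural difference can be sketched with scalar toy layers. The residual version adds each layer's output to its own input, while the ACN-style version chains the layers without skip connections and sums every layer's output at the end. Whether the input itself is also added to the output is not stated in the abstract, so leaving it out is an assumption of this sketch, as are the function names.

```python
def residual_forward(x, layers):
    """Short residual connections: x <- x + f(x) at every layer."""
    for f in layers:
        x = x + f(x)
    return x


def acn_forward(x, layers):
    """ACN-style sketch: plain chain of layers, output = sum of all layer outputs.

    Assumption: the raw input is not added to the output.
    """
    total = 0.0
    for f in layers:
        x = f(x)    # no skip connection between consecutive layers
        total += x  # long additive connection from this layer to the output
    return total


# Two toy scalar "layers" standing in for learned blocks.
layers = [lambda v: 2.0 * v, lambda v: v + 1.0]
print(residual_forward(1.0, layers), acn_forward(1.0, layers))
```

Because every layer feeds the output directly, dropping the last layers of the ACN-style network still leaves a valid (partial) sum, which is the structural property that makes post-hoc depth pruning natural.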

[315] Graph Learning

Feng Xia, Ciyuan Peng, Jing Ren, Falih Gozi Febrinanto, Renqiang Luo, Vidya Saikrishna, Shuo Yu, Xiangjie Kong

Main category: cs.LG

TL;DR: This survey provides a comprehensive overview of graph learning, covering key dimensions like scalable, temporal, multimodal, generative, explainable, and responsible graph learning, along with emerging trends and future directions.

Details

Motivation: Graph learning is significant due to its ability to model complex, non-Euclidean relationships that traditional machine learning struggles to capture, supporting real-world applications from drug discovery to recommender systems.

Method: The survey reviews state-of-the-art techniques for handling large-scale graphs, capturing dynamic temporal dependencies, integrating heterogeneous data modalities, generating novel graph samples, and enhancing interpretability.

Result: The paper serves as a comprehensive resource that identifies and discusses emerging topics, highlighting recent integration of graph learning with other AI paradigms.

Conclusion: This survey provides valuable insights into the rapidly evolving landscape of graph learning, addressing challenges like scalability, generalization, heterogeneity, interpretability, and trustworthiness to unlock the field’s full potential.

Abstract: Graph learning has rapidly evolved into a critical subfield of machine learning and artificial intelligence (AI). Its development began with early graph-theoretic methods, gaining significant momentum with the advent of graph neural networks (GNNs). Over the past decade, progress in scalable architectures, dynamic graph modeling, multimodal learning, generative AI, explainable AI (XAI), and responsible AI has broadened the applicability of graph learning to various challenging environments. Graph learning is significant due to its ability to model complex, non-Euclidean relationships that traditional machine learning struggles to capture, thus better supporting real-world applications ranging from drug discovery and fraud detection to recommender systems and scientific reasoning. However, challenges like scalability, generalization, heterogeneity, interpretability, and trustworthiness must be addressed to unlock its full potential. This survey provides a comprehensive introduction to graph learning, focusing on key dimensions including scalable, temporal, multimodal, generative, explainable, and responsible graph learning. We review state-of-the-art techniques for efficiently handling large-scale graphs, capturing dynamic temporal dependencies, integrating heterogeneous data modalities, generating novel graph samples, and enhancing interpretability to foster trust and transparency. We also explore ethical considerations, such as privacy and fairness, to ensure responsible deployment of graph learning models. Additionally, we identify and discuss emerging topics, highlighting recent integration of graph learning and other AI paradigms and offering insights into future directions. This survey serves as a valuable resource for researchers and practitioners seeking to navigate the rapidly evolving landscape of graph learning.

[316] Differentially Private Bilevel Optimization: Efficient Algorithms with Near-Optimal Rates

Andrew Lowy, Daogao Liu

Main category: cs.LG

TL;DR: This paper studies differentially private bilevel optimization, providing nearly tight bounds for convex outer objectives and developing efficient algorithms for both convex and non-convex settings with dimension-independent inner problem complexity.

Details

Motivation: Bilevel optimization underlies many machine learning applications (meta-learning, hyperparameter optimization) that involve sensitive training data, raising privacy concerns that motivate the study of differentially private bilevel optimization.

Method: The authors provide novel upper and lower bounds for both pure and approximate differential privacy, achieved via efficient implementations of exponential and regularized exponential mechanisms. A key technical contribution is a new method for log-concave sampling under inexact function evaluations.

Result: The bounds are nearly tight and essentially match optimal rates for standard single-level differentially private ERM, with additional terms capturing the nested bilevel structure complexity. For non-convex settings, they develop algorithms with state-of-the-art rates for finding approximate stationary points.

Conclusion: The paper establishes fundamental privacy-utility tradeoffs for bilevel optimization, with bounds that don’t depend on the dimension of the inner problem, making the approach scalable and practical for real-world applications.

Abstract: Bilevel optimization, in which one optimization problem is nested inside another, underlies many machine learning applications with a hierarchical structure – such as meta-learning and hyperparameter optimization. Such applications often involve sensitive training data, raising pressing concerns about individual privacy. Motivated by this, we study differentially private bilevel optimization. We first focus on settings where the outer-level objective is convex, and provide novel upper and lower bounds on the excess empirical risk for both pure and approximate differential privacy. These bounds are nearly tight and essentially match the optimal rates for standard single-level differentially private ERM, up to additional terms that capture the intrinsic complexity of the nested bilevel structure. We also provide population loss bounds for bilevel stochastic optimization. The bounds are achieved in polynomial time via efficient implementations of the exponential and regularized exponential mechanisms. A key technical contribution is a new method and analysis of log-concave sampling under inexact function evaluations, which may be of independent interest. In the non-convex setting, we develop novel algorithms with state-of-the-art rates for privately finding approximate stationary points. Notably, our bounds do not depend on the dimension of the inner problem.
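For readers unfamiliar with the exponential mechanism mentioned above, the sketch below shows its standard finite-candidate form: a candidate with utility u is selected with probability proportional to exp(ε·u / (2Δ)), where Δ is the sensitivity of the utility. The paper works with continuous domains and log-concave sampling, so this discrete version is only an illustration; the function names and numbers are hypothetical.

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    m = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(scale * (u - m)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(candidates, utilities, epsilon, sensitivity, rng):
    """Draw one candidate according to the exponential mechanism."""
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


# For loss minimization, use the negative empirical loss as the utility.
candidates = ["theta_0", "theta_1", "theta_2"]
utilities = [-2.0, -1.0, 0.0]
rng = random.Random(0)
print(exponential_mechanism_probs(utilities, epsilon=1.0, sensitivity=1.0))
print(exponential_mechanism(candidates, utilities, 1.0, 1.0, rng))
```

Smaller ε flattens the distribution toward uniform (more privacy, less utility), while larger ε concentrates it on the best candidate.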

[317] Optimism Without Regularization: Constant Regret in Zero-Sum Games

John Lazarsfeld, Georgios Piliouras, Ryann Sim, Stratis Skoulakis

Main category: cs.LG

TL;DR: Optimistic Fictitious Play achieves constant regret in two-strategy zero-sum games without regularization, while Alternating Fictitious Play has an Ω(√T) regret lower bound.

Details

Motivation: To understand whether non-no-regret algorithms can achieve fast learning rates in games, specifically investigating if optimal regret rates are achievable without regularization.

Method: Analyzed Optimistic Fictitious Play in two-strategy zero-sum games using geometric analysis in the dual space of payoff vectors and studied an energy function of iterates.

Result: Proved that Optimistic Fictitious Play obtains only constant regret in two-strategy games, and showed that Alternating Fictitious Play has an Ω(√T) regret lower bound.

Conclusion: Optimism enables fast learning without regularization. In the unregularized regime, this separates optimism from alternation in their ability to achieve o(√T) regret.

Abstract: This paper studies the optimistic variant of Fictitious Play for learning in two-player zero-sum games. While it is known that Optimistic FTRL – a regularized algorithm with a bounded stepsize parameter – obtains constant regret in this setting, we show for the first time that similar, optimal rates are also achievable without regularization: we prove for two-strategy games that Optimistic Fictitious Play (using any tiebreaking rule) obtains only constant regret, providing surprising new evidence on the ability of non-no-regret algorithms for fast learning in games. Our proof technique leverages a geometric view of Optimistic Fictitious Play in the dual space of payoff vectors, where we show a certain energy function of the iterates remains bounded over time. Additionally, we also prove a regret lower bound of $\Omega(\sqrt{T})$ for Alternating Fictitious Play. In the unregularized regime, this separates the ability of optimism and alternation in achieving $o(\sqrt{T})$ regret.
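The difference between standard and optimistic Fictitious Play can be sketched on a 2x2 zero-sum game. The sketch assumes the usual optimistic construction, in which each player best-responds to the cumulative payoff vector with the most recent payoff vector counted one extra time; the abstract does not define the update, so this is an assumption, as are the lowest-index tiebreaking rule and the function names.

```python
def argmax(values):
    """Index of the largest entry, breaking ties toward the lowest index."""
    return max(range(len(values)), key=lambda k: values[k])


def play(A, rounds, optimistic):
    """Simultaneous (optimistic) Fictitious Play; returns (row regret, column regret).

    The row player receives A[i][j]; the column player receives -A[i][j].
    """
    n_rows, n_cols = len(A), len(A[0])
    cum_row, cum_col = [0.0] * n_rows, [0.0] * n_cols
    last_row, last_col = [0.0] * n_rows, [0.0] * n_cols
    earned_row = earned_col = 0.0
    bonus = 1.0 if optimistic else 0.0
    for _ in range(rounds):
        # Best response to cumulative payoffs (plus the last payoff if optimistic).
        i = argmax([c + bonus * l for c, l in zip(cum_row, last_row)])
        j = argmax([c + bonus * l for c, l in zip(cum_col, last_col)])
        last_row = [A[k][j] for k in range(n_rows)]   # payoff of each row action
        last_col = [-A[i][k] for k in range(n_cols)]  # payoff of each column action
        cum_row = [c + u for c, u in zip(cum_row, last_row)]
        cum_col = [c + u for c, u in zip(cum_col, last_col)]
        earned_row += A[i][j]
        earned_col -= A[i][j]
    # Regret: best fixed action in hindsight minus the payoff actually earned.
    return max(cum_row) - earned_row, max(cum_col) - earned_col


matching_pennies = [[1, -1], [-1, 1]]
for horizon in (10, 100, 1000):
    print(horizon,
          play(matching_pennies, horizon, optimistic=False),
          play(matching_pennies, horizon, optimistic=True))
```

Printing the regrets for growing horizons gives a quick empirical view of how the two variants behave on one particular game; it is not a substitute for the paper's proof.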

[318] Advanced Hybrid Transformer LSTM Technique with Attention and TS Mixer for Drilling Rate of Penetration Prediction

Saddam Hussain Khan

Main category: cs.LG

TL;DR: A hybrid LSTM-Transformer-Mixer-Attention framework for Rate of Penetration (ROP) prediction that combines sequential memory, static feature interactions, global context learning, and dynamic feature weighting to handle the nonlinear, dynamic, and heterogeneous nature of drilling data.

Details

Motivation: ROP prediction is challenging due to nonlinear, dynamic, and heterogeneous drilling data characteristics. Conventional models rely on oversimplified assumptions or intensive feature engineering, limiting their ability to model long-term dependencies and complex feature interactions.

Method: Hybrid LSTM-Trans-Mixer-Att framework: Custom LSTM for multi-scale temporal dependencies, Enhanced Transformer with drilling-specific positional encodings, parallel TS-Mixer for cross-feature interaction of static/categorical parameters, fusion layer integration, and adaptive attention mechanism for dynamic feature weighting.

Result: The framework achieves an R² of 0.9991 and a MAPE of 1.447% on real-world drilling datasets, significantly outperforming existing baseline and hybrid models.

Conclusion: The proposed framework provides a comprehensive solution for heterogeneous and event-driven drilling dynamics by combining sequential memory, static feature interactions, global context learning, and dynamic feature weighting, demonstrating superior ROP prediction capabilities.

Abstract: Rate of Penetration (ROP) prediction is critical for drilling optimization yet remains challenging due to the nonlinear, dynamic, and heterogeneous characteristics of drilling data. Conventional empirical, physics-based, and standard machine learning models rely on oversimplified assumptions or intensive feature engineering, constraining their capacity to model long-term dependencies and intricate feature interactions. To address these issues, this study presents a new deep learning Hybrid LSTM-Trans-Mixer-Att framework that first processes input data through a customized Long Short-Term Memory (LSTM) network to capture multi-scale temporal dependencies aligned with drilling cycles. Subsequently, an Enhanced Transformer encoder with drilling-specific positional encodings and real-time optimization refines the features. Concurrently, a parallel Time-Series Mixer (TS-Mixer) block facilitates efficient cross-feature interaction modeling of static and categorical parameters, including lithological indices and mud properties. The feature representations extracted from the Enhanced Transformer and TS-Mixer modules are integrated through a dedicated fusion layer. Finally, an adaptive attention mechanism dynamically assigns contextual weights to salient features, enhancing discriminative representation learning and enabling high-fidelity ROP prediction. The proposed framework combines sequential memory, static feature interactions, global context learning, and dynamic feature weighting, providing a comprehensive solution for the heterogeneous and event-driven nature of drilling dynamics. Experimental validation on real-world drilling datasets demonstrates superior performance, achieving an R-squared of 0.9991 and a MAPE of 1.447%, significantly outperforming existing baseline and hybrid models.
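The two reported metrics have standard definitions, sketched below for reference: R-squared is one minus the ratio of residual to total sum of squares, and MAPE is the mean absolute percentage error. The sample values are made up and unrelated to the paper's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for constant targets")
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be non-zero)."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a target is zero")
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical ROP measurements and predictions.
y_true = [10.0, 20.0, 30.0]
y_pred = [11.0, 19.0, 30.0]
print(r_squared(y_true, y_pred), mape(y_true, y_pred))
```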

[319] Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design

Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, Morteza Haghir Chehreghani

Main category: cs.LG

TL;DR: Introduces mini-batch diversification for reinforcement learning to enhance exploration and prevent mode collapse, specifically applied to drug discovery where finding diverse high-quality solutions is crucial.

Details

Motivation: In many real-world applications like drug discovery, evaluating instance quality is costly while proposing new instances is easier. Reinforcement learning requires sufficient exploration to find diverse high-reward solutions, but current methods may suffer from mode collapse.

Method: Proposes mini-batch diversification framework for reinforcement learning, where learning occurs from diverse mini-batches of experiences rather than random sampling.

Result: Extensive evaluation in drug discovery shows the framework substantially enhances solution diversity while maintaining high-quality solutions.

Conclusion: Mini-batch diversification can potentially accelerate drug discovery by enabling faster identification of diverse high-quality compounds to fulfill unmet medical needs.

Abstract: In many real-world applications, evaluating the quality of instances is costly and time-consuming, e.g., human feedback and physics simulations, in contrast to proposing new instances. In particular, this is even more critical in reinforcement learning, since it relies on interactions with the environment (i.e., new instances) that must be evaluated to provide a reward signal for learning. At the same time, performing sufficient exploration is crucial in reinforcement learning to find high-rewarding solutions, meaning that the agent should observe and learn from a diverse set of experiences to find different solutions. Thus, we argue that learning from a diverse mini-batch of experiences can have a large impact on the exploration and help mitigate mode collapse. In this paper, we introduce mini-batch diversification for reinforcement learning and study this framework in the context of a real-world problem, namely, drug discovery. We extensively evaluate how our proposed framework can enhance the effectiveness of chemical exploration in de novo drug design, where finding diverse and high-quality solutions is crucial. Our experiments demonstrate that our proposed diverse mini-batch selection framework can substantially enhance the diversity of solutions while maintaining high-quality solutions. In drug discovery, such an outcome can potentially lead to fulfilling unmet medical needs faster.
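One simple way to pick a diverse mini-batch from a larger pool of candidates is greedy farthest-point selection, which repeatedly adds the candidate that is farthest from everything chosen so far. The abstract does not say which selection rule the authors use, so this is a generic stand-in with hypothetical names; in a chemistry setting the distance could be, for example, a fingerprint-based dissimilarity.

```python
def select_diverse(points, k, distance):
    """Greedy farthest-point selection of k indices, starting from index 0."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [0]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [distance(points[0], p) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, distance(points[nxt], p))
                    for d, p in zip(min_dist, points)]
    return chosen


# Toy one-dimensional "experiences": three near-duplicates and two outliers.
pool = [0.0, 0.1, 0.2, 5.0, 10.0]
batch = select_diverse(pool, k=3, distance=lambda a, b: abs(a - b))
print([pool[i] for i in batch])
```

Random sampling from this pool would often return two or three of the near-duplicates, whereas the greedy rule spreads the batch across the range.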

[320] Ethics-Aware Safe Reinforcement Learning for Rare-Event Risk Control in Interactive Urban Driving

Dianzhao Li, Ostap Okhrin

Main category: cs.LG

TL;DR: A hierarchical Safe Reinforcement Learning framework that integrates ethical reasoning into autonomous vehicle decision-making to protect vulnerable road users while maintaining driving performance.

Details

Motivation: Widespread adoption of autonomous vehicles requires embedding credible and transparent ethical reasoning to protect vulnerable road users (pedestrians and cyclists) in both routine and emergency maneuvers.

Method: Hierarchical Safe RL framework with ethics-aware cost signals combining collision probability and harm severity. Uses risk-sensitive Prioritized Experience Replay for critical events, with polynomial path planning and PID/Stanley controllers for trajectory execution.

Result: Outperforms baseline methods in reducing risk to others while maintaining ego performance and comfort. Decreases conflict frequency by 25-45% compared to matched task successes while keeping comfort metrics within 5%.

Conclusion: Combining formal control theory and data-driven learning advances ethically accountable autonomy that explicitly protects those most at risk in urban traffic environments.

Abstract: Autonomous vehicles hold great promise for reducing traffic fatalities and improving transportation efficiency, yet their widespread adoption hinges on embedding credible and transparent ethical reasoning into routine and emergency maneuvers, particularly to protect vulnerable road users (VRUs) such as pedestrians and cyclists. Here, we present a hierarchical Safe Reinforcement Learning (Safe RL) framework that augments standard driving objectives with ethics-aware cost signals. At the decision level, a Safe RL agent is trained using a composite ethical risk cost, combining collision probability and harm severity, to generate high-level motion targets. A dynamic, risk-sensitive Prioritized Experience Replay mechanism amplifies learning from rare but critical, high-risk events. At the execution level, polynomial path planning coupled with Proportional-Integral-Derivative (PID) and Stanley controllers translates these targets into smooth, feasible trajectories, ensuring both accuracy and comfort. We train and validate our approach on closed-loop simulation environments derived from large-scale, real-world traffic datasets encompassing diverse vehicles, cyclists, and pedestrians, and demonstrate that it outperforms baseline methods in reducing risk to others while maintaining ego performance and comfort. This work provides a reproducible benchmark for Safe RL with explicitly ethics-aware objectives in human-mixed traffic scenarios. Our results highlight the potential of combining formal control theory and data-driven learning to advance ethically accountable autonomy that explicitly protects those most at risk in urban traffic environments. Across two interactive benchmarks and five random seeds, our policy decreases conflict frequency by 25-45% compared to matched task successes while maintaining comfort metrics within 5%.
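
To make the two decision-level ingredients concrete, here is a rough sketch of a composite risk cost and a risk-sensitive replay priority. The abstract only says the cost "combines" collision probability and harm severity, so the product form, the priority formula, and the parameters `alpha`, `risk_weight`, and `eps` are assumptions for illustration rather than the paper's definitions.

```python
from typing import List


def ethical_risk_cost(collision_prob: float, harm_severity: float) -> float:
    """Assumed form: expected harm = probability x severity."""
    if not 0.0 <= collision_prob <= 1.0:
        raise ValueError("collision_prob must lie in [0, 1]")
    return collision_prob * harm_severity


def replay_priority(
    td_error: float,
    risk: float,
    alpha: float = 0.6,
    risk_weight: float = 1.0,
    eps: float = 1e-6,
) -> float:
    """Standard |TD error|^alpha priority, scaled up by risk (hypothetical)."""
    return (abs(td_error) + eps) ** alpha * (1.0 + risk_weight * risk)


def sampling_probabilities(priorities: List[float]) -> List[float]:
    """Normalize priorities into a sampling distribution over transitions."""
    total = sum(priorities)
    return [p / total for p in priorities]


risk = ethical_risk_cost(collision_prob=0.5, harm_severity=0.8)
probs = sampling_probabilities(
    [replay_priority(1.0, 0.0), replay_priority(1.0, risk)]
)
```

With equal TD errors, the transition carrying nonzero risk is sampled more often, which is the qualitative behaviour the abstract describes for rare, high-risk events.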

[321] Learning Latent Graph Geometry via Fixed-Point Schrödinger-Type Activation: A Theoretical Study

Dmitry Pasechnyuk-Vilensky, Martin Takáč

Main category: cs.LG

TL;DR: A unified theoretical framework for neural architectures using stationary Schrödinger-type dynamics on learned latent graphs, with training formulated as stochastic optimization on graph moduli spaces and generalization bounds derived from geometric quantities.

Details

Motivation: To develop a geometrically interpretable and analytically tractable foundation for neural networks that learn latent graph geometry through fixed-point Schrödinger-type activations, unifying various graph-based architectures.

Method: Define neural layers as fixed-point Schrödinger-type equations on weighted Laplacians with convex local potentials. Train via stochastic optimization on stratified moduli spaces of graphs with Kähler-Hessian metrics. Use feed-forward composition equivalent to global stationary diffusion on supra-graphs.

Result: Proved existence, uniqueness, and smooth dependence of equilibria. Established equivalence to norm-preserving Landau-Lifshitz flows. Derived generalization bounds (PAC-Bayes, stability, Rademacher complexity) controlled by geometric quantities like edge count, degree, and Gromov-Hausdorff distortion.

Conclusion: The framework provides a compact, geometrically interpretable foundation for learning latent graph geometry through Schrödinger-type activations, unifying scalar graph, directed, and sheaf-based architectures with analytical tractability.

Abstract: We develop a unified theoretical framework for neural architectures whose internal representations evolve as stationary states of dissipative Schrödinger-type dynamics on learned latent graphs. Each layer is defined by a fixed-point Schrödinger-type equation depending on a weighted Laplacian encoding latent geometry and a convex local potential. We prove existence, uniqueness, and smooth dependence of equilibria, and show that the dynamics are equivalent under the Bloch map to norm-preserving Landau–Lifshitz flows. Training over graph weights and topology is formulated as stochastic optimization on a stratified moduli space of graphs equipped with a natural Kähler–Hessian metric, ensuring convergence and differentiability across strata. We derive generalization bounds – PAC-Bayes, stability, and Rademacher complexity – in terms of geometric quantities such as edge count, maximal degree, and Gromov–Hausdorff distortion, establishing that sparsity and geometric regularity control capacity. Feed-forward composition of stationary layers is proven equivalent to a single global stationary diffusion on a supra-graph; backpropagation is its adjoint stationary system. Finally, directed and vector-valued extensions are represented as sheaf Laplacians with unitary connections, unifying scalar graph, directed, and sheaf-based architectures. The resulting model class provides a compact, geometrically interpretable, and analytically tractable foundation for learning latent graph geometry via fixed-point Schrödinger-type activations.

[322] What Matters in Data for DPO?

Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang

Main category: cs.LG

TL;DR: DPO performance depends primarily on chosen response quality, not rejected response quality. Contrastive learning mainly helps by improving chosen samples, and online DPO reduces to supervised fine-tuning on chosen responses.

Details

Motivation: To systematically study how preference data distribution influences DPO performance and identify what characteristics of preference data are most critical for effective LLM alignment.

Method: Combined theoretical analysis of optimal response distribution under DPO with extensive empirical experiments across diverse tasks, including investigation of online DPO setting and mixing on-policy data.

Result: The quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses has limited impact. Improving chosen response quality consistently boosts performance regardless of rejected response quality.

Conclusion: Focus on improving chosen response quality is key for effective DPO alignment, offering practical insights for constructing high-impact preference datasets for LLM alignment.

Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental question remains open: what characteristics of preference data are most critical for DPO performance? In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. We show that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact. Our theoretical analysis characterizes the optimal response distribution under DPO and reveals how contrastiveness between responses helps primarily by improving the chosen samples. We further study an online DPO setting and show it effectively reduces to supervised fine-tuning on the chosen responses. Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing the on-policy data. Our results interpret the mechanism behind some widely adopted strategies and offer practical insights for constructing high-impact preference datasets for LLM alignment.
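
For readers who want the objective under discussion in front of them, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities under the policy and a frozen reference model. The function and argument names are illustrative; the formula is the usual DPO objective, not anything specific to this paper's analysis.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -1.5, -2.0, beta=1.0)
```

When the policy equals the reference the margin is zero and the loss is log 2; raising the chosen log-ratio or lowering the rejected one reduces it.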

[323] A Perfectly Truthful Calibration Measure

Jason Hartline, Lunjia Hu, Yifan Wu

Main category: cs.LG

TL;DR: ATB is a perfectly truthful calibration measure that prevents predictors from lying about their calibration, with efficient computation and linear-time testing.

Details

Motivation: Existing calibration measures incentivize predictors to lie to appear more calibrated, lacking truthfulness. No perfectly truthful measure existed even in batch settings.

Method: Designed averaged two-bin calibration error (ATB) using variance additivity of independent random variables. Also introduced general recipe for truthful measures.

Result: ATB is perfectly truthful, strictly truthful, sound and complete. Quadratically related to existing measures smCal and distCal. Enables first linear-time calibration testing algorithm.

Conclusion: ATB provides the first perfectly truthful calibration measure with efficient computation, solving the truthfulness problem in batch calibration evaluation.

Abstract: Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. A calibration measure quantifies how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Predicting the true probabilities guarantees perfect calibration, but in reality, when calibration is evaluated on a random sample, all known calibration measures incentivize predictors to lie in order to appear more calibrated. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE.
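
The abstract names quantile-binned l_2-ECE without defining it, and the definition of ATB itself is not given here either. As a rough illustration of the binned family only, the sketch below uses a common textbook form: sort by prediction, cut into equal-size bins, and sum the bin-weighted squared gap between mean prediction and mean outcome. The exact normalization used in the paper may differ, and this is not ATB.

```python
from typing import Sequence


def quantile_binned_l2_ece(
    preds: Sequence[float], labels: Sequence[float], num_bins: int = 2
) -> float:
    """Squared l_2 calibration error over equal-size (quantile) bins.

    Assumed textbook form; not taken from the paper.
    """
    n = len(preds)
    if n == 0 or n != len(labels):
        raise ValueError("preds and labels must be non-empty and equal length")
    order = sorted(range(n), key=lambda i: preds[i])
    total = 0.0
    for b in range(num_bins):
        idx = order[b * n // num_bins : (b + 1) * n // num_bins]
        if not idx:
            continue
        mean_pred = sum(preds[i] for i in idx) / len(idx)
        mean_label = sum(labels[i] for i in idx) / len(idx)
        total += (len(idx) / n) * (mean_pred - mean_label) ** 2
    return total


underconfident = quantile_binned_l2_ece([0.25, 0.25, 0.75, 0.75], [0, 0, 1, 1])
calibrated = quantile_binned_l2_ece([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
```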

[324] Boardwalk: Towards a Framework for Creating Board Games with LLMs

Álvaro Guglielmin Becker, Gabriel Bauer de Oliveira, Lana Bertoldo Rossato, Anderson Rocha Tavares

Main category: cs.LG

TL;DR: LLMs can generate playable board game code from natural language rules, with Claude 3.7 Sonnet achieving 55.6% error-free implementations.

Details

Motivation: To investigate whether LLMs can implement digital versions of board games from natural language rules, aiming for an LLM-assisted framework for quick board game code generation.

Method: Tested three state-of-the-art LLMs (Claude, DeepSeek, ChatGPT) on 12 popular and obscure games, both in free-form code and within the Boardwalk API, with games anonymized to avoid evoking pre-trained knowledge.

Result: The approach proved viable with Claude 3.7 Sonnet yielding 55.6% of games without any errors. API compliance increased error frequency but error severity depended more on the LLM.

Conclusion: LLMs show promise for board game implementation from natural language, with future work needed to create an integrated framework for more accessible board game development.

Abstract: Implementing board games in code can be a time-consuming task. However, Large Language Models (LLMs) have been proven effective at generating code for domain-specific tasks with simple contextual information. We aim to investigate whether LLMs can implement digital versions of board games from rules described in natural language. This would be a step towards an LLM-assisted framework for quick board game code generation. We expect to determine the main challenges for LLMs to implement the board games, and how different approaches and models compare to one another. We task three state-of-the-art LLMs (Claude, DeepSeek and ChatGPT) with coding a selection of 12 popular and obscure games in free-form and within Boardwalk, our proposed General Game Playing API. We anonymize the games and components to avoid evoking pre-trained LLM knowledge. The implementations are tested for playability and rule compliance. We evaluate success rate and common errors across LLMs and game popularity. Our approach proves viable, with the best performing model, Claude 3.7 Sonnet, yielding 55.6% of games without any errors. While compliance with the API increases error frequency, the severity of errors is more significantly dependent on the LLM. We outline future steps for creating a framework to integrate this process, making the elaboration of board games more accessible.

[325] TimeCopilot

Azul Garza, Renée Rosillo

Main category: cs.LG

TL;DR: TimeCopilot is the first open-source agentic framework that combines Time Series Foundation Models with LLMs through a unified API for automated forecasting pipelines with natural language explanations.

Details

Motivation: To create a practical, reproducible, and accessible agentic forecasting system that automates the entire forecasting pipeline while providing explainable results through natural language.

Method: Combines multiple Time Series Foundation Models with Large Language Models through a unified API, automating feature analysis, model selection, cross-validation, and forecast generation. The framework is LLM-agnostic and supports ensembles across diverse forecasting families.

Result: Achieves state-of-the-art probabilistic forecasting performance on the GIFT-Eval benchmark at low cost.

Conclusion: TimeCopilot provides a practical foundation for reproducible, explainable, and accessible agentic forecasting systems.

Abstract: We introduce TimeCopilot, the first open-source agentic framework for forecasting that combines multiple Time Series Foundation Models (TSFMs) with Large Language Models (LLMs) through a single unified API. TimeCopilot automates the forecasting pipeline: feature analysis, model selection, cross-validation, and forecast generation, while providing natural language explanations and supporting direct queries about the future. The framework is LLM-agnostic, compatible with both commercial and open-source models, and supports ensembles across diverse forecasting families. Results on the large-scale GIFT-Eval benchmark show that TimeCopilot achieves state-of-the-art probabilistic forecasting performance at low cost. Our framework provides a practical foundation for reproducible, explainable, and accessible agentic forecasting systems.
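
The model-selection and cross-validation steps of such a pipeline can be illustrated with rolling-origin evaluation over a few candidate forecasters. This sketch does not use or imitate the TimeCopilot API; the two toy forecasters, the function names, and the MAE criterion are assumptions chosen to keep the example self-contained.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Forecaster = Callable[[Sequence[float], int], List[float]]


def naive_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Repeat the last observed value."""
    return [history[-1]] * horizon


def mean_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Repeat the historical mean."""
    return [sum(history) / len(history)] * horizon


def rolling_origin_mae(
    series: Sequence[float], forecaster: Forecaster, horizon: int, n_windows: int
) -> float:
    """Mean absolute error over n_windows non-overlapping test windows."""
    errors: List[float] = []
    for w in range(n_windows):
        cutoff = len(series) - horizon * (n_windows - w)
        if cutoff <= 0:
            raise ValueError("series too short for the requested windows")
        preds = forecaster(series[:cutoff], horizon)
        actual = series[cutoff : cutoff + horizon]
        errors.extend(abs(p - a) for p, a in zip(preds, actual))
    return sum(errors) / len(errors)


def select_model(
    series: Sequence[float],
    models: Dict[str, Forecaster],
    horizon: int = 2,
    n_windows: int = 2,
) -> Tuple[str, Dict[str, float]]:
    """Return the name of the lowest-MAE model and all scores."""
    scores = {
        name: rolling_origin_mae(series, f, horizon, n_windows)
        for name, f in models.items()
    }
    return min(scores, key=scores.get), scores


best, scores = select_model(
    [1, 2, 3, 4, 5, 6, 7, 8], {"naive": naive_forecast, "mean": mean_forecast}
)
```

On this trending toy series the naive forecaster wins, because the historical mean lags further behind the most recent values.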

[326] A Certifiable Machine Learning-Based Pipeline to Predict Fatigue Life of Aircraft Structures

Ángel Ladrón, Miguel Sánchez-Domínguez, Javier Rozalén, Fernando R. Sánchez, Javier de Vicente, Lucas Lacasa, Eusebio Valero, Gonzalo Rubio

Main category: cs.LG

TL;DR: Machine learning pipeline for aircraft wing fatigue life prediction using flight parameters, complementing traditional methods by reducing computational costs.

Details

Motivation: Traditional fatigue life prediction methods are time-consuming, require complex workflows and multiple teams. ML offers faster iterations and generalization to complement conventional simulations.

Method: ML-based pipeline that estimates fatigue life of aircraft wing locations using flight parameters from different missions throughout operational life.

Result: Accurate fatigue life predictions with thorough statistical validation and uncertainty quantification in realistic use cases.

Conclusion: The pipeline reduces costly simulations and lowers computational/human resources while maintaining accuracy, serving as a valuable complement to traditional methodologies.

Abstract: Fatigue life prediction is essential in both the design and operational phases of any aircraft, and in this sense safety in the aerospace industry requires early detection of fatigue cracks to prevent in-flight failures. Robust and precise fatigue life predictors are thus essential to ensure safety. Traditional engineering methods, while reliable, are time-consuming and involve complex workflows, including steps such as conducting several Finite Element Method (FEM) simulations, deriving the expected loading spectrum, and applying cycle counting techniques like peak-valley or rainflow counting. These steps often require collaboration between multiple teams and tools, which adds to the computational time and effort required to achieve fatigue life predictions. Machine learning (ML) offers a promising complement to traditional fatigue life estimation methods, enabling faster iterations and generalization, providing quick estimates that guide decisions alongside conventional simulations. In this paper, we present an ML-based pipeline that aims to estimate the fatigue life of different aircraft wing locations given the flight parameters of the different missions that the aircraft will be operating throughout its operational life. We validate the pipeline in a realistic use case of fatigue life estimation, yielding accurate predictions alongside a thorough statistical validation and uncertainty quantification. Our pipeline constitutes a complement to traditional methodologies by reducing the amount of costly simulations and, thereby, lowering the required computational and human resources.

[327] Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization

Nathan Egbuna, Saatvik Gaur, Sunishchal Dev, Ashwinee Panda, Maheep Chaudhary

Main category: cs.LG

TL;DR: ALS is an amortized latent steering method that replaces expensive iterative test-time optimization with a single offline-computed vector, achieving 2-5x speedup while matching or surpassing baseline performance.

Details

Motivation: Current test-time optimization methods are impractical at scale due to prohibitive inference costs from iterative refinement and multi-step verification, requiring 10-100x more compute per query than standard decoding.

Method: ALS computes the mean difference between hidden states from successful versus unsuccessful generations offline, then uses this direction to calibrate the model’s hidden representations during inference at constant cost.

Result: Across GSM8K and MATH-500 benchmarks, ALS achieves 2-5x speedup over iterative methods while matching or surpassing greedy Chain-of-Thought and Self-Consistency baselines, with up to 101% improvement in efficiency-accuracy trade-off.

Conclusion: Much of latent optimization’s benefit can be captured offline, making sophisticated reasoning techniques viable for production deployment without the computational overhead of per-query optimization loops.

Abstract: Test-time optimization remains impractical at scale due to prohibitive inference costs: techniques like iterative refinement and multi-step verification can require 10-100x more compute per query than standard decoding. Latent space test-time optimization methods like LatentSeek offer a more direct approach by steering hidden representations, but still demand expensive per-query optimization loops with multiple backward passes. We propose Amortized Latent Steering (ALS), which collapses this iterative optimization into a single offline-computed vector applied at constant cost during inference. ALS computes the mean difference between hidden states from successful versus unsuccessful generations, then uses this direction to calibrate the model’s hidden representations: when decoding drifts away from the success manifold, ALS nudges activations back toward it. Across GSM8K and MATH-500 benchmarks, ALS achieves 2-5x speedup over iterative methods while matching or surpassing greedy Chain-of-Thought (CoT) and Self-Consistency baselines, yielding up to 101% improvement in efficiency–accuracy trade-off. These results show that much of latent optimization’s benefit can be captured offline, making sophisticated reasoning techniques viable for production deployment. Code is available at https://github.com/negbuna/ALS.
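
The mean-difference step can be sketched in a few lines. The offline part follows the abstract directly; the online rule (nudge only when cosine similarity to the direction falls below a threshold) and the parameters `alpha` and `threshold` are assumptions about what "drifts away from the success manifold" might mean in practice, not the released implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def steering_vector(success: Sequence[Vector], failure: Sequence[Vector]) -> List[float]:
    """Offline: mean hidden state of successes minus mean of failures."""
    mu_s, mu_f = mean_vector(success), mean_vector(failure)
    return [s - f for s, f in zip(mu_s, mu_f)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def steer(hidden: Vector, direction: Vector, alpha: float = 0.5,
          threshold: float = 0.0) -> List[float]:
    """Online (hypothetical rule): nudge only when alignment is below threshold."""
    if cosine(hidden, direction) >= threshold:
        return list(hidden)
    return [h + alpha * d for h, d in zip(hidden, direction)]


direction = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
nudged = steer([0.0, 1.0], direction)
untouched = steer([1.0, 0.0], direction)
```

The per-token cost is one dot product and one vector addition, which is what makes the approach constant-cost at inference.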

[328] PPG-Distill: Efficient Photoplethysmography Signals Analysis via Foundation Model Distillation

Juntong Ni, Saurabh Kataria, Shengpu Tang, Carl Yang, Xiao Hu, Wei Jin

Main category: cs.LG

TL;DR: PPG-Distill is a knowledge distillation framework that enables efficient PPG analysis on wearable devices by transferring knowledge from large foundation models to smaller student models through multi-level distillation.

Details

Motivation: Large PPG foundation models are difficult to deploy on resource-limited wearable devices due to computational constraints, creating a need for efficient model compression techniques.

Method: Uses knowledge distillation with prediction-, feature-, and patch-level distillation, incorporating morphology distillation for local waveform patterns and rhythm distillation for inter-patch temporal structures.

Result: Improves student performance by up to 21.8% on heart rate estimation and atrial fibrillation detection, achieves 7X faster inference and reduces memory usage by 19X.

Conclusion: PPG-Distill enables efficient PPG analysis on wearable devices while maintaining high performance through effective knowledge transfer from large foundation models.

Abstract: Photoplethysmography (PPG) is widely used in wearable health monitoring, yet large PPG foundation models remain difficult to deploy on resource-limited devices. We present PPG-Distill, a knowledge distillation framework that transfers both global and local knowledge through prediction-, feature-, and patch-level distillation. PPG-Distill incorporates morphology distillation to preserve local waveform patterns and rhythm distillation to capture inter-patch temporal structures. On heart rate estimation and atrial fibrillation detection, PPG-Distill improves student performance by up to 21.8% while achieving 7X faster inference and reducing memory usage by 19X, enabling efficient PPG analysis on wearables.
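
As a rough picture of how prediction-level and feature-level terms are usually combined, the sketch below adds two distillation penalties to a task loss. The mean-squared-error form, the weights `w_pred` and `w_feat`, and the assumption that student and teacher features share a dimension are illustrative; the morphology and rhythm terms described in the abstract are not reproduced here.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_pred: Sequence[float],
    teacher_pred: Sequence[float],
    target: Sequence[float],
    student_feat: Sequence[float],
    teacher_feat: Sequence[float],
    w_pred: float = 1.0,
    w_feat: float = 0.5,
) -> float:
    """Task loss plus prediction- and feature-level distillation terms."""
    task = mse(student_pred, target)           # fit the ground truth
    pred_kd = mse(student_pred, teacher_pred)  # match the teacher's output
    feat_kd = mse(student_feat, teacher_feat)  # match the teacher's features
    return task + w_pred * pred_kd + w_feat * feat_kd


loss = distillation_loss(
    student_pred=[70.0], teacher_pred=[72.0], target=[71.0],
    student_feat=[1.0, 2.0], teacher_feat=[1.0, 4.0],
)
```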

[329] Fine-Tuning Masked Diffusion for Provable Self-Correction

Jaeyeon Kim, Seunggeun Kim, Taekyun Lee, David Z. Pan, Hyeji Kim, Sham Kakade, Sitan Chen

Main category: cs.LG

TL;DR: PRISM is a lightweight, model-agnostic method for self-correction in Masked Diffusion Models that learns per-token quality scores without requiring architectural changes or reinforcement learning.

Details

Motivation: Current approaches for self-correction in Masked Diffusion Models either require major architectural/training modifications or rely on imprecise proxies for token quality, limiting their practical applicability.

Method: PRISM defines a self-correction loss that provably learns per-token quality scores in the same forward pass with MDM, using these scores to detect and correct low-quality tokens during inference.

Result: PRISM advances MDM inference performance across multiple domains including Sudoku puzzles, unconditional text generation (170M parameters), and code generation with LLaDA (8B parameters).

Conclusion: PRISM provides an effective, lightweight solution for inference-time self-correction in Masked Diffusion Models that works with any pretrained model without requiring architectural changes.

Abstract: A natural desideratum for generative models is self-correction: detecting and revising low-quality tokens at inference. While Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces, their capacity for self-correction remains poorly understood. Prior attempts to incorporate self-correction into MDMs either require overhauling MDM architectures/training or rely on imprecise proxies for token quality, limiting their applicability. Motivated by this, we introduce PRISM (Plug-in Remasking for Inference-time Self-correction of Masked Diffusions), a lightweight, model-agnostic approach that applies to any pretrained MDM. Theoretically, PRISM defines a self-correction loss that provably learns per-token quality scores, without RL or a verifier. These quality scores are computed in the same forward pass with MDM and used to detect low-quality tokens. Empirically, PRISM advances MDM inference across domains and scales: Sudoku; unconditional text (170M); and code with LLaDA (8B).
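
The inference-time use of the quality scores can be pictured as a remasking step: tokens the model scores as least reliable are turned back into mask tokens so a later denoising step can refill them. The sketch below assumes the scores are already available; the `MASK` string, the fixed `num_remask` budget, and the function name are hypothetical.

```python
from typing import List, Sequence

MASK = "<mask>"  # placeholder mask token, assumed for illustration


def remask_lowest_quality(
    tokens: Sequence[str], quality: Sequence[float], num_remask: int
) -> List[str]:
    """Replace the num_remask lowest-scoring unmasked tokens with MASK."""
    candidates = [i for i, t in enumerate(tokens) if t != MASK]
    worst = sorted(candidates, key=lambda i: quality[i])[:num_remask]
    out = list(tokens)
    for i in worst:
        out[i] = MASK
    return out


revised = remask_lowest_quality(
    ["a", "b", MASK, "d"], quality=[0.9, 0.1, 0.0, 0.5], num_remask=1
)
```

Already-masked positions are skipped, so the budget is spent only on tokens that were actually committed.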

Jose Tupayachi, Mustafa C. Camur, Kevin Heaslip, Xueping Li

Main category: cs.LG

TL;DR: TW-GCN framework combines Graph Convolutional Networks with temporal models to predict EV charging demand using traffic, weather, and infrastructure data, achieving best performance with 3-hour forecasts and 1DCNN temporal model.

Details

Motivation: Transition to electric vehicles faces challenges from uneven charging infrastructure distribution and utilization, creating issues for power grid stability and investment planning.

Method: TW-GCN spatio-temporal forecasting framework combining Graph Convolutional Networks with temporal architectures, using real-world traffic flows, weather conditions, and proprietary EV infrastructure data.

Result: Mid-horizon (3-hour) forecasts achieve best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional analysis shows predictive accuracy disparities across Tennessee regions.

Conclusion: TW-GCN framework advances integration of data-driven intelligence into EV infrastructure planning, supporting sustainable mobility transitions and resilient grid management.

Abstract: Transportation remains a major contributor to greenhouse gas emissions, highlighting the urgency of transitioning toward sustainable alternatives such as electric vehicles (EVs). Yet, uneven spatial distribution and irregular utilization of charging infrastructure create challenges for both power grid stability and investment planning. This study introduces TW-GCN, a spatio-temporal forecasting framework that combines Graph Convolutional Networks with temporal architectures to predict EV charging demand in Tennessee, United States (U.S.). We utilize real-world traffic flows, weather conditions, and proprietary data provided by one of the largest EV infrastructure companies in the U.S. to capture both spatial dependencies and temporal dynamics. Extensive experiments across varying lag horizons, clustering strategies, and sequence lengths reveal that mid-horizon (3-hour) forecasts achieve the best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional analysis shows disparities in predictive accuracy across East, Middle, and West Tennessee, reflecting how station density, population, and local demand variability shape model performance. The proposed TW-GCN framework advances the integration of data-driven intelligence into EV infrastructure planning, supporting both sustainable mobility transitions and resilient grid management.
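
The spatial half of such a model rests on the graph convolution itself. The sketch below implements one standard GCN layer (symmetric normalization with self-loops, followed by ReLU) on plain lists; it shows the generic building block only and makes no claim about how TW-GCN wires it to traffic, weather, or the temporal model.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """ReLU(D^-1/2 (A + I) D^-1/2 X W) for one layer."""
    n = len(adj)
    # Add self-loops so each node keeps its own signal
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [
        [a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]
    out = matmul(matmul(norm, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


# Two connected stations with one demand feature each
hidden = gcn_layer(
    adj=[[0.0, 1.0], [1.0, 0.0]], features=[[1.0], [3.0]], weights=[[1.0]]
)
```

Each station's output is a degree-normalized average of its own feature and its neighbours', which is how spatial dependencies enter before any temporal model is applied.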

[331] Bidirectional Time-Frequency Pyramid Network for Enhanced Robust EEG Classification

Jiahui Hong, Siqing Li, Muqing Jian, Luming Yang

Main category: cs.LG

TL;DR: BITE is a bidirectional time-frequency pyramid network that achieves state-of-the-art EEG recognition performance across multiple paradigms with superior cross-subject generalization.

Details

Motivation: Existing EEG recognition models suffer from poor cross-paradigm generalization due to dataset-specific constraints and individual variability.

Method: End-to-end unified architecture with aligned time-frequency streams, pyramid time-frequency attention (PTFA), bidirectional adaptive convolutions (BiTCN), and learnable fusion for forward/backward neural dynamics.

Result: Achieves state-of-the-art performance across four divergent paradigms (BCICIV-2A/2B, HGD, SD-SSVEP), excelling in both within-subject accuracy and cross-subject generalization with exceptional computational efficiency.

Conclusion: Paradigm-aligned spectral-temporal processing is essential for reliable BCI systems, and BITE successfully combines robust performance across both MI and SSVEP tasks.

Abstract: Existing EEG recognition models suffer from poor cross-paradigm generalization due to dataset-specific constraints and individual variability. To overcome these limitations, we propose BITE (Bidirectional Time-Freq Pyramid Network), an end-to-end unified architecture featuring robust multistream synergy, pyramid time-frequency attention (PTFA), and bidirectional adaptive convolutions. The framework uniquely integrates: 1) Aligned time-frequency streams maintaining temporal synchronization with STFT for bidirectional modeling, 2) PTFA-based multi-scale feature enhancement amplifying critical neural patterns, 3) BiTCN with learnable fusion capturing forward/backward neural dynamics. Demonstrating enhanced robustness, BITE achieves state-of-the-art performance across four divergent paradigms (BCICIV-2A/2B, HGD, SD-SSVEP), excelling in both within-subject accuracy and cross-subject generalization. As a unified architecture, it combines robust performance across both MI and SSVEP tasks with exceptional computational efficiency. Our work validates that paradigm-aligned spectral-temporal processing is essential for reliable BCI systems. Just as its name suggests, BITE “takes a bite out of EEG.” The source code is available at https://github.com/cindy-hong/BiteEEG.

[332] LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits

Amir Reza Mirzaei, Yuqiao Wen, Yanshuai Cao, Lili Mou

Main category: cs.LG

TL;DR: LoRAQuant is a mixed-precision quantization method for LoRA adapters that uses SVD reparameterization to enable ultra-low bitwidth quantization while maintaining performance.

Details

Motivation: Multiple LoRA adapters are used simultaneously for LLM customization, but their aggregate computational cost becomes substantial at scale, requiring efficient quantization methods.

Method: Reparameterizes each LoRA adapter using SVD to concentrate important information into specific rows and columns, then applies mixed-precision quantization with higher precision for important components and ultra-low bitwidth for the rest.

Result: LoRAQuant uses significantly lower bits than other quantization methods while achieving comparable or even higher performance on mathematical reasoning, coding, and summarization tasks with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models.

Conclusion: LoRAQuant effectively reduces the computational cost of multiple LoRA adapters through intelligent mixed-precision quantization while maintaining model performance.

Abstract: Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that our LoRAQuant uses significantly lower bits than other quantization methods, but achieves comparable or even higher performance.

[333] Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Angélica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali

Main category: cs.LG

TL;DR: Flashlight is a compiler-native framework in PyTorch that automatically generates efficient FlashAttention-style kernels for arbitrary attention variants, outperforming FlexAttention in flexibility and supporting data-dependent attention patterns.

Details

Motivation: Existing attention optimization methods like FlashAttention and FlexAttention have limitations - FlashAttention requires specialized kernels, while FlexAttention uses static templates that can't handle data-dependent attention patterns. There's a need for a more flexible solution that supports arbitrary attention variants without sacrificing performance.

Method: Flashlight integrates with PyTorch’s compilation workflow to automatically fuse and tile attention computations. It generates optimized kernels on-the-fly without relying on static templates or predefined specializations, supporting both FlexAttention-compatible variants and more general data-dependent attention formulations.

Result: Flashlight produces kernels with competitive or superior performance compared to FlexAttention. It supports all variants expressible in FlexAttention plus more general data-dependent attention patterns that FlexAttention cannot handle.

Conclusion: Flashlight provides a flexible, high-performance solution for attention optimization that enables developers to rapidly explore new attention models using native PyTorch code without performance trade-offs, addressing limitations of previous approaches.

Abstract: Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch’s compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
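
For readers unfamiliar with what a FlashAttention-style kernel computes, the following NumPy sketch shows the tiling and online-softmax pattern that such kernels fuse. It is not Flashlight's generated code and does not use the PyTorch compiler; the function names and the `block` parameter are hypothetical, and the sketch handles a single head without masking.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference softmax(q k^T / sqrt(d)) v that materializes the full score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, block=4):
    """Process keys/values one tile at a time, keeping running softmax statistics."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))
    row_max = np.full(n, -np.inf)  # running maximum of the scores per query
    row_sum = np.zeros(n)          # running softmax denominator per query
    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) * scale                  # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)  # rescales earlier partial results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]
```

The tiled version never stores the full score matrix, which is the property a fused kernel exploits; both functions should agree up to floating-point error.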

[334] Periodic Skill Discovery

Jonghae Park, Daesol Cho, Jusuk Lee, Dongseok Shim, Inkyu Jang, H. Jin Kim

Main category: cs.LG

TL;DR: Proposes Periodic Skill Discovery (PSD) - an unsupervised RL framework that discovers diverse periodic behaviors by mapping states to a circular latent space, enabling learning of skills with varying periods for robotic tasks.

Details

Motivation: Current unsupervised skill discovery methods overlook the periodic nature of the behaviors needed for robotic locomotion tasks, focusing instead on mutual information or latent-space distance metrics.

Method: Trains an encoder to map states to a circular latent space that naturally encodes periodicity, capturing temporal distance to learn skills with diverse periods.

Result: PSD effectively learns periodic skills in complex robotic tasks with pixel observations, achieves high performance on downstream tasks like hurdling, and when combined with existing methods provides more diverse behaviors.

Conclusion: The circular latent space approach enables discovery of diverse periodic skills essential for robotic locomotion, demonstrating practical utility in downstream tasks and enhanced behavioral diversity.

Abstract: Unsupervised skill discovery in reinforcement learning (RL) aims to learn diverse behaviors without relying on external rewards. However, current methods often overlook the periodic nature of learned skills, focusing instead on increasing the mutual dependence between states and skills or maximizing the distance traveled in latent space. Considering that many robotic tasks - particularly those involving locomotion - require periodic behaviors across varying timescales, the ability to discover diverse periodic skills is essential. Motivated by this, we propose Periodic Skill Discovery (PSD), a framework that discovers periodic behaviors in an unsupervised manner. The key idea of PSD is to train an encoder that maps states to a circular latent space, thereby naturally encoding periodicity in the latent representation. By capturing temporal distance, PSD can effectively learn skills with diverse periods in complex robotic tasks, even with pixel-based observations. We further show that these learned skills achieve high performance on downstream tasks such as hurdling. Moreover, integrating PSD with an existing skill discovery method offers more diverse behaviors, thus broadening the agent’s repertoire. Our code and demos are available at https://jonghaepark.github.io/psd/
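
A small sketch may help show why a circular latent space suits periodic behavior. PSD learns its encoder from states; the code below instead hand-codes an embedding of a time index with a known period, so it illustrates only the geometry, not the method. The function names and the use of a 2D unit circle are assumptions.

```python
import math


def circle_embed(t, period):
    """Place time step t on the unit circle for a behavior with the given period."""
    angle = 2.0 * math.pi * t / period
    return (math.cos(angle), math.sin(angle))


def chord_distance(a, b):
    """Euclidean distance between two points on the circle."""
    return math.hypot(a[0] - b[0], a[1] - b[1])
```

Under this embedding, two steps exactly one period apart coincide, and two steps half a period apart sit at opposite ends of a diameter, so latent distance reflects temporal distance modulo the period.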

[335] On Joint Regularization and Calibration in Deep Ensembles

Laurits Fredsgaard, Mikkel N. Schmidt

Main category: cs.LG

TL;DR: Joint tuning of deep ensembles improves performance and uncertainty calibration, with a proposed overlapping holdout strategy balancing data usage and evaluation.

Details

Motivation: To investigate the impact of jointly tuning ensemble components (weight decay, temperature scaling, early stopping) rather than tuning models individually, aiming to improve both predictive performance and uncertainty quantification.

Method: Proposed joint tuning of weight decay, temperature scaling, and early stopping for deep ensembles, along with a partially overlapping holdout strategy to enable joint evaluation while maximizing training data usage.

Result: Joint tuning generally matches or improves performance compared to individual tuning, with significant variation in effect size across different tasks and metrics. The overlapping holdout strategy provides a practical compromise solution.

Conclusion: Joint optimization in deep ensemble training offers performance benefits, with the overlapping holdout strategy being an attractive practical approach for practitioners optimizing deep ensemble models.

Abstract: Deep ensembles are a powerful tool in machine learning, improving both model performance and uncertainty calibration. While ensembles are typically formed by training and tuning models individually, evidence suggests that jointly tuning the ensemble can lead to better performance. This paper investigates the impact of jointly tuning weight decay, temperature scaling, and early stopping on both predictive performance and uncertainty quantification. Additionally, we propose a partially overlapping holdout strategy as a practical compromise between enabling joint evaluation and maximizing the use of data for training. Our results demonstrate that jointly tuning the ensemble generally matches or improves performance, with significant variation in effect size across different tasks and metrics. We highlight the trade-offs between individual and joint optimization in deep ensemble training, with the overlapping holdout strategy offering an attractive practical solution. We believe our findings provide valuable insights and guidance for practitioners looking to optimize deep ensemble models. Code is available at: https://github.com/lauritsf/ensemble-optimality-gap
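
The difference between individual and joint tuning can be sketched for temperature scaling alone. This is one plausible reading of the setup, not the authors' code: the function names, the grid search, and the choice of a single shared temperature are assumptions. Individual tuning picks a temperature per member by that member's own holdout NLL, while joint tuning picks the temperature that minimizes the NLL of the averaged ensemble prediction.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def nll(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def ensemble_nll(member_logits, labels, temperature):
    """NLL of the averaged prediction; member_logits has shape (members, samples, classes)."""
    probs = softmax(member_logits / temperature).mean(axis=0)
    return nll(probs, labels)


def tune_jointly(member_logits, labels, grid):
    """One shared temperature, scored on the ensemble prediction."""
    return min(grid, key=lambda t: ensemble_nll(member_logits, labels, t))


def tune_individually(member_logits, labels, grid):
    """One temperature per member, each scored on that member alone."""
    return [
        min(grid, key=lambda t: nll(softmax(m / t), labels))
        for m in member_logits
    ]
```

Joint tuning needs holdout examples that every member has not trained on, which is the constraint the partially overlapping holdout strategy is meant to ease.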

cs.MA

[336] TAMAS: Benchmarking Adversarial Risks in Multi-Agent LLM Systems

Ishan Kavathekar, Hemang Jain, Ameya Rathod, Ponnurangam Kumaraguru, Tanuja Ganu

Main category: cs.MA

TL;DR: TAMAS is a benchmark for evaluating safety and robustness of multi-agent LLM systems against adversarial attacks, revealing significant vulnerabilities in current deployments.

Details

Motivation: Existing safety benchmarks focus on single-agent settings, failing to address unique vulnerabilities in multi-agent coordination and dynamics, creating a critical gap in security evaluation.

Method: Developed TAMAS benchmark with 300 adversarial instances across 5 scenarios, 6 attack types, 211 tools, and 100 harmless tasks. Evaluated 10 LLMs across 3 agent interaction configurations using Effective Robustness Score (ERS).

Result: Multi-agent LLM systems are highly vulnerable to adversarial attacks, with critical challenges and failure modes identified across different frameworks and models.

Conclusion: There is an urgent need for stronger defenses in multi-agent LLM systems. TAMAS provides a foundational framework for systematic safety improvement and vulnerability assessment.

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents through tool use, planning, and decision-making abilities, leading to their widespread adoption across diverse tasks. As task complexity grows, multi-agent LLM systems are increasingly used to solve problems collaboratively. However, the safety and security of these systems remain largely under-explored. Existing benchmarks and datasets predominantly focus on single-agent settings, failing to capture the unique vulnerabilities of multi-agent dynamics and coordination. To address this gap, we introduce Threats and Attacks in Multi-Agent Systems (TAMAS), a benchmark designed to evaluate the robustness and safety of multi-agent LLM systems. TAMAS includes five distinct scenarios comprising 300 adversarial instances across six attack types and 211 tools, along with 100 harmless tasks. We assess system performance across ten backbone LLMs and three agent interaction configurations from Autogen and CrewAI frameworks, highlighting critical challenges and failure modes in current multi-agent deployments. Furthermore, we introduce Effective Robustness Score (ERS) to assess the tradeoff between safety and task effectiveness of these frameworks. Our findings show that multi-agent systems are highly vulnerable to adversarial attacks, underscoring the urgent need for stronger defenses. TAMAS provides a foundation for systematically studying and improving the safety of multi-agent LLM systems.

cs.MM

[337] Automating Geotechnical Reports for Rock Masses with AI (original title: Automatización de Informes Geotécnicos para Macizos Rocosos con IA)

Christofer Valencia, Alexis Llumigusín, Silvia Alvarez, Abrahan Arias, Christian Mejia-Escobar

Main category: cs.MM

TL;DR: AI-based system for automatic geotechnical report generation using multimodal LLM processing of rock outcrop images and field data, replacing traditional manual methods.

Details

Motivation: Traditional geotechnical reporting is slow, error-prone, and subjective due to manual field observations with basic tools like compasses and notebooks.

Method: Used photographs of rock outcrops and manual samples with descriptions, along with course reports, for prompt engineering and validation of a multimodal LLM without fine-tuning.

Result: Achieved BLEU score of 0.455 and ROUGE-L score of 0.653, indicating automatic descriptions are comparable to expert-made ones. Web-accessible tool with intuitive interface and export capabilities.

Conclusion: The AI system represents an innovative contribution for geology professionals and students, providing efficient and standardized geotechnical reporting.

Abstract: Geotechnical reports are crucial for assessing the stability of rock formations and ensuring safety in modern engineering. Traditionally, these reports are prepared manually based on field observations using compasses, magnifying glasses, and notebooks. This method is slow, prone to errors, and subjective in its interpretations. To overcome these limitations, the use of artificial intelligence techniques is proposed for the automatic generation of reports through the processing of images and field data. The methodology was based on the collection of photographs of rock outcrops and manual samples with their respective descriptions, as well as on the reports prepared during the Geotechnical Studies course. These resources were used to define the report outline, prompt engineering, and validate the responses of a multimodal large language model (MLLM). The iterative refinement of prompts until structured and specific instructions were obtained for each section of the report proved to be an effective alternative to the costly process of fine-tuning the MLLM. The system evaluation establishes values of 0.455 and 0.653 for the BLEU and ROUGE-L metrics, respectively, suggesting that automatic descriptions are comparable to those made by experts. This tool, accessible via the web, with an intuitive interface and the ability to export to standardized formats, represents an innovation and an important contribution for professionals and students of field geology.
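
To make the ROUGE-L figure easier to interpret, here is a minimal sketch of how such a score is computed from the longest common subsequence (LCS) of two token sequences. The abstract does not say which ROUGE-L variant or tokenizer was used; this sketch assumes lowercased whitespace tokens and the balanced F1 form, and the function names and sample sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the harmonic mean of LCS precision and LCS recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, "the rock mass is weathered" against "the rock is highly weathered" shares the four-token subsequence "the rock is weathered", giving precision and recall of 4/5 and a score of 0.8.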

[338] On the Brittleness of CLIP Text Encoders

Allie Tran, Luca Rossetto

Main category: cs.MM

TL;DR: Analysis of CLIP model robustness against non-semantic query perturbations in multimedia retrieval, showing syntactic and semantic changes cause largest instabilities.

Details

Motivation: CLIP models trained on contrastive alignment lack stability towards small input perturbations, especially problematic for manually expressed queries where minor variations can significantly alter search results.

Method: Systematic analysis of lexical, syntactic, and semantic query perturbations using TRECVID Ad-Hoc Video Search queries and V3C1 video collection across multiple CLIP variants.

Result: Syntactic and semantic perturbations cause largest instabilities, while brittleness is concentrated in trivial surface edits like punctuation and case changes.

Conclusion: Robustness is a critical evaluation dimension for vision-language models beyond benchmark accuracy, highlighting vulnerability to query variations.

Abstract: Multimodal co-embedding models, especially CLIP, have advanced the state of the art in zero-shot classification and multimedia information retrieval in recent years by aligning images and text in a shared representation space. However, such models trained on a contrastive alignment can lack stability towards small input perturbations. Especially when dealing with manually expressed queries, minor variations in the query can cause large differences in the ranking of the best-matching results. In this paper, we present a systematic analysis of the effect of multiple classes of non-semantic query perturbations in a multimedia information retrieval scenario. We evaluate a diverse set of lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. Across models, we find that syntactic and semantic perturbations drive the largest instabilities, while brittleness is concentrated in trivial surface edits such as punctuation and case. Our results highlight robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.

eess.AS

[339] Synthesizing speech with selected perceptual voice qualities - A case study with creaky voice

Frederik Rautenberg, Fritz Seebauer, Jana Wiechmann, Michael Kuhlmann, Petra Wagner, Reinhold Haeb-Umbach

Main category: eess.AS

TL;DR: A TTS system with normalizing flows can manipulate creaky voice without needing unreliable frame-wise creak prediction.

Details

Motivation: Control of perceptual voice qualities in TTS is useful for illustrating phonetic concepts through manipulated speech probes.

Method: Augmented TTS system with global speaker attribute manipulation block based on normalizing flows to manipulate creaky voice quality.

Result: Successfully manipulated creak without frame-wise predictor; subjective tests confirmed manipulation with slightly reduced MOS compared to original recording.

Conclusion: Normalizing flows enable effective creaky voice manipulation in TTS systems, avoiding unreliable frame-wise prediction methods.

Abstract: The control of perceptual voice qualities in a text-to-speech (TTS) system is of interest for applications where unmanipulated and manipulated speech probes can serve to illustrate phonetic concepts that are otherwise difficult to grasp. Here, we show that a TTS system that is augmented with a global speaker attribute manipulation block based on normalizing flows is capable of correctly manipulating the non-persistent, localized quality of creaky voice, thus avoiding the necessity of a typically unreliable frame-wise creak predictor. Subjective listening tests confirm successful creak manipulation at a slightly reduced MOS compared to the original recording.

[340] Comparative Study on Noise-Augmented Training and its Effect on Adversarial Robustness in ASR Systems

Karla Pizzi, Matías Pizarro, Asja Fischer

Main category: eess.AS

TL;DR: Noise-augmented training improves both performance on noisy speech and adversarial robustness in ASR systems across different architectures.

Details

Motivation: To investigate whether noise-augmented training can concurrently improve adversarial robustness in automatic speech recognition systems.

Method: Comparative analysis of four ASR architectures trained under three augmentation conditions: (1) background noise, speed variations, and reverberations; (2) speed variations only; (3) no data augmentation. Models evaluated against white-box and black-box adversarial attacks.

Result: Noise augmentation enhances model performance on noisy speech and improves robustness to adversarial attacks.

Conclusion: Noise-augmented training provides dual benefits of improved performance on noisy speech and enhanced adversarial robustness in ASR systems.

Abstract: In this study, we investigate whether noise-augmented training can concurrently improve adversarial robustness in automatic speech recognition (ASR) systems. We conduct a comparative analysis of the adversarial robustness of four different ASR architectures, each trained under three different augmentation conditions: (1) background noise, speed variations, and reverberations; (2) speed variations only; (3) no data augmentation. We then evaluate the robustness of all resulting models against attacks with white-box or black-box adversarial examples. Our results demonstrate that noise augmentation not only enhances model performance on noisy speech but also improves the model’s robustness to adversarial attacks.
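
The background-noise part of the first augmentation condition can be sketched as mixing a noise clip into speech at a chosen signal-to-noise ratio. This is a generic recipe, not the authors' pipeline: the function name, the use of mean power, and the SNR value in the usage note are assumptions, and speed variation and reverberation are not shown.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise
```

Calling `add_noise_at_snr(speech, noise, 10.0)` yields a mixture whose speech is 10 dB louder than the added noise; both arrays are assumed to be equal-length, non-silent waveforms.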

eess.IV

[341] LG-NuSegHop: A Local-to-Global Self-Supervised Pipeline For Nuclei Instance Segmentation

Vasileios Magoulianitis, Catherine A. Alexander, Jiaxin Yang, C. -C. Jay Kuo

Main category: eess.IV

TL;DR: LG-NuSegHop is a self-supervised nuclei segmentation method that uses no manual annotations but achieves competitive performance with supervised methods through local processing, novel feature extraction, and global post-processing.

Details

Motivation: Nuclei segmentation is crucial for disease diagnosis but labor-intensive and requires expert physicians. Deep learning models struggle with generalization across different organ tissues and domains due to expensive data annotations.

Method: Three-module pipeline: (1) local processing operations to generate pseudolabels, (2) NuSegHop - novel data-driven feature extraction model, (3) global operations for post-processing predictions. Entirely self-supervised with no manual annotations.

Result: Outperforms other self-supervised and weakly supervised methods, achieves competitive performance with fully supervised methods on three public datasets. Maintains good generalization to unseen organs/domains without domain adaptation.

Conclusion: LG-NuSegHop provides an effective self-supervised solution for nuclei segmentation with strong generalization, transparency, and explainability for physicians, eliminating the need for expensive manual annotations.

Abstract: Nuclei segmentation is the cornerstone task in histology image reading, shedding light on the underlying molecular patterns and leading to disease or cancer diagnosis. Yet, it is a laborious task that requires expertise from trained physicians. The large nuclei variability across different organ tissues and acquisition processes challenges the automation of this task. On the other hand, data annotations are expensive to obtain, and thus, Deep Learning (DL) models are challenged to generalize to unseen organs or different domains. This work proposes Local-to-Global NuSegHop (LG-NuSegHop), a self-supervised pipeline developed on prior knowledge of the problem and molecular biology. There are three distinct modules: (1) a set of local processing operations to generate a pseudolabel, (2) NuSegHop, a novel data-driven feature extraction model, and (3) a set of global operations to post-process the predictions of NuSegHop. Notably, even though the proposed pipeline uses no manually annotated training data or domain adaptation, it maintains a good generalization performance on other datasets. Experiments on three publicly available datasets show that our method outperforms other self-supervised and weakly supervised methods while having a competitive standing among fully supervised methods. Remarkably, every module within LG-NuSegHop is transparent and explainable to physicians.

[342] UHDRes: Ultra-High-Definition Image Restoration via Dual-Domain Decoupled Spectral Modulation

S. Zhao, W. Lu, B. Wang, T. Wang, K. Zhang, H. Zhao

Main category: eess.IV

TL;DR: UHDRes is a lightweight dual-domain decoupled spectral modulation framework for UHD image restoration that achieves state-of-the-art performance with only 400K parameters, significantly reducing inference latency and memory usage.

Details

Motivation: Ultra-high-definition (UHD) images often suffer from severe degradations like blur, haze, rain, or low-light conditions, posing significant challenges for image restoration due to high resolution and computational demands.

Method: Proposes a dual-domain decoupled spectral modulation framework that explicitly models amplitude spectrum via lightweight spectrum-domain modulation while restoring phase implicitly through spatial-domain refinement. Uses spatio-spectral fusion mechanism with multi-scale context aggregator and shared gated feed-forward network.

Result: Extensive experiments on five public UHD benchmarks demonstrate state-of-the-art restoration performance with only 400K parameters, while significantly reducing inference latency and memory usage.

Conclusion: UHDRes provides an effective and efficient solution for UHD image restoration, achieving superior performance with minimal computational overhead.

Abstract: Ultra-high-definition (UHD) images often suffer from severe degradations such as blur, haze, rain, or low-light conditions, which pose significant challenges for image restoration due to their high resolution and computational demands. In this paper, we propose UHDRes, a novel lightweight dual-domain decoupled spectral modulation framework for UHD image restoration. It explicitly models the amplitude spectrum via lightweight spectrum-domain modulation, while restoring phase implicitly through spatial-domain refinement. We introduce the spatio-spectral fusion mechanism, which first employs a multi-scale context aggregator to extract local and global spatial features, and then performs spectral modulation in a decoupled manner. It explicitly enhances amplitude features in the frequency domain while implicitly restoring phase information through spatial refinement. Additionally, a shared gated feed-forward network is designed to efficiently promote feature interaction through shared-parameter convolutions and adaptive gating mechanisms. Extensive experimental comparisons on five public UHD benchmarks demonstrate that our UHDRes achieves the state-of-the-art restoration performance with only 400K parameters, while significantly reducing inference latency and memory usage. The codes and models are available at https://github.com/Zhao0100/UHDRes.
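
The amplitude/phase decoupling mentioned above rests on a standard Fourier-domain identity, sketched here with NumPy. This is not the UHDRes network: the function names are hypothetical, and the fixed `gain` stands in for whatever learned modulation the model applies. The sketch splits an image spectrum into amplitude and phase, rescales only the amplitude, and transforms back.

```python
import numpy as np


def split_amplitude_phase(img):
    """Return the amplitude and phase of the 2D spectrum of a real image."""
    spec = np.fft.fft2(img)
    return np.abs(spec), np.angle(spec)


def modulate_amplitude(img, gain):
    """Rescale the amplitude spectrum by gain while leaving the phase untouched."""
    amplitude, phase = split_amplitude_phase(img)
    new_spec = (amplitude * gain) * np.exp(1j * phase)
    # The result is real when gain is a scalar or a conjugate-symmetric real map.
    return np.fft.ifft2(new_spec).real
```

With `gain=1.0` the image is returned unchanged, and a scalar gain simply rescales intensities; a frequency-dependent gain array of the same shape as the spectrum would change contrast and sharpness while the phase, which carries most structural layout, is preserved.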

[343] J-SGFT: Joint Spatial and Graph Fourier Domain Learning for Point Cloud Attribute Deblocking

Muhammad Talha, Qi Yang, Zhu Li, Anique Akhtar, Geert Van Der Auwera

Main category: eess.IV

TL;DR: A multi-scale postprocessing framework using graph-Fourier latent attributes, sparse convolutions, and channel-wise attention to reduce blocky artifacts in compressed point clouds, achieving significant BD-rate reductions.

Details

Motivation: Point cloud compression methods like MPEG's GPCC introduce blocky artifacts in reconstructed point clouds, which degrade visual quality despite successful bitrate reduction.

Method: Multi-scale postprocessing framework that fuses graph-Fourier latent attribute representations with sparse convolutions and channel-wise attention for efficient deblocking.

Result: Achieved BD-rate reduction of 18.81% in Y channel and 18.14% in joint YUV on 8iVFBv2 dataset, with markedly improved visual fidelity and minimal overhead.

Conclusion: The proposed framework effectively reduces blocky artifacts in compressed point clouds while maintaining visual quality with minimal computational overhead.

Abstract: Point clouds (PC) are essential for AR/VR and autonomous driving but challenge compression schemes with their size, irregular sampling, and sparsity. MPEG’s Geometry-based Point Cloud Compression (GPCC) methods successfully reduce bitrate; however, they introduce significant blocky artifacts in the reconstructed point cloud. We introduce a novel multi-scale postprocessing framework that fuses graph-Fourier latent attribute representations with sparse convolutions and channel-wise attention to efficiently deblock reconstructed point clouds. Against the GPCC TMC13v14 baseline, our approach achieves BD-rate reduction of 18.81% in the Y channel and 18.14% in the joint YUV on the 8iVFBv2 dataset, delivering markedly improved visual fidelity with minimal overhead.
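
As background for the graph-Fourier representation, the sketch below computes a graph Fourier transform (GFT) of a per-point attribute on a small point cloud. It is a textbook construction rather than the J-SGFT architecture: the dense Gaussian-weighted graph, the `sigma` parameter, and the function names are assumptions, and a practical codec would more likely use a sparse neighborhood graph per block.

```python
import numpy as np


def graph_laplacian(points, sigma=1.0):
    """Combinatorial Laplacian L = D - W with Gaussian edge weights between all points."""
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    W = np.exp(-dist2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return np.diag(W.sum(axis=1)) - W


def gft_basis(L):
    """Eigenvalues (graph frequencies, ascending) and eigenvectors of the Laplacian."""
    return np.linalg.eigh(L)


def gft(signal, eigvecs):
    """Project a per-point attribute onto the graph Fourier basis."""
    return eigvecs.T @ signal


def igft(coeffs, eigvecs):
    """Reconstruct the attribute from its graph Fourier coefficients."""
    return eigvecs @ coeffs
```

Low graph frequencies capture attribute variation that is smooth over nearby points, while blocky compression artifacts tend to appear as energy at higher frequencies, which is one motivation for working in this domain.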

[344] Transporter: A 128×4 SPAD Imager with On-chip Encoder for Spiking Neural Network-based Processing

Yang Lin, Claudio Bruschini, Edoardo Charbon

Main category: eess.IV

TL;DR: Proposes a novel SPAD sensor architecture that eliminates TDCs using in-sensor spike encoders, enabling histogram-free processing with reduced data transfer and real-time edge capabilities.

Details

Motivation: Traditional SPAD architectures rely on TDCs and histogram processing, causing significant data transfer and processing challenges that limit real-time applications.

Method: Integrates in-sensor spike encoders that fold multiple laser periods, transforming phase-based spike trains into density-based spike trains optimized for spiking neural networks and backpropagation through time training.

Result: Developed Transporter, a 128×4 SPAD sensor with per-pixel D flip-flop ring-based spike encoder, demonstrating efficient neuromorphic imaging with compressed data and real-time processing.

Conclusion: This approach enables more efficient neuromorphic SPAD imaging systems with reduced data overhead and enhanced real-time processing capabilities.

Abstract: Single-photon avalanche diodes (SPADs) are widely used today in time-resolved imaging applications. However, traditional architectures rely on time-to-digital converters (TDCs) and histogram-based processing, leading to significant data transfer and processing challenges. Previous work based on recurrent neural networks has realized histogram-free processing. To further address these limitations, we propose a novel paradigm that eliminates TDCs by integrating in-sensor spike encoders. This approach enables preprocessing of photon arrival events in the sensor while significantly compressing data, reducing complexity, and maintaining real-time edge processing capabilities. A dedicated spike encoder folds multiple laser repetition periods, transforming phase-based spike trains into density-based spike trains optimized for spiking neural network processing and training via backpropagation through time. As a proof of concept, we introduce Transporter, a 128×4 SPAD sensor with a per-pixel D flip-flop ring-based spike encoder, designed for intelligent active time-resolved imaging. This work demonstrates a path toward more efficient, neuromorphic SPAD imaging systems with reduced data overhead and enhanced real-time processing.

[345] Generative Autoregressive Transformers for Model-Agnostic Federated MRI Reconstruction

Valiyeh A. Nezhad, Gokberk Elmas, Bilal Kabas, Fuat Arslan, Emine U. Saritas, Tolga Çukur

Main category: eess.IV

TL;DR: FedGAT is a model-agnostic federated learning method that trains a global generative prior for MRI reconstruction and uses site-specific prompting to adapt to multi-site data, enabling heterogeneous model architectures across institutions.

Details

Motivation: Single-site MRI reconstruction models trained on limited local datasets show poor generalization, and conventional federated learning requires architectural homogeneity, restricting sites from using models tailored to their resources or needs.

Method: Two-tier approach: 1) Collaboratively train a global generative prior adapted from a natural image foundation model (VAE + transformer), then fine-tune with site-specific prompting; 2) Each site independently trains preferred reconstruction models using local data augmented with synthetic MRI data from other sites generated by the tuned prior.

Result: FedGAT outperforms state-of-the-art FL baselines in both within- and cross-site reconstruction performance under model-heterogeneous settings on multi-institutional datasets.

Conclusion: FedGAT enables effective collaborative MRI reconstruction across institutions while preserving privacy and allowing model heterogeneity, improving generalization performance compared to conventional FL approaches.

Abstract: While learning-based models hold great promise for MRI reconstruction, single-site models trained on limited local datasets often show poor generalization. This has motivated collaborative training across institutions via federated learning (FL), a privacy-preserving framework that aggregates model updates instead of sharing raw data. Conventional FL requires architectural homogeneity, restricting sites from using models tailored to their resources or needs. To address this limitation, we propose FedGAT, a model-agnostic FL technique that first collaboratively trains a global generative prior for MR images, adapted from a natural image foundation model composed of a variational autoencoder (VAE) and a transformer that generates images via spatial-scale autoregression. We fine-tune the transformer module after injecting it with a lightweight site-specific prompting mechanism, keeping the VAE frozen, to efficiently adapt the model to multi-site MRI data. In a second tier, each site independently trains its preferred reconstruction model by augmenting local data with synthetic MRI data from other sites, generated by site-prompting the tuned prior. This decentralized augmentation improves generalization while preserving privacy. Experiments on multi-institutional datasets show that FedGAT outperforms state-of-the-art FL baselines in both within- and cross-site reconstruction performance under model-heterogeneous settings.

[346] Cyst-X: A Federated AI System Outperforms Clinical Guidelines to Detect Pancreatic Cancer Precursors and Reduce Unnecessary Surgery

Hongyi Pan, Gorkem Durak, Elif Keles, Deniz Seyithanoglu, Zheyuan Zhang, Alpay Medetalibeyoglu, Halil Ertugrul Aktas, Andrea Mia Bejar, Ziliang Hong, Yavuz Taktak, Gulbiz Dagoglu Kartal, Mehmet Sukru Erturk, Timurhan Cebeci, Maria Jaramillo Gonzalez, Yury Velichko, Lili Zhao, Emil Agarunov, Federica Proietto Salanitri, Concetto Spampinato, Pallavi Tiwari, Ziyue Xu, Sachin Jambawalikar, Ivo G. Schoots, Marco J. Bruno, Chenchan Huang, Candice W. Bolan, Tamas Gonda, Frank H. Miller, Rajesh N. Keswani, Michael B. Wallace, Ulas Bagci

Main category: eess.IV

TL;DR: Cyst-X is an AI framework for predicting malignancy risk in pancreatic IPMN cysts using MRI scans, achieving higher accuracy than current guidelines and expert radiologists, with potential to improve early pancreatic cancer detection.

Details

Motivation: Pancreatic cancer is projected to be the second-deadliest cancer by 2030, and current guidelines struggle to stratify IPMN malignancy risk, leading to unnecessary surgeries or missed diagnoses of high-risk lesions.

Method: Developed Cyst-X, an AI framework for IPMN risk prediction trained on 1,461 MRI scans from 764 patients across multiple centers, and showed that its performance is maintained in a federated learning setting, which allows collaborative training without compromising patient privacy.

Result: Cyst-X achieved an AUC of 0.82, outperforming the Kyoto guidelines (AUC = 0.75) and expert radiologists, and raised cancer detection sensitivity for high-risk lesions (87.8% vs. 64.1%).

Conclusion: Cyst-X provides superior IPMN risk stratification, and the authors publicly release the dataset and models to accelerate research in early pancreatic cancer detection.

Abstract: Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we developed Cyst-X, an AI framework for IPMN risk prediction trained on a unique, multi-center dataset of 1,461 MRI scans from 764 patients. Cyst-X achieves significantly higher accuracy (AUC = 0.82) than both the established Kyoto guidelines (AUC = 0.75) and expert radiologists, particularly in correct identification of high-risk lesions. Clinically, this translates to a 20% increase in cancer detection sensitivity (87.8% vs. 64.1%) for high-risk lesions. We demonstrate that this performance is maintained in a federated learning setting, allowing for collaborative model training without compromising patient privacy. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset and models, providing the first large-scale, multi-center MRI resource for pancreatic cyst analysis.
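The two sensitivity figures quoted above (87.8% vs. 64.1%) can be compared either as an absolute difference in percentage points or as a relative change. The short sketch below, with a hypothetical helper name, works through both; it uses only the two numbers from the abstract and makes no assumption about how the authors derived their stated increase.

```python
from typing import Tuple


def sensitivity_gain(new: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - baseline
    relative = absolute / baseline * 100.0
    return absolute, relative


abs_gain, rel_gain = sensitivity_gain(new=87.8, baseline=64.1)
print(f"{abs_gain:.1f} percentage points, {rel_gain:.1f}% relative")
# prints "23.7 percentage points, 37.0% relative"
```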

[347] USF-MAE: Ultrasound Self-Supervised Foundation Model with Masked Autoencoding

Youssef Megahed, Robin Ducharme, Aylin Erman, Mark Walker, Steven Hawken, Adrian D. C. Chan

Main category: eess.IV

TL;DR: USF-MAE is the first large-scale self-supervised masked autoencoding framework pretrained exclusively on ultrasound data, achieving state-of-the-art performance on multiple downstream classification tasks while demonstrating strong cross-anatomical generalization.

Details

Motivation: Ultrasound image interpretation is challenging due to noise, operator dependence, and limited field of view. Current deep learning approaches are limited by scarce labeled datasets and domain gaps between general and sonographic images.

Method: Developed USF-MAE using a Vision Transformer encoder-decoder architecture pretrained on 370,000 2D and 3D ultrasound images from 46 open-source datasets (OpenUS-46) spanning over twenty anatomical regions. The model reconstructs masked image patches to learn modality-specific representations from unlabeled data.

Result: Outperformed conventional CNN and ViT baselines on three downstream classification benchmarks: BUS-BRA (breast cancer, 81.6% F1), MMOTU-2D (ovarian tumors, 79.6% F1), and GIST514-DB (gastrointestinal stromal tumors, 82.4% F1). Despite using no labels during pretraining, it approached the supervised foundation model UltraSam on breast cancer classification and surpassed it on the other two tasks.

Conclusion: USF-MAE demonstrates that large-scale self-supervised pretraining on ultrasound-specific data enables strong cross-anatomical generalization and outperforms models pretrained on non-medical data, addressing key challenges in ultrasound AI.

Abstract: Ultrasound imaging is one of the most widely used diagnostic modalities, offering real-time, radiation-free assessment across diverse clinical domains. However, interpretation of ultrasound images remains challenging due to high noise levels, operator dependence, and limited field of view, resulting in substantial inter-observer variability. Current Deep Learning approaches are hindered by the scarcity of large labeled datasets and the domain gap between general and sonographic images, which limits the transferability of models pretrained on non-medical data. To address these challenges, we introduce the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), the first large-scale self-supervised MAE framework pretrained exclusively on ultrasound data. The model was pre-trained on 370,000 2D and 3D ultrasound images curated from 46 open-source datasets, collectively termed OpenUS-46, spanning over twenty anatomical regions. This curated dataset has been made publicly available to facilitate further research and reproducibility. Using a Vision Transformer encoder-decoder architecture, USF-MAE reconstructs masked image patches, enabling it to learn rich, modality-specific representations directly from unlabeled data. The pretrained encoder was fine-tuned on three public downstream classification benchmarks: BUS-BRA (breast cancer), MMOTU-2D (ovarian tumors), and GIST514-DB (gastrointestinal stromal tumors). Across all tasks, USF-MAE consistently outperformed conventional CNN and ViT baselines, achieving F1-scores of 81.6%, 79.6%, and 82.4%, respectively. Despite not using labels during pretraining, USF-MAE approached the performance of the supervised foundation model UltraSam on breast cancer classification and surpassed it on the other tasks, demonstrating strong cross-anatomical generalization.
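The masked-patch reconstruction objective described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the USF-MAE implementation: the 75% mask ratio, the patch size, the helper names, and the synthetic 8x8 "image" are all assumptions, and a real model would replace the placeholder prediction with a Vision Transformer decoder output.

```python
import random
from typing import List, Sequence, Tuple

Patch = List[float]


def patchify(image: Sequence[Sequence[float]], patch: int) -> List[Patch]:
    """Split a 2D image into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def random_mask(num_patches: int, mask_ratio: float,
                seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (visible indices, masked indices) for a random patch mask."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def masked_mse(target: List[Patch], prediction: List[Patch],
               masked: List[int]) -> float:
    """Mean squared error computed over the masked patches only."""
    if not masked:
        raise ValueError("at least one patch must be masked")
    squared = [(t - p) ** 2
               for i in masked
               for t, p in zip(target[i], prediction[i])]
    return sum(squared) / len(squared)


# Synthetic 8x8 image split into four 4x4 patches (assumed sizes).
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
patches = patchify(image, patch=4)
visible, masked = random_mask(len(patches), mask_ratio=0.75)

# Placeholder "reconstructions": an all-zero guess and a perfect copy.
blank = [[0.0] * len(p) for p in patches]
loss_blank = masked_mse(patches, blank, masked)
loss_perfect = masked_mse(patches, patches, masked)
print(visible, masked, loss_blank, loss_perfect)
```

The encoder would see only the `visible` patches, and the loss is taken over the `masked` ones, so no labels are needed for pretraining.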

Table of Contents

cs.CL

[1] Evaluating LLMs’ Reasoning Over Ordered Procedural Steps

[2] Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

[3] SARC: Sentiment-Augmented Deep Role Clustering for Fake News Detection

[4] Reasoning Up the Instruction Ladder for Controllable Language Models

[5] EncouRAGe: Evaluating RAG Local, Fast, and Reliable

[6] multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder

[7] Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding

[8] Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation

[9] Measuring what Matters: Construct Validity in Large Language Model Benchmarks

[10] POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

[11] GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models

[12] First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

[13] Learning to reason about rare diseases through retrieval-augmented agents

[14] Surprisal reveals diversity gaps in image captioning and different scorers change the story

[15] Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

[16] Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

[17] Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

[18] SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

[19] BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models

[20] AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

[21] Diagnosing and Mitigating Semantic Inconsistencies in Wikidata’s Classification Hierarchy

[22] LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

[23] Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

[24] Acquiring Common Chinese Emotional Events Using Large Language Model

[25] Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies

[26] UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian

[27] Order-Level Attention Similarity Across Language Models: A Latent Commonality

[28] Reasoning-Guided Claim Normalization for Noisy Multilingual Social Media Posts

[29] On Text Simplification Metrics and General-Purpose LLMs for Accessible Health Information, and A Potential Architectural Advantage of The Instruction-Tuned LLM class

[30] Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

[31] A Toolbox for Improving Evolutionary Prompt Search

[32] ManufactuBERT: Efficient Continual Pretraining for Manufacturing

[33] Mind the Gap… or Not? How Translation Errors and Evaluation Details Skew Multilingual Results

[34] Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

[35] Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

[36] Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

[37] Listening Between the Lines: Decoding Podcast Narratives with Language Modeling

[38] What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions

[39] Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

[40] A multimodal multiplex of the mental lexicon for multilingual individuals

[41] Large Language Models for Explainable Threat Intelligence

[42] Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning

[43] Steering Language Models with Weight Arithmetic

[44] MIMIC-SR-ICD11: A Dataset for Narrative-Based Diagnosis

[45] To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models

[46] LEME: Open Large Language Models for Ophthalmology with Advanced Reasoning and Clinical Validation

[47] Extracting narrative signals from public discourse: a network-based approach

[48] iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use

[49] Activation-Informed Merging of Large Language Models

[50] NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

[51] InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

[52] Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews

[53] Exploring Multimodal Perception in Large Language Models Through Perceptual Strength Ratings

[54] MorphTok: Morphologically Grounded Tokenization for Indian Languages

[55] Fair Document Valuation in LLM Summaries via Shapley Values

[56] ProRefine: Inference-Time Prompt Refinement with Textual Feedback

[57] Scalable Medication Extraction and Discontinuation Identification from Electronic Health Records Using Large Language Models

[58] NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance

[59] P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

[60] Learning Dynamics of Meta-Learning in Small Model Pretraining

[61] Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration

[62] DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models

[63] SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM

[64] Are Humans as Brittle as Large Language Models?

[65] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems

[66] Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents

[67] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

[68] What Can String Probability Tell Us About Grammaticality?

[69] Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

[70] Re:Member: Emotional Question Generation from Personal Memories

[71] Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

[72] Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

[73] ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai

cs.CV

[74] GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder

[75] Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges

[76] IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs