Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 66]
- cs.CV [Total: 118]
- cs.AI [Total: 55]
- cs.SD [Total: 4]
- cs.LG [Total: 183]
- cs.MA [Total: 9]
- cs.MM [Total: 3]
- eess.AS [Total: 6]
- eess.IV [Total: 10]
cs.CL
[1] Multi-Personality Generation of LLMs at Decoding-time
Rongxin Chen, Yunfan Li, Yige Yuan, Bingbing Xu, Huawei Shen
Main category: cs.CL
TL;DR: Proposes MPG framework for multi-personality generation in LLMs using decoding-time combination with speculative chunk-level rejection sampling, achieving up to 16-18% improvements without retraining.
Details
Motivation: Existing methods for multi-personality generation are either costly (retraining-based) or limited (external models/heuristics). Need flexible, robust approach without extra training.
Method: MPG framework uses implicit density ratios from single-dimensional models to sample from target strategy. Implements Speculative Chunk-level Rejection sampling (SCR) for efficient generation with parallel validation.
Result: Experiments on MBTI personality and Role-Playing show effectiveness with improvements up to 16-18% over existing methods.
Conclusion: MPG provides flexible multi-personality control without extra training, leveraging implicit density ratios as ‘free lunch’ and achieving significant performance gains.
Abstract: Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a “free lunch” to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and parallelly validates them via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at https://github.com/Libra117/MPG .
[2] Rethinking LLM Human Simulation: When a Graph is What You Need
Joseph Suh, Suhong Moon, Serina Chang
Main category: cs.CL
TL;DR: GEMS uses graph neural networks as a lightweight alternative to LLMs for human simulation tasks, achieving comparable accuracy with much greater efficiency and interpretability.
Details
Motivation: To determine if smaller, domain-grounded models can effectively replace large language models for human simulation tasks, particularly in discrete choice scenarios.
Method: Proposes Graph-basEd Models for human Simulation (GEMS) that frames discrete choice simulation as link prediction on graphs, using GNNs and incorporating language representations only when necessary.
Result: GEMS matches or surpasses LLM performance across three simulation datasets while being three orders of magnitude smaller, with better efficiency, interpretability, and transparency.
Conclusion: Graph-based modeling offers a promising lightweight alternative to LLMs for human simulation, particularly for discrete choice problems.
Abstract: Large language models (LLMs) are increasingly used to simulate humans, with applications ranging from survey prediction to decision-making. However, are LLMs strictly necessary, or can smaller, domain-grounded models suffice? We identify a large class of simulation problems in which individuals make choices among discrete options, where a graph neural network (GNN) can match or surpass strong LLM baselines despite being three orders of magnitude smaller. We introduce Graph-basEd Models for human Simulation (GEMS), which casts discrete choice simulation tasks as a link prediction problem on graphs, leveraging relational knowledge while incorporating language representations only when needed. Evaluations across three key settings on three simulation datasets show that GEMS achieves comparable or better accuracy than LLMs, with far greater efficiency, interpretability, and transparency, highlighting the promise of graph-based modeling as a lightweight alternative to LLMs for human simulation. Our code is available at https://github.com/schang-lab/gems.
[3] IG-Pruning: Input-Guided Block Pruning for Large Language Models
Kangyu Qiao, Shaolei Zhang, Yang Feng
Main category: cs.CL
TL;DR: IG-Pruning is an input-aware block-wise pruning method that dynamically selects layer masks at inference time to reduce computational costs of large language models, outperforming static depth pruning methods.
Details
Motivation: With growing computational demands of LLMs, efficient inference is critical for practical deployment. Existing depth pruning methods use fixed block masks leading to suboptimal performance across different tasks and inputs.
Method: Two-stage approach: (1) Discovering diverse mask candidates through semantic clustering and L0 optimization, (2) Implementing efficient dynamic pruning without extensive training.
Result: Experimental results show the method consistently outperforms state-of-the-art static depth pruning methods.
Conclusion: IG-Pruning is particularly suitable for resource-constrained deployment scenarios due to its dynamic input-aware approach.
Abstract: With the growing computational demands of large language models (LLMs), efficient inference has become increasingly critical for practical deployment. Depth pruning has emerged as a promising approach for reducing the computational costs of large language models by removing transformer layers. However, existing methods typically rely on fixed block masks, which can lead to suboptimal performance across different tasks and inputs. In this paper, we propose IG-Pruning, a novel input-aware block-wise pruning method that dynamically selects layer masks at inference time. Our approach consists of two stages: (1) Discovering diverse mask candidates through semantic clustering and L0 optimization, and (2) Implementing efficient dynamic pruning without the need for extensive training. Experimental results demonstrate that our method consistently outperforms state-of-the-art static depth pruning methods, making it particularly suitable for resource-constrained deployment scenarios.
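The idea of picking a layer mask per input can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the authors' implementation: the input is routed to the nearest cluster centroid, each cluster owns one precomputed block mask, and masked-out blocks are simply skipped. The names `nearest_centroid` and `run_with_mask`, and the toy "blocks", are hypothetical.

```python
def nearest_centroid(embedding, centroids):
    """Return the index of the centroid closest to the input embedding."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centroids)), key=lambda i: sq_dist(embedding, centroids[i]))


def run_with_mask(x, blocks, mask):
    """Apply only the blocks whose mask entry is 1; skip the rest."""
    for block, keep in zip(blocks, mask):
        if keep:
            x = block(x)
    return x


# Toy stand-ins for transformer blocks (hypothetical).
blocks = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

# One candidate mask per cluster (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0]]
masks = [[1, 1, 1], [1, 0, 1]]

embedding = [0.9, 0.8]                      # embedding of the incoming input
cluster = nearest_centroid(embedding, centroids)
output = run_with_mask(5, blocks, masks[cluster])
```

A static depth-pruning method would correspond to using the same mask for every input; the sketch only shows where the input-dependent choice enters.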
[4] Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results
Jonathan Liu, Haoling Qiu, Jonathan Lasko, Damianos Karakos, Mahsa Yarmohammadi, Mark Dredze
Main category: cs.CL
TL;DR: This paper develops an infrastructure to automatically generate queries and evaluate LLM responses in medical contexts, finding low inter-LLM agreement and recommending multiple evaluators for reliable results.
Details
Motivation: To understand when medical chatbots fail due to non-medical factors like demographics, as hallucinations and biases are prevalent in LLMs used in medical contexts.
Method: Created infrastructure with: 1) automated query generation sampling patient demographics, histories, disorders, and writing styles; 2) evaluation pipeline using multiple LLM-as-a-judge setups for hallucination/omission detection and treatment category analysis.
Result: Found low inter-LLM agreement (average Cohen’s Kappa κ=0.118), with only specific LLM pairs showing significant differences across writing styles, genders, and races.
Conclusion: Recommend using multiple LLM evaluators to avoid non-generalizable results and publishing inter-LLM agreement metrics for transparency, especially when ground-truth data is unavailable.
Abstract: Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen’s Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: https://github.com/BBN-E/medic-neurips-2025-demo.
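For readers unfamiliar with the agreement metric reported above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that standard formula; the judge labels are made-up illustration data, not results from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each judge's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both judges always emit the same single label
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical verdicts from two LLM judges on ten answers.
judge_1 = ["ok", "ok", "halluc", "ok", "halluc", "ok", "ok", "halluc", "ok", "ok"]
judge_2 = ["ok", "halluc", "halluc", "ok", "ok", "ok", "halluc", "ok", "ok", "ok"]
kappa = cohens_kappa(judge_1, judge_2)
```

In this toy example the judges agree on 6 of 10 items, but because both label most answers "ok", chance agreement is already 0.58, so kappa is close to zero. This is why raw agreement rates can look acceptable while kappa stays low.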
[5] Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation
Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park
Main category: cs.CL
TL;DR: M-Solomon is a universal multimodal embedder that adaptively determines when to augment queries, improving retrieval performance while reducing latency by only augmenting queries that need it.
Details
Motivation: Current LLM-based embedders augment every query, causing substantial latency and potentially harming performance for some queries. Previous methods also haven't been explored in multimodal environments.
Method: Divides training queries into two groups (need augmentation vs don’t), generates appropriate augmentations using MLLM, and implements adaptive query augmentation where the model learns to generate /augment prefix for queries needing augmentation and /embed for others.
Result: M-Solomon surpassed the baseline without augmentation by a large margin and outperformed the baseline that always used augmentation, while providing much faster embedding latency.
Conclusion: Adaptive query augmentation with M-Solomon effectively balances retrieval performance and computational efficiency in multimodal environments.
Abstract: Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduce a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
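The /augment versus /embed convention suggests a simple routing step at inference time. The sketch below is one plausible reading, not the authors' code: it assumes the model's generated string is inspected, and that text following /augment is appended to the query before embedding. The function name `build_embedding_input` is hypothetical.

```python
AUGMENT_PREFIX = "/augment"
EMBED_PREFIX = "/embed"


def build_embedding_input(query, generated):
    """Decide what text to embed, based on the model's generated prefix.

    Assumption: "/augment <text>" means append <text> to the query;
    anything else (including "/embed") means embed the query unchanged.
    """
    generated = generated.strip()
    if generated.startswith(AUGMENT_PREFIX):
        augmentation = generated[len(AUGMENT_PREFIX):].strip()
        return f"{query} {augmentation}" if augmentation else query
    return query


# Hypothetical model outputs for two queries.
plain = build_embedding_input("red shoes", "/embed")
augmented = build_embedding_input(
    "jaguar speed", "/augment top speed of the jaguar cat"
)
```

Under this reading, the latency saving comes from the /embed case: the model emits a single short string and no augmentation text has to be generated.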
[6] LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Liuhao Lin, Ke Li, Zihan Xu, Yuchen Shi, Yulei Qin, Yan Zhang, Xing Sun, Rongrong Ji
Main category: cs.CL
TL;DR: LTD-Bench is a visual evaluation benchmark that transforms LLM assessment from numerical scores to observable drawings, exposing fundamental spatial reasoning limitations in current models.
Details
Motivation: Current LLM evaluation relies on opaque numerical metrics that conceal spatial reasoning limitations and create a dangerous disconnect between reported performance and practical abilities for real-world applications.
Method: Uses drawing generation through dot matrices or executable code with complementary generation tasks (spatial imagination) and recognition tasks (spatial perception) across three difficulty levels, evaluating bidirectional language-spatial mapping.
Result: Exposes alarming capability gap - even state-of-the-art LLMs show profound deficiencies in establishing bidirectional mappings between language and spatial concepts, undermining their potential as genuine world models.
Conclusion: LTD-Bench’s visual outputs enable immediate identification of spatial reasoning limitations and offer powerful diagnostic analysis for investigating model similarity, bridging the gap between statistical performance and intuitive assessment.
Abstract: Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research, relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concepts, a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench’s visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
[7] LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context
Yudong Li, Zhongliang Yang, Kejiang Chen, Wenxuan Wang, Tianxin Zhang, Sifang Wan, Kecheng Wang, Haitian Li, Xu Wang, Lefan Cheng, Youdan Yang, Baocheng Chen, Ziyu Liu, Yufei Sun, Liyan Wu, Wenya Wen, Xingchi Gu, Peiru Yang
Main category: cs.CL
TL;DR: LiveSecBench is a dynamic safety benchmark for Chinese-language LLMs that evaluates models across six safety dimensions based on Chinese legal and social frameworks, with continuous updates to maintain relevance.
Details
Motivation: To address the need for a comprehensive safety evaluation framework specifically designed for Chinese-language LLM applications, considering China's unique legal and social context.
Method: Developed a benchmark that evaluates models across six dimensions: Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety, with a dynamic update schedule to incorporate new threat vectors.
Result: Evaluated 18 LLMs in the current version (v251030), providing insights into AI safety landscape for Chinese language models. The benchmark includes a publicly accessible leaderboard.
Conclusion: LiveSecBench establishes a foundational framework for continuously monitoring and improving safety in Chinese-language LLMs, with plans to expand to additional safety dimensions like Text-to-Image Generation Safety and Agentic Safety.
Abstract: In this work, we propose LiveSecBench, a dynamic and continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench evaluates models across six critical dimensions (Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) rooted in the Chinese legal and social frameworks. This benchmark maintains relevance through a dynamic update schedule that incorporates new threat vectors, such as the planned inclusion of Text-to-Image Generation Safety and Agentic Safety in the next update. For now, LiveSecBench (v251030) has evaluated 18 LLMs, providing a landscape of AI safety in the context of Chinese language. The leaderboard is publicly accessible at https://livesecbench.intokentech.cn/.
[8] AyurParam: A State-of-the-Art Bilingual Language Model for Ayurveda
Mohd Nauman, Sravan Gvm, Vijay Devane, Shyam Pawar, Viraj Thakur, Kundeshwar Pundalik, Piyush Sawarkar, Rohit Saluja, Maunendra Desarkar, Ganesh Ramakrishnan
Main category: cs.CL
TL;DR: AyurParam-2.9B is a specialized bilingual language model for Ayurveda that outperforms larger models in traditional medical knowledge tasks through expert-curated training data.
Details
Motivation: Mainstream LLMs underperform in specialized domains like Ayurveda that require deep cultural, linguistic, and subject-matter expertise from centuries of traditional medical knowledge.
Method: Fine-tuned Param-1-2.9B model using extensive, expertly curated Ayurveda dataset with context-aware reasoning and bilingual Q&A in English and Hindi, with rigorous annotation for factual precision.
Result: Surpasses all open-source instruction-tuned models in its size class (1.5-3B parameters) and shows competitive/superior performance compared to much larger models on BhashaBench-Ayur benchmark.
Conclusion: Authentic domain adaptation and high-quality supervision are essential for delivering reliable, culturally congruent AI in specialized medical domains like Ayurveda.
Abstract: Current large language models excel at broad, general-purpose tasks, but consistently underperform when exposed to highly specialized domains that require deep cultural, linguistic, and subject-matter expertise. In particular, traditional medical systems such as Ayurveda embody centuries of nuanced textual and clinical knowledge that mainstream LLMs fail to accurately interpret or apply. We introduce AyurParam-2.9B, a domain-specialized, bilingual language model fine-tuned from Param-1-2.9B using an extensive, expertly curated Ayurveda dataset spanning classical texts and clinical guidance. AyurParam’s dataset incorporates context-aware, reasoning, and objective-style Q&A in both English and Hindi, with rigorous annotation protocols for factual precision and instructional clarity. Benchmarked on BhashaBench-Ayur, AyurParam not only surpasses all open-source instruction-tuned models in its size class (1.5–3B parameters), but also demonstrates competitive or superior performance compared to much larger models. The results from AyurParam highlight the necessity for authentic domain adaptation and high-quality supervision in delivering reliable, culturally congruent AI for specialized medical knowledge.
[9] AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Aashray Reddy, Andrew Zagula, Nicholas Saban
Main category: cs.CL
TL;DR: AutoAdv is a training-free framework for automated multi-turn jailbreaking attacks on LLMs, achieving up to 95% success rate on Llama-3.1-8B within six turns, showing current safety mechanisms fail against adaptive multi-turn conversations.
Details
Motivation: Current LLM safety evaluations focus mainly on single-turn interactions, while real-world attacks occur through adaptive multi-turn conversations, creating a gap in understanding and defending against persistent vulnerabilities.
Method: AutoAdv combines three adaptive mechanisms: pattern manager that learns from successful attacks, temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them.
Result: Achieves up to 95% attack success rate on Llama-3.1-8B within six turns (24% improvement over single-turn baselines), with consistent vulnerabilities found across commercial and open-source models including GPT-4o-mini, Qwen3-235B, and Mistral-7B.
Conclusion: Current alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses in LLM safety mechanisms.
Abstract: Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
[10] Merging Continual Pretraining Models for Domain-Specialized LLMs: A Case Study in Finance
Kentaro Ueda, François Portet, Hirohiko Suwa, Keiichi Yasumoto
Main category: cs.CL
TL;DR: This paper explores merging Continual Pre-training (CPT) experts in specialized domains like finance, proposing a three-stage evaluation framework and comparing three merging methods to build multi-skill LLMs from existing domain experts.
Details
Motivation: LLMs struggle in specialized domains requiring diverse skills like finance knowledge, mathematical reasoning, and multilingual processing. Merging CPT experts offers a practical alternative to costly multi-skill training, but CPT model merging remains largely unexplored compared to established SFT model merging.
Method: Created financial LLMs from experts in finance, math, and Japanese. Proposed three-stage evaluation (knowledge recovery, complementarity, emergence) and assessed three merging methods (Task Arithmetic, TIES, DARE-TIES) on a comprehensive financial benchmark with 18 tasks across 8 datasets.
Result: Merging an expert with its base model recovers general knowledge lost during CPT, while merging experts improves performance and can yield emergent cross-domain skills. Task Arithmetic performs strongly but is hyperparameter-sensitive, while TIES is more robust. Model similarity correlates with merging success, but emergent skills depend on more complex factors.
Conclusion: This work presents the first foundational analysis of CPT model merging, establishing a principled framework and providing clear guidance for building multi-skill LLMs from existing assets in specialized domains.
Abstract: While LLMs excel at general tasks, they struggle in specialized domains like finance, requiring diverse skills in domain knowledge, mathematical reasoning, and multilingual processing. Merging domain-specific Continual Pre-training (CPT) “experts” offers a practical alternative to costly and unstable multi-skill training. However, unlike established Supervised Fine-Tuning (SFT) model-based merging, CPT model merging remains largely unexplored. We address this gap by creating financial LLMs from experts in finance, math, and Japanese. We propose a three-stage evaluation focusing on knowledge recovery, complementarity, and emergence, and assess three merging methods (Task Arithmetic, TIES, and DARE-TIES) on a comprehensive financial benchmark curated from 18 tasks across 8 established datasets. Results show that merging an expert with its base model recovers general knowledge lost during CPT, while merging experts improves performance and can yield emergent cross-domain skills. Among the methods, Task Arithmetic performs strongly but is hyperparameter-sensitive, whereas TIES is more robust. Our findings also suggest that while model similarity correlates with merging success, emergent skills depend on more complex factors. This work presents the first foundational analysis of CPT model merging, establishing a principled framework and providing clear guidance for building multi-skill LLMs from existing assets.
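Of the three merging methods named above, Task Arithmetic is the simplest to write down: each expert contributes a "task vector" (expert weights minus base weights), and the merged model is the base plus a scaled sum of those vectors. The sketch below illustrates that general formula on toy weights. It is not the paper's code; the parameter name, the weight values, and the `scale` setting are all made up for illustration.

```python
def task_arithmetic_merge(base, experts, scale=1.0):
    """Merge experts into a base model: base + scale * sum(expert - base).

    `base` and each expert map parameter names to flat lists of floats.
    `scale` is the merging coefficient, a hyperparameter to be tuned.
    """
    merged = {}
    for name, base_w in base.items():
        # Accumulate the task vectors (expert minus base) for this parameter.
        delta = [0.0] * len(base_w)
        for expert in experts:
            for i, (w_e, w_b) in enumerate(zip(expert[name], base_w)):
                delta[i] += w_e - w_b
        merged[name] = [w_b + scale * d for w_b, d in zip(base_w, delta)]
    return merged


# Toy weights for one parameter tensor (hypothetical values).
base = {"layer.weight": [1.0, 2.0, 3.0]}
finance_expert = {"layer.weight": [1.5, 2.0, 3.0]}
math_expert = {"layer.weight": [1.0, 2.0, 2.0]}

merged = task_arithmetic_merge(base, [finance_expert, math_expert], scale=0.5)
```

The `scale` coefficient is the kind of hyperparameter the summary refers to when it calls Task Arithmetic hyperparameter-sensitive. TIES and DARE-TIES add pruning and sign-resolution steps on top of the task vectors, which this sketch does not attempt to show.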
[11] Prompting for Policy: Forecasting Macroeconomic Scenarios with Synthetic LLM Personas
Giulia Iadisernia, Carolina Camassa
Main category: cs.CL
TL;DR: Persona-based prompting doesn’t improve LLM macroeconomic forecasting accuracy; GPT-4o achieves competitive performance comparable to human experts without needing persona descriptions.
Details
Motivation: To evaluate whether persona-based prompting improves LLM performance on macroeconomic forecasting tasks compared to human experts.
Method: Used 2,368 economics personas from PersonaHub to prompt GPT-4o for ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025), comparing against human experts and baseline forecasts without personas.
Result: GPT-4o and human forecasters achieved remarkably similar accuracy levels; persona descriptions provided no measurable forecasting advantage; GPT-4o maintained competitive performance on out-of-sample data (2024-2025).
Conclusion: Persona descriptions can be omitted to reduce computational costs without sacrificing accuracy; GPT-4o can achieve competitive macroeconomic forecasting with relevant context data alone.
Abstract: We evaluate whether persona-based prompting improves Large Language Model (LLM) performance on macroeconomic forecasting tasks. Using 2,368 economics-related personas from the PersonaHub corpus, we prompt GPT-4o to replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025). We compare the persona-prompted forecasts against the human experts panel, across four target variables (HICP, core HICP, GDP growth, unemployment) and four forecast horizons. We also compare the results against 100 baseline forecasts without persona descriptions to isolate its effect. We report two main findings. Firstly, GPT-4o and human forecasters achieve remarkably similar accuracy levels, with differences that are statistically significant yet practically modest. Our out-of-sample evaluation on 2024-2025 data demonstrates that GPT-4o can maintain competitive forecasting performance on unseen events, though with notable differences compared to the in-sample period. Secondly, our ablation experiment reveals no measurable forecasting advantage from persona descriptions, suggesting these prompt components can be omitted to reduce computational costs without sacrificing accuracy. Our results provide evidence that GPT-4o can achieve competitive forecasting accuracy even on out-of-sample macroeconomic events, if provided with relevant context data, while revealing that diverse prompts produce remarkably homogeneous forecasts compared to human panels.
[12] Smart-Hiring: An Explainable end-to-end Pipeline for CV Information Extraction and Job Matching
Kenza Khelkhal, Dihia Lanasri
Main category: cs.CL
TL;DR: Smart-Hiring is an NLP pipeline that automatically extracts information from resumes and matches candidates to jobs using semantic similarity, achieving competitive accuracy while maintaining interpretability.
Details
Motivation: Manual resume screening is time-consuming, error-prone, and subject to human bias, creating a need for automated, data-driven hiring solutions.
Method: Combines document parsing, named-entity recognition, and contextual text embedding to encode resumes and job descriptions in a shared vector space for similarity scoring.
Result: The system demonstrates robust performance on real-world datasets across multiple domains, achieving competitive matching accuracy with high interpretability.
Conclusion: Smart-Hiring provides a scalable, practical NLP framework for recruitment with potential for bias mitigation and large-scale deployment of data-driven hiring solutions.
Abstract: Hiring processes often involve the manual screening of hundreds of resumes for each job, a task that is time- and effort-consuming, error-prone, and subject to human bias. This paper presents Smart-Hiring, an end-to-end Natural Language Processing (NLP) pipeline designed to automatically extract structured information from unstructured resumes and to semantically match candidates with job descriptions. The proposed system combines document parsing, named-entity recognition, and contextual text embedding techniques to capture skills, experience, and qualifications. Using advanced NLP techniques, Smart-Hiring encodes both resumes and job descriptions in a shared vector space to compute similarity scores between candidates and job postings. The pipeline is modular and explainable, allowing users to inspect extracted entities and matching rationales. Experiments were conducted on a real-world dataset of resumes and job descriptions spanning multiple professional domains, demonstrating the robustness and feasibility of the proposed approach. The system achieves competitive matching accuracy while preserving a high degree of interpretability and transparency in its decision process. This work introduces a scalable and practical NLP framework for recruitment analytics and outlines promising directions for bias mitigation, fairness-aware modeling, and large-scale deployment of data-driven hiring solutions.
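Scoring candidates against a job posting in a shared vector space can be sketched as follows. The abstract does not say which similarity measure is used, so cosine similarity is an assumption here, as are the tiny three-dimensional "embeddings" and the candidate names. A real system would obtain the vectors from a text embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def rank_candidates(job_vec, resume_vecs):
    """Return (candidate, score) pairs sorted from best to worst match."""
    scored = [(name, cosine_similarity(job_vec, vec))
              for name, vec in resume_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings of one job description and three resumes.
job = [1.0, 0.0, 1.0]
resumes = {
    "candidate_a": [1.0, 0.0, 1.0],
    "candidate_b": [0.0, 1.0, 0.0],
    "candidate_c": [1.0, 1.0, 0.0],
}
ranking = rank_candidates(job, resumes)
```

Returning the score alongside each name keeps the ranking inspectable, in the spirit of the explainability goal the abstract describes.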
[13] The Analysis of Lexical Errors in Machine Translation from English into Romanian
Angela Stamatie
Main category: cs.CL
TL;DR: Analysis of lexical errors in Google Translate’s English-to-Romanian translations of COVID-19 related official documents from WHO, Gavi, and patient information leaflets.
Details
Motivation: To improve Google Translate's accuracy and fluency by identifying and analyzing lexical errors in medical/official document translations, particularly for COVID-19 content.
Method: Comprehensive analysis of 230 texts translated from English to Romanian using Google Translate, focusing on official COVID-19 documents from WHO, Gavi, and patient information leaflets.
Result: Identification of specific lexical errors in machine translations of medical and official COVID-19 documents, providing insights into areas needing improvement.
Conclusion: The research contributes to enhancing Google Translate’s lexical selection accuracy and reducing errors in medical/official document translations, supporting better machine translation quality for critical health information.
Abstract: The research presents an error analysis of Machine Translation from English into Romanian, focusing on lexical errors found in texts that contain official information provided by the World Health Organization (WHO), the Gavi Organization, and patient information leaflets (the information about the active ingredients of the vaccines or the medication, the indications, the dosage instructions, the storage instructions, the side effects and warnings, etc.). All of these texts are related to Covid-19 and have been translated by Google Translate, a multilingual Machine Translation system created by Google. In the last decades, Google has actively worked to develop a more accurate and fluent automatic translation system. This research, specifically focused on improving Google Translate, aims to enhance the overall quality of Machine Translation by achieving better lexical selection and by reducing errors. The investigation involves a comprehensive analysis of 230 texts that have been translated from English into Romanian.
[14] I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy
Gian Maria Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, Jacopo Staiano
Main category: cs.CL
TL;DR: Analysis of LLM agent interactions in simulated hierarchical social environments inspired by prison experiments, revealing how goal setting and personas influence persuasion and anti-social behavior across different models.
Details
Motivation: To study emergent phenomena and potential risks as LLM-based agents become more autonomous and interact freely with each other, particularly in hierarchical power dynamics.
Method: Used 2,400 conversations across six LLMs (LLama3, Orca2, Command-r, Mixtral, Mistral2, gpt4.1) in 240 experimental scenarios simulating guard-prisoner interactions with differing objectives, narrowing to 1,600 conversations after filtering for successful interactions.
Result: Goal setting significantly influences persuasiveness but not anti-social behavior; agent personas (especially guard’s) substantially impact successful persuasion and anti-social actions; anti-social conduct emerges even without explicit negative personality prompts.
Conclusion: The findings have important implications for developing interactive LLM agents and understanding their societal impact, highlighting the need to consider emergent behaviors in multi-agent systems.
Abstract: As LLM-based agents become increasingly autonomous and interact more freely with each other, studying the interplay among them becomes crucial to anticipate emergent phenomena and potential risks. In this work, we provide an in-depth analysis of the interactions among agents within a simulated hierarchical social environment, drawing inspiration from the Stanford Prison Experiment. Leveraging 2,400 conversations across six LLMs (i.e., LLama3, Orca2, Command-r, Mixtral, Mistral2, and gpt4.1) and 240 experimental scenarios, we analyze persuasion and anti-social behavior between a guard and a prisoner agent with differing objectives. We first document model-specific conversational failures in this multi-agent power dynamic context, thereby narrowing our analytic sample to 1,600 conversations. Among models demonstrating successful interaction, we find that goal setting significantly influences persuasiveness but not anti-social behavior. Moreover, agent personas, especially the guard’s, substantially impact both successful persuasion by the prisoner and the manifestation of anti-social actions. Notably, we observe the emergence of anti-social conduct even in the absence of explicit negative personality prompts. These results have important implications for the development of interactive LLM agents and the ongoing discussion of their societal impact.
[15] Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour
Max Norris, Kobi Gal, Sahan Bulathwela
Main category: cs.CL
TL;DR: NTKT reframes Knowledge Tracing as next-token prediction using LLMs, incorporating question text to improve performance and generalization over state-of-the-art models.
Details
Motivation: Existing KT models overlook question text, missing pedagogical insights and limiting predictive performance, creating an opportunity to leverage LLMs' language understanding capabilities.
Method: Proposes NTKT that represents student histories and question content as text sequences, using pretrained LLMs for next-token prediction to learn behavioral and linguistic patterns.
Result: Significantly improves performance over state-of-the-art neural KT models and demonstrates better generalization to cold-start questions and users.
Conclusion: Question content is crucial for KT, and leveraging LLMs’ pretrained representations enables more effective student learning modeling.
Abstract: Modelling student knowledge is a key challenge when leveraging AI in education, with major implications for personalised learning. The Knowledge Tracing (KT) task aims to predict how students will respond to educational questions in learning environments, based on their prior interactions. Existing KT models typically use response correctness along with metadata like skill tags and timestamps, often overlooking the question text, which is an important source of pedagogical insight. This omission poses a lost opportunity while limiting predictive performance. We propose Next Token Knowledge Tracing (NTKT), a novel approach that reframes KT as a next-token prediction task using pretrained Large Language Models (LLMs). NTKT represents both student histories and question content as sequences of text, allowing LLMs to learn patterns in both behaviour and language. Our series of experiments significantly improves performance over state-of-the-art neural KT models and generalises much better to cold-start questions and users. These findings highlight the importance of question content in KT and demonstrate the benefits of leveraging pretrained representations of LLMs to model student learning more effectively.
[16] CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency
Ehsan Aghazadeh, Ahmad Ghasemi, Hedyeh Beyhaghi, Hossein Pishro-Nik
Main category: cs.CL
TL;DR: CGES is a Bayesian framework that uses confidence signals to adaptively stop sampling when a candidate answer’s posterior probability exceeds a threshold, reducing model calls by ~69% while maintaining accuracy comparable to self-consistency.
Details
Motivation: Self-consistency requires fixed multiple queries and fails when correct answers are rare, needing more efficient and adaptive stopping strategies.
Method: Bayesian framework using confidence signals from token probabilities or reward models to form posteriors over candidate answers and stop sampling adaptively when posterior mass exceeds threshold.
Result: Reduces average model calls by ~69% (e.g., from 16.0 to 4.9) while matching self-consistency accuracy within 0.06 percentage points across five reasoning benchmarks.
Conclusion: CGES provides an efficient alternative to self-consistency with theoretical guarantees and significant computational savings while maintaining comparable performance.
Abstract: Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
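The adaptive stopping rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the posterior update (each sample adds the log-odds of its confidence to its answer, with a placeholder candidate standing in for all unseen answers), the `sample_fn` callback, and the parameters `threshold` and `max_calls` are all hypothetical choices made for the example.

```python
import math
from typing import Callable, Tuple

_OTHER = object()  # hypothetical placeholder for "some answer not sampled yet"


def confidence_guided_stop(
    sample_fn: Callable[[], Tuple[str, float]],
    threshold: float = 0.95,
    max_calls: int = 16,
) -> Tuple[str, int]:
    """Draw (answer, confidence) samples until one answer's posterior mass
    reaches `threshold`, or until `max_calls` samples have been drawn.

    Assumed update rule (not from the paper): each sample adds
    log(c / (1 - c)) to the score of its answer; the posterior is a
    softmax over the accumulated scores, including the placeholder.
    """
    scores = {_OTHER: 0.0}
    best = ""
    for calls in range(1, max_calls + 1):
        answer, conf = sample_fn()
        conf = min(max(conf, 1e-6), 1.0 - 1e-6)  # keep the log-odds finite
        scores[answer] = scores.get(answer, 0.0) + math.log(conf / (1.0 - conf))

        # Normalise the scores into a posterior (shifted by the peak for stability).
        peak = max(scores.values())
        weights = {a: math.exp(s - peak) for a, s in scores.items()}
        total = sum(weights.values())

        # Only real answers can be returned, never the placeholder.
        best = max((a for a in weights if a is not _OTHER), key=weights.get)
        if weights[best] / total >= threshold:
            return best, calls
    return best, max_calls


# Toy usage: two confident, agreeing samples are enough to stop early.
samples = iter([("42", 0.9), ("42", 0.9), ("7", 0.6)])
result = confidence_guided_stop(lambda: next(samples))
```

With a fixed-budget majority vote, all `max_calls` samples would be drawn regardless of agreement; here the loop ends as soon as the accumulated evidence for one answer is strong enough.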
[17] The Realignment Problem: When Right becomes Wrong in LLMs
Aakash Sen Sharma, Debdeep Sanyal, Vivek Srivastava, Shirish Karande, Murari Mandal
Main category: cs.CL
TL;DR: TRACE is a framework for dynamic LLM alignment that enables precise policy updates through programmatic unlearning, addressing the Alignment-Reality Gap without degrading model utility.
Details
Motivation: Current LLM alignment methods produce static, brittle models that cannot adapt to evolving human values and policies, creating an Alignment-Reality Gap that makes long-term deployment unreliable.
Method: TRACE uses programmatic triage of existing preference data against new policies, identifies conflicts via alignment impact scores, and applies hybrid optimization to invert, discard, or preserve preferences while maintaining model performance.
Result: TRACE achieves robust re-alignment across multiple model families (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B) on synthetic benchmarks and PKU-SafeRLHF dataset, enforcing new principles without degrading general capabilities.
Conclusion: TRACE establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.
Abstract: The alignment of Large Language Models (LLMs) with human values is central to their safe deployment, yet current practice produces static, brittle, and costly-to-maintain models that fail to keep pace with evolving norms and policies. This misalignment, which we term the Alignment-Reality Gap, poses a growing challenge for reliable long-term use. Existing remedies are inadequate: large-scale re-annotation is economically prohibitive, and standard unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework for principled unlearning that reconceives re-alignment as a programmatic policy application problem. TRACE programmatically triages existing preference data against a new policy, identifies high-impact conflicts via an alignment impact score, and applies a hybrid optimization that cleanly inverts, discards, or preserves preferences while safeguarding model performance. Empirical results show that TRACE achieves robust re-alignment across diverse model families (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF dataset under complex policy shift, TRACE enforces new principles without degrading general capabilities. Our work establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.
[18] Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and Interpretation
Renfei Dang, Peng Hu, Changjiang Gao, Shujian Huang
Main category: cs.CL
TL;DR: Fine-tuning LLMs with new knowledge causes factual hallucinations, especially when specific knowledge types are entirely unfamiliar. The proposed KnownPatch method mitigates this by adding known knowledge samples during training.
Details
Motivation: Previous studies showed that introducing new knowledge during LLM fine-tuning leads to factual hallucinations, but lacked deep investigation into specific manifestations and underlying mechanisms.
Method: Created controlled dataset Biography-Reasoning, conducted fine-grained analysis across knowledge types and tasks (QA and reasoning), and proposed KnownPatch method that patches known knowledge samples in later training stages.
Result: LLMs exhibit significantly increased hallucination when fine-tuned on datasets where specific knowledge types consist entirely of new knowledge. High unfamiliarity of particular knowledge types drives hallucinations more than overall proportion of new knowledge.
Conclusion: Learning new knowledge reduces model’s attention to key entities, causing excessive focus on surrounding context and increasing hallucination risk. KnownPatch effectively mitigates this disruption and improves performance.
Abstract: Previous studies show that introducing new knowledge during large language model (LLM) fine-tuning can lead to the generation of erroneous output when tested on known information, thereby triggering factual hallucinations. However, existing studies have not deeply investigated the specific manifestations and underlying mechanisms of these hallucinations. Our work addresses this gap by designing a controlled dataset, Biography-Reasoning, and conducting a fine-grained analysis across multiple knowledge types and two task types: knowledge question answering (QA) and knowledge reasoning. We find that when fine-tuned on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit significantly increased hallucination tendencies. This suggests that the high unfamiliarity of a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations, and these tendencies can even affect other knowledge types in QA tasks. To mitigate such factual hallucinations, we propose KnownPatch, which patches a small number of known knowledge samples in the later stages of training, effectively alleviating new-knowledge-induced hallucinations. Through attention analysis, we find that learning new knowledge reduces the model’s attention to key entities in the question, thus causing excessive focus on the surrounding context, which may increase the risk of hallucination. Moreover, the attention pattern can propagate to similar contexts, facilitating the spread of hallucinations to textually similar questions. Our method effectively mitigates the disruption of new knowledge learning to the model’s attention on key entities, accompanied by improved performance.
[19] Tongyi DeepResearch Technical Report
Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Minpeng Liao, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang
Main category: cs.CL
TL;DR: Tongyi DeepResearch is a 30.5B parameter agentic LLM designed for deep information-seeking research tasks, achieving SOTA performance on multiple benchmarks through an automated training framework.
Details
Motivation: To create an autonomous AI system capable of long-horizon, deep information-seeking research tasks without relying on costly human annotation.
Method: End-to-end training framework combining agentic mid-training and post-training, using a fully automatic data synthesis pipeline and customized environments for stable interactions.
Result: Achieved state-of-the-art performance across multiple agentic deep research benchmarks including Humanity’s Last Exam, BrowseComp, WebWalkerQA, and others.
Conclusion: The model, framework, and complete solutions are open-sourced to empower the community in autonomous deep research capabilities.
Abstract: We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity’s Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
[20] Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes
Mohammadsajad Alipour, Mohammad Mohammadi Amiri
Main category: cs.CL
TL;DR: Proposes optimal singular damage method for efficient storage of fine-tuned LLMs by combining low-rank approximation and selective sparsification to retain critical parameters while reducing storage needs.
Details
Motivation: Large language models are too big for most applications, and even fine-tuned versions require excessive storage. Fine-tuning mainly affects a small fraction of parameters, creating opportunity for efficient storage.
Method: Leverages observation that fine-tuning updates are low-rank and sparse. Uses sparsified low-rank approximations with larger ranks, and introduces optimal singular damage to selectively sparsify updates while preserving most impactful singular components.
Result: Significant storage efficiency and superior accuracy within same memory budget compared to using low-rank approximation or sparsification alone.
Conclusion: Combining low-rank approximation with selective sparsification through optimal singular damage provides effective solution for efficient storage of fine-tuned LLMs while maintaining performance.
Abstract: Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. However, even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recently, studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be utilized for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing the low-rank approximation or sparsification individually.
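The comparison the abstract describes (a dense small-rank approximation versus a sparsified larger-rank one under the same memory budget) can be set up roughly as below. This is a sketch under stated assumptions: it uses plain magnitude pruning rather than the paper's optimal singular damage selection, the update matrix is synthetic, the helper names `low_rank_factors` and `keep_top_k` are hypothetical, and the budget counts stored values only, ignoring the index overhead a real sparse format would add.

```python
import numpy as np


def low_rank_factors(delta: np.ndarray, rank: int):
    """Truncated SVD: return A (m x rank) and B (rank x n) so that delta ~ A @ B."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]


def keep_top_k(mat: np.ndarray, k: int) -> np.ndarray:
    """Zero all but the k largest-magnitude entries (ties may keep a few more)."""
    if k >= mat.size:
        return mat.copy()
    flat = np.abs(mat).ravel()
    cutoff = np.partition(flat, mat.size - k)[mat.size - k]
    return np.where(np.abs(mat) >= cutoff, mat, 0.0)


rng = np.random.default_rng(0)
m, n = 64, 48
# Synthetic stand-in for a fine-tuning update with exact rank 8.
delta = rng.standard_normal((m, 8)) @ rng.standard_normal((8, n))

# Budget: the number of values a dense rank-4 factorisation stores.
budget = 4 * (m + n)

# Option 1: dense rank-4 approximation.
a4, b4 = low_rank_factors(delta, 4)
err_dense = np.linalg.norm(delta - a4 @ b4)

# Option 2: rank-8 factors, pruned so both together hold `budget` non-zeros.
a8, b8 = low_rank_factors(delta, 8)
a8s = keep_top_k(a8, budget * m // (m + n))
b8s = keep_top_k(b8, budget * n // (m + n))
err_sparse = np.linalg.norm(delta - a8s @ b8s)

print(f"dense rank-4 error:  {err_dense:.3f}")
print(f"sparse rank-8 error: {err_sparse:.3f}")
```

Which option reconstructs the update better depends on the matrix and on how the pruning is done; the abstract's claim concerns real fine-tuning updates and the paper's own selection rule, neither of which this toy setup reproduces.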
[21] PragExTra: A Multilingual Corpus of Pragmatic Explicitation in Translation
Doreen Osmelak, Koel Dutta Chowdhury, Uliana Sentsova, Cristina España-Bonet, Josef van Genabith
Main category: cs.CL
TL;DR: PragExTra is the first multilingual corpus and detection framework for pragmatic explicitation in translation, covering 8 language pairs with entity descriptions, measurement conversions, and translator remarks.
Details
Motivation: To computationally model pragmatic explicitation - where translators add background details to make implicit cultural meanings explicit for new audiences - which has been widely discussed in translation theory but rarely modeled computationally.
Method: Created PragExTra corpus from TED-Multi and Europarl, identified candidate explicitation cases through null alignments, and refined using active learning with human annotation.
Result: Entity and system-level explicitations are most frequent; active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 across languages.
Conclusion: PragExTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation.
Abstract: Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PragExTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus covers eight language pairs from TED-Multi and Europarl and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidate explicitation cases through null alignments and refine them using active learning with human annotation. Our results show that entity and system-level explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 across languages. PragExTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation. Keywords: translation, multilingualism, explicitation
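The idea of finding candidates through null alignments can be illustrated with a short sketch: target-side tokens that no source token aligns to are collected into contiguous spans. The function name `null_aligned_spans`, the `min_len` filter, and the toy sentence pair are assumptions made for illustration; the abstract does not say which word aligner or filtering rules the corpus pipeline uses.

```python
from typing import List, Sequence, Tuple


def null_aligned_spans(
    target_tokens: Sequence[str],
    alignment: Sequence[Tuple[int, int]],
    min_len: int = 2,
) -> List[str]:
    """Return contiguous runs of target tokens that no source token aligns to.

    `alignment` holds (source_index, target_index) pairs from a word aligner.
    Runs shorter than `min_len` tokens are dropped (a hypothetical noise filter).
    """
    aligned_targets = {t for _, t in alignment}
    spans: List[str] = []
    current: List[str] = []
    for i, token in enumerate(target_tokens):
        if i in aligned_targets:
            if len(current) >= min_len:
                spans.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if len(current) >= min_len:
        spans.append(" ".join(current))
    return spans


# Toy pair: the target adds an entity description missing from the source.
source = ["Angela", "Merkel", "visited", "Paris"]
target = ["Angela", "Merkel", ",", "the", "German", "chancellor", ",", "visited", "Paris"]
alignment = [(0, 0), (1, 1), (2, 7), (3, 8)]

candidates = null_aligned_spans(target, alignment)
```

Spans found this way are only candidates; per the abstract, they are then refined with active learning and human annotation.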
[22] AI Diffusion in Low Resource Language Countries
Amit Misra, Syed Waqas Zamir, Wassim Hamidouche, Inbal Becker-Reshef, Juan Lavista Ferres
Main category: cs.CL
TL;DR: LRLCs have 20% lower AI adoption due to poor LLM performance in low-resource languages, showing linguistic accessibility is a key barrier to equitable AI diffusion.
Details
Motivation: To understand why AI adoption is uneven globally, specifically testing if poor LLM performance in low-resource languages reduces AI utility and slows adoption in LRLCs.
Method: Used weighted regression model to isolate language effect from socioeconomic and demographic factors.
Result: LRLCs have approximately 20% lower share of AI users relative to their baseline after controlling for other factors.
Conclusion: Linguistic accessibility is a significant, independent barrier to equitable AI diffusion globally.
Abstract: Artificial intelligence (AI) is diffusing globally at unprecedented speed, but adoption remains uneven. Frontier Large Language Models (LLMs) are known to perform poorly on low-resource languages due to data scarcity. We hypothesize that this performance deficit reduces the utility of AI, thereby slowing adoption in Low-Resource Language Countries (LRLCs). To test this, we use a weighted regression model to isolate the language effect from socioeconomic and demographic factors, finding that LRLCs have a share of AI users that is approximately 20% lower relative to their baseline. These results indicate that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.
[23] Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning
Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang
Main category: cs.CL
TL;DR: CoRL is a centralized reinforcement learning framework that optimizes multi-LLM systems by selectively coordinating expert models to maximize task performance while minimizing inference costs across different budget conditions.
Details
Motivation: Existing decentralized multi-LLM systems invoke multiple models for every input, leading to uncontrolled high inference costs. There's a need for cost-efficient and cost-controllable coordination of specialized LLMs with complementary strengths.
Method: Proposes CoRL, a reinforcement learning framework with a controller LLM that selectively coordinates a pool of expert models. Formulates coordination as RL with dual objectives: maximizing task performance while minimizing overall inference cost, adaptable to different budget conditions.
Result: Experiments on four benchmarks show CoRL enables a single system to surpass the best expert LLM under high-budget settings while maintaining strong performance in low-budget modes.
Conclusion: Centralized coordination through CoRL provides scalable and cost-efficient multi-agent LLM systems that can adapt to varying budget constraints while maintaining high performance.
Abstract: Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to adapt its behavior to different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance-cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.
[24] Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval
Hung-Ting Chen, Xiang Liu, Shauli Ravfogel, Eunsol Choi
Main category: cs.CL
TL;DR: AMER is a new retriever architecture that generates multiple query vectors autoregressively to capture multimodal document distributions, outperforming single-vector retrievers especially when target documents are dissimilar.
Details
Motivation: Existing retrievers use only one query vector, but the conditional distribution of relevant documents can be multimodal (different query interpretations), which single-vector approaches struggle to capture.
Method: Developed AMER (Autoregressive Multi-Embedding Retriever) that autoregressively generates multiple query vectors, all of which are used to retrieve documents from the corpus.
Result: On synthetic data: 4x better performance than single-embedding models. On real-world datasets: 4% and 21% relative gains over baselines, with larger gains when target document embeddings are less similar.
Conclusion: Multi-query vector retrievers show significant potential, especially for handling multimodal document distributions, opening a new research direction.
Abstract: Most text retrievers generate one query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, Autoregressive Multi-Embedding Retriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than a single-embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER presents 4% and 21% relative gains over single-embedding baselines on the two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of the dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.
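A small sketch can show why several query vectors help when the targets are dissimilar. It covers only the retrieval step: the query vectors are given by hand rather than generated autoregressively, and merging by the maximum score over queries is an assumption, since the abstract does not state how results from different query vectors are combined. The function name `multi_query_retrieve` and the toy vectors are hypothetical.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_query_retrieve(
    query_vecs: Sequence[Sequence[float]],
    doc_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Score each document by its best inner product with any query vector
    and return the ids of the k highest-scoring documents."""
    best = {
        doc_id: max(dot(q, d) for q in query_vecs)
        for doc_id, d in doc_vecs.items()
    }
    return sorted(best, key=best.get, reverse=True)[:k]


# Two relevant documents pointing in different directions, plus a distractor
# that lies between them.
docs = {
    "target_a": [1.0, 0.0],
    "target_b": [0.0, 1.0],
    "distractor": [0.6, 0.6],
}

# A single "compromise" query vector ranks the distractor first.
single = multi_query_retrieve([[0.7, 0.7]], docs, k=1)

# One query vector per interpretation recovers both targets.
multi = multi_query_retrieve([[1.0, 0.0], [0.0, 1.0]], docs, k=2)
```

In this toy setup the single vector scores the distractor at 0.84 and each target at 0.7, while the two-vector query scores each target at 1.0 and the distractor at 0.6.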
[25] MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, Xianpei Han
Main category: cs.CL
TL;DR: MemSearcher is a search agent workflow that maintains compact memory across turns, combining current queries with memory to stabilize context length and improve efficiency without sacrificing accuracy, achieving significant performance gains over baselines.
Details
Motivation: Traditional search agents face a trade-off between preserving full interaction history (high computational cost) vs using only current turn (loss of essential information), limiting scalability.
Method: Proposes MemSearcher with iterative memory maintenance and multi-context GRPO RL framework that jointly optimizes reasoning, search strategies, and memory management by propagating trajectory-level advantages across conversations.
Result: Achieves +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct relative average gains across seven benchmarks. 3B-based MemSearcher outperforms 7B-based baselines.
Conclusion: Striking balance between information integrity and efficiency yields both higher accuracy and lower computational overhead, demonstrating the effectiveness of the memory management approach.
Abstract: Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user’s question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes reasoning, search strategies, and memory management of MemSearcher Agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
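The turn loop described above, in which the model sees only the question, a compact memory, and the latest observation, might look roughly like this. It is a sketch of the workflow only, with no reinforcement learning: `step_fn` stands in for the LLM call, `search_fn` for the search tool, and the dictionary keys `memory`, `query`, and `answer` are hypothetical names, not the released code's interface.

```python
from typing import Callable, Dict, List, Optional


def run_memory_agent(
    question: str,
    step_fn: Callable[[str, str, str], Dict[str, str]],
    search_fn: Callable[[str], str],
    max_turns: int = 5,
) -> Dict[str, object]:
    """Run a search loop whose context is (question, memory, last observation).

    `step_fn` returns the rewritten memory under "memory" plus either a
    search "query" or a final "answer". The full history is never replayed,
    so the context size depends on the memory, not on the number of turns.
    """
    memory, observation = "", ""
    context_sizes: List[int] = []
    answer: Optional[str] = None
    for _ in range(max_turns):
        context_sizes.append(len(question) + len(memory) + len(observation))
        out = step_fn(question, memory, observation)
        memory = out["memory"]  # keep only what the model chose to retain
        if "answer" in out:
            answer = out["answer"]
            break
        observation = search_fn(out["query"])
    return {"answer": answer, "memory": memory, "context_sizes": context_sizes}


# Toy stand-ins for the model and the search tool.
def toy_step(question: str, memory: str, observation: str) -> Dict[str, str]:
    if "Paris" in observation:
        return {"memory": "capital of France = Paris", "answer": "Paris"}
    return {"memory": memory, "query": "capital of France"}


def toy_search(query: str) -> str:
    return "Paris is the capital of France."


result = run_memory_agent("What is the capital of France?", toy_step, toy_search)
```

A history-concatenating agent would instead append every observation to the context, so its input would grow with each turn.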
[26] Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley
Main category: cs.CL
TL;DR: Oolong is a new benchmark for evaluating long-context reasoning that requires analyzing individual text chunks and aggregating analyses to answer distributional questions, unlike existing benchmarks that mainly test retrieval.
Details
Motivation: Current long-context evaluations primarily test retrieval from context sections, allowing models to disregard most context tokens as noise, which doesn't capture the full range of long-context reasoning tasks.
Method: Oolong consists of two task sets: Oolong-synth (naturalistic synthetic tasks for ablation studies) and Oolong-real (real-world conversational data reasoning). It requires models to analyze chunks, aggregate analyses, perform classification/counting in-context, and reason over temporal/user relations.
Result: Even frontier models like GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro struggle, achieving less than 50% accuracy on both splits at 128K context length.
Conclusion: Oolong reveals significant limitations in current models’ long-context reasoning capabilities and provides a benchmark to drive development of models that can effectively reason over large text quantities.
Abstract: As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
[27] How Teachers Can Use Large Language Models and Bloom’s Taxonomy to Create Educational Quizzes
Sabina Elkins, Ekaterina Kochmar, Jackie C. K. Cheung, Iulian Serban
Main category: cs.CL
TL;DR: This paper presents a large language model-based question generation approach using Bloom’s taxonomy learning goals, showing teachers prefer and effectively use automatically generated questions in quizzes without quality loss.
Details
Motivation: Question generation has great potential in education but needs validation with real teachers and students. Current approaches lack pedagogical input from actual educators.
Method: Used large language model-based QG with Bloom’s taxonomy learning goals, conducted experiments assessing teacher usage of automatically generated questions in practice.
Result: Teachers prefer writing quizzes with automatically generated questions, with no quality loss compared to handwritten versions. Some metrics show quality improvement.
Conclusion: Automatically generated questions show promise for large-scale classroom use, demonstrating potential to enhance educational quiz quality.
Abstract: Question generation (QG) is a natural language processing task with an abundance of potential benefits and use cases in the educational domain. In order for this potential to be realized, QG systems must be designed and validated with pedagogical needs in mind. However, little research has assessed or designed QG approaches with the input from real teachers or students. This paper applies a large language model-based QG approach where questions are generated with learning goals derived from Bloom’s taxonomy. The automatically generated questions are used in multiple experiments designed to assess how teachers use them in practice. The results demonstrate that teachers prefer to write quizzes with automatically generated questions, and that such quizzes have no loss in quality compared to handwritten versions. Further, several metrics indicate that automatically generated questions can even improve the quality of the quizzes created, showing the promise for large scale use of QG in the classroom setting.
[28] Path-Consistency with Prefix Enhancement for Efficient Inference in LLMs
Jiace Zhu, Yuanzhe Huang, Yingtao Shen, Jie Zhao, An Zou
Main category: cs.CL
TL;DR: Path-consistency improves LLM reasoning by using early answer confidence to guide subsequent generation branches, reducing computational costs while maintaining accuracy.
Details
Motivation: To address the computational expense and time consumption of current self-consistency methods that require numerous samplings with majority voting.
Method: Leverages confidence of earlier-generated answers to identify the most promising prefix and dynamically guide generation of subsequent branches, reducing errors and redundancies from random sampling.
Result: Improves inference latency by up to 40.5% while maintaining task accuracy across mathematical reasoning, commonsense reasoning, and symbolic reasoning tasks.
Conclusion: Path-consistency effectively accelerates LLM inference by minimizing token consumption through guided generation, achieving significant speed improvements without sacrificing accuracy.
Abstract: To enhance the reasoning capabilities of large language models (LLMs), self-consistency has become a popular approach, combining multiple samplings with majority voting. However, current methods are computationally expensive and time-consuming due to the need for numerous samplings. To address this, this paper introduces path-consistency, which leverages the confidence of earlier-generated answers to identify the most promising prefix and guide the generation of subsequent branches. By dynamically guiding the generation of subsequent branches based on this prefix, path-consistency mitigates both the errors and redundancies from random or less useful sampling in self-consistency. This approach reduces errors and redundancies from random sampling, significantly accelerating inference by minimizing token consumption. Our extensive empirical results demonstrate that path-consistency improves inference latency by up to 40.5%, while maintaining task accuracy across various tasks, including mathematical reasoning, commonsense reasoning, and symbolic reasoning.
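The idea of reusing a promising prefix can be sketched roughly as follows. This is a loose illustration under stated assumptions, not the authors' algorithm: the staging scheme, the rule for growing the prefix by one step per stage, and the names `path_consistency` and `sample_path` are all hypothetical. It shows only why continuing from a shared prefix generates fewer new steps than sampling every path from scratch.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical sampler: given a prefix of reasoning steps, return the full
# path (prefix + continuation) and the final answer it reaches.
Sampler = Callable[[List[str]], Tuple[List[str], str]]


def path_consistency(
    sample_path: Sampler,
    n_stages: int = 3,
    branches_per_stage: int = 4,
) -> Tuple[str, int]:
    """Return (majority answer, number of newly generated steps)."""
    prefix: List[str] = []
    votes: Counter = Counter()
    new_steps = 0
    for stage in range(n_stages):
        stage_paths = []
        for _ in range(branches_per_stage):
            steps, answer = sample_path(prefix)
            new_steps += len(steps) - len(prefix)  # the prefix is reused, not regenerated
            votes[answer] += 1
            stage_paths.append((steps, answer))
        # Treat the current majority answer as the most confident one.
        best_answer, _ = votes.most_common(1)[0]
        supporting = [s for s, a in stage_paths if a == best_answer]
        if supporting:
            # Assumed rule: extend the shared prefix by one step per stage.
            prefix = supporting[0][: stage + 1]
    return votes.most_common(1)[0][0], new_steps


def toy_sampler(prefix: List[str]) -> Tuple[List[str], str]:
    """Deterministic toy: every path has 3 steps and reaches the answer '42'."""
    steps = list(prefix) + ["step"] * (3 - len(prefix))
    return steps, "42"


answer, generated = path_consistency(toy_sampler)
```

In this toy run, 12 paths of 3 steps would cost 36 generated steps under plain self-consistency, while prefix reuse generates 12 + 8 + 4 = 24.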
[29] On Extending Direct Preference Optimization to Accommodate Ties
Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, Bill Byrne
Main category: cs.CL
TL;DR: The paper introduces two DPO variants that explicitly model ties in pairwise comparisons, showing improved performance over standard DPO in translation and mathematical reasoning tasks.
Details
Motivation: Current DPO methods discard tied pairs, but explicitly modeling ties could provide better regularization and performance without degradation.
Method: Replace Bradley-Terry model in DPO with Rao-Kupper and Davidson extensions that assign probability to ties, and test on neural machine translation and summarization tasks.
Result: Tie-inclusive DPO variants show stronger regularization (lower KL divergence) and performance improvements over standard DPO, without the degradation seen when tied pairs are given to original DPO.
Conclusion: Explicitly modeling ties in preference optimization is beneficial and should be preferred over discarding tied pairs as in current practice.
Abstract: We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. We provide a theoretical explanation for this regularization effect using ideal DPO policy theory. We further show performance improvements over DPO in translation and mathematical reasoning using our DPO variants. We find it can be beneficial to include ties in preference optimization rather than simply discard them, as is done in common practice.
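For readers unfamiliar with the two tie models named here, the sketch below computes win, tie, and loss probabilities in their classical forms, given positive strengths `a` and `b` for the two items. How the paper maps DPO's implicit rewards to these strengths and parameterizes the tie terms is not stated in the abstract, so the function names and the parameters `theta` and `nu` follow the textbook formulations and are assumptions here.

```python
import math
from typing import Tuple


def bradley_terry(a: float, b: float) -> Tuple[float, float, float]:
    """(P(a wins), P(tie), P(b wins)); Bradley-Terry assigns no mass to ties."""
    return a / (a + b), 0.0, b / (a + b)


def rao_kupper(a: float, b: float, theta: float) -> Tuple[float, float, float]:
    """Rao-Kupper model; theta >= 1, and theta == 1 recovers Bradley-Terry."""
    win = a / (a + theta * b)
    loss = b / (b + theta * a)
    tie = (theta ** 2 - 1) * a * b / ((a + theta * b) * (b + theta * a))
    return win, tie, loss


def davidson(a: float, b: float, nu: float) -> Tuple[float, float, float]:
    """Davidson model; nu >= 0, and nu == 0 recovers Bradley-Terry."""
    geometric_mean = math.sqrt(a * b)
    denom = a + b + nu * geometric_mean
    return a / denom, nu * geometric_mean / denom, b / denom


rk = rao_kupper(2.0, 1.0, theta=1.5)
dv = davidson(2.0, 1.0, nu=0.5)
```

In both models the three probabilities sum to one, and the tie probability is largest when the two strengths are equal, which matches the intuition that ties are most likely between evenly matched candidates.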
[30] Scaffolded Language Models with Language Supervision for Mixed-Autonomy: A Survey
Matthieu Lin, Jenny Sheng, Andrew Zhao, Shenzhi Wang, Yang Yue, Victor Shea Jay Huang, Huan Liu, Jun Liu, Gao Huang, Yong-Jin Liu
Main category: cs.CL
TL;DR: Survey on scaffolded LMs - semi-parametric models combining pre-trained LMs with non-parametric components (prompts, tools, code) that learn from language supervision and real-time feedback.
Details
Motivation: To organize literature on emerging LM structures that integrate tools into multi-step processes and enable learning from language-based supervision rather than traditional loss functions.
Method: View scaffolded LMs as semi-parametric models where non-parametric variables (prompts, tools, scaffold code) are trained using LMs as optimizers to interpret language supervision and update according to complex objectives.
Result: Language-based optimization enables rich, interpretable objectives while mitigating catastrophic forgetting and supporting closed-source model compatibility. Agents can continuously improve from real-time language feedback in mixed-autonomy settings.
Conclusion: Scaffolded LMs represent a paradigm shift where language supervision enables training of non-parametric components, supporting continuous improvement from human feedback in real-world applications like Copilot and mixed-autonomy environments.
Abstract: This survey organizes the intricate literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools. We view scaffolded LMs as semi-parametric models wherein we train non-parametric variables, including the prompt, tools, and scaffold’s code. In particular, they interpret instructions, use tools, and receive feedback all in language. Recent works use an LM as an optimizer to interpret language supervision and update non-parametric variables according to intricate objectives. In this survey, we refer to this paradigm as training of scaffolded LMs with language supervision. A key feature of non-parametric training is the ability to learn from language. Parametric training excels in learning from demonstration (supervised learning), exploration (reinforcement learning), or observations (unsupervised learning), using well-defined loss functions. Language-based optimization enables rich, interpretable, and expressive objectives, while mitigating issues like catastrophic forgetting and supporting compatibility with closed-source models. Furthermore, agents are increasingly deployed as co-workers in real-world applications such as Copilot in Office tools or software development. In these mixed-autonomy settings, where control and decision-making are shared between human and AI, users point out errors or suggest corrections. Accordingly, we discuss agents that continuously improve by learning from this real-time, language-based feedback and refer to this setting as streaming learning from language supervision.
[31] ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
Kimihiro Hasegawa, Wiradee Imrattanatrai, Zhi-Qi Cheng, Masaki Asada, Susan Holm, Yuran Wang, Ken Fukuda, Teruko Mitamura
Main category: cs.CL
TL;DR: ProMQA is a novel multimodal procedural QA dataset with 401 QA pairs on cooking activities, created using human-LLM collaboration, revealing significant performance gaps between current systems and humans.
Details
Motivation: Current multimodal systems are typically evaluated on traditional classification tasks rather than application-oriented scenarios like procedural activities where people follow instructions to achieve goals.
Method: Created ProMQA dataset with 401 multimodal procedural QA pairs on cooking activities using cost-effective human-LLM collaborative annotation approach, where LLM-generated QA pairs are verified by humans.
Result: Benchmark results show significant performance gap between human performance and current systems, including competitive proprietary multimodal models.
Conclusion: ProMQA sheds light on new aspects of models’ multimodal understanding capabilities and highlights the need for better evaluation in application-oriented procedural scenarios.
Abstract: Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recordings of procedural activities, i.e., cooking, coupled with their corresponding instructions/recipes. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide the benchmark results to set the baseline performance on ProMQA. Our experiment reveals a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models’ multimodal understanding capabilities.
[32] Composing or Not Composing? Towards Distributional Construction Grammars
Philippe Blache, Emmanuele Chersoni, Giulia Rambelli, Alessandro Lenci
Main category: cs.CL
TL;DR: The paper proposes Distributional Construction Grammars, integrating distributional semantics into construction grammar to account for both compositional and non-compositional language comprehension mechanisms.
Details
Motivation: Traditional incremental, compositional approaches to language comprehension cannot fully explain non-compositional phenomena, requiring a unified framework that incorporates both mechanisms.
Method: Extends Sign-Based Construction Grammars with formal definitions, introduces meaning representation through construction-frame-event interactions, and incorporates distributional semantics via activation-based processing using similarity and unification.
Result: Developed a comprehensive framework called Distributional Construction Grammars that bridges compositional and non-compositional language processing approaches.
Conclusion: The proposed framework successfully integrates distributional semantics into construction grammar, enabling a unified processing mechanism for language comprehension that accounts for both compositional and non-compositional phenomena.
Abstract: The mechanisms of comprehension during language processing remain an open question. Classically, building the meaning of a linguistic utterance is said to be incremental, step-by-step, based on a compositional process. However, many different works have shown for a long time that non-compositional phenomena are also at work. It is therefore necessary to propose a framework bringing together both approaches. We present in this paper an approach based on Construction Grammars and completing this framework in order to account for these different mechanisms. We propose first a formal definition of this framework by completing the feature structure representation proposed in Sign-Based Construction Grammars. In a second step, we present a general representation of the meaning based on the interaction of constructions, frames and events. This framework opens the door to a processing mechanism for building the meaning based on the notion of activation evaluated in terms of similarity and unification. This new approach integrates features from distributional semantics into the constructionist framework, leading to what we call Distributional Construction Grammars.
[33] The exponential distribution of the order of demonstrative, numeral, adjective and noun
Ramon Ferrer-i-Cancho
Main category: cs.CL
TL;DR: The paper finds that noun phrase word order frequencies follow an exponential distribution rather than a power law, challenging the inevitability of power-law distributions like Zipf’s law.
Details
Motivation: To investigate the actual distribution of 24 possible noun phrase word orders and resolve the debate between exponential vs power law distributions in linguistic patterns.
Method: Analyzed the distribution of 24 possible orders for noun phrases (demonstrative, numeral, adjective, noun) and compared exponential vs power law models, including two types of exponential distributions.
Result: Exponential distribution is a much better fit than power law. Among exponential models, the one where all 24 orders have non-zero probability (truncated geometric distribution) shows higher support when considering consistency and generalizability.
Conclusion: The findings strongly suggest that there is no hard constraint on word order variation and that unattested orders result from undersampling, consistent with Cysouw’s view.
Abstract: The frequency of the preferred order for a noun phrase formed by demonstrative, numeral, adjective and noun has received significant attention over the last two decades. We investigate the actual distribution of the 24 possible orders. There is no consensus on whether it is well-fitted by an exponential or a power law distribution. We find that an exponential distribution is a much better model. This finding and other circumstances where an exponential-like distribution is found challenge the view that power-law distributions, e.g., Zipf’s law for word frequencies, are inevitable. We also investigate which of two exponential distributions gives a better fit: an exponential model where the 24 orders have non-zero probability (a geometric distribution truncated at rank 24) or an exponential model where the number of orders that can have non-zero probability is variable (a right-truncated geometric distribution). When consistency and generalizability are prioritized, we find higher support for the exponential model where all 24 orders have non-zero probability. These findings strongly suggest that there is no hard constraint on word order variation and that unattested orders merely result from undersampling, consistent with Cysouw’s view.
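The two exponential models contrasted in this abstract can be written down concretely. The sketch below uses the standard right-truncated geometric form, p(r) = q(1-q)^(r-1) / (1 - (1-q)^R) for ranks r = 1..R, with R fixed at 24 in the first model and free (R <= 24) in the second. The function name, the parameter names, and the value of `q` are illustrative assumptions, not estimates from the paper.

```python
from typing import List


def truncated_geometric(q: float, support: int, n_orders: int = 24) -> List[float]:
    """Probability of each rank 1..n_orders under a geometric distribution
    right-truncated at `support`; ranks beyond `support` get probability 0."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if not 1 <= support <= n_orders:
        raise ValueError("support must lie between 1 and n_orders")
    norm = 1.0 - (1.0 - q) ** support  # total mass of ranks 1..support before rescaling
    return [
        q * (1.0 - q) ** (r - 1) / norm if r <= support else 0.0
        for r in range(1, n_orders + 1)
    ]


# Model 1: all 24 orders can occur. Model 2: only the first 18 ranks can occur.
all_orders = truncated_geometric(q=0.2, support=24)
some_orders = truncated_geometric(q=0.2, support=18)
```

Under either model the ratio between consecutive ranks is the constant 1 - q, so log-probability falls linearly with rank. A power law would instead fall linearly with the logarithm of rank, which is what makes the two families distinguishable on rank-frequency data.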
[34] Readability Formulas, Systems and LLMs are Poor Predictors of Reading Ease
Keren Gruteke Klein, Shachar Frenkel, Omer Shubi, Yevgeni Berzak
Main category: cs.CL
TL;DR: Existing readability scoring methods are poor predictors of real-time reading ease measured through eye tracking, across different reader groups and text lengths.
Details
Motivation: To evaluate readability scoring methods based on real-time reading ease (using eye tracking) rather than traditional offline measures like comprehension tests and readability ratings.
Method: Introduced an evaluation framework that quantifies readability methods’ ability to predict reading ease while controlling for content variation. Applied this to traditional formulas, ML systems, LLMs, and commercial systems.
Result: All tested methods were poor predictors of reading ease, often outperformed by simple word properties from psycholinguistics. Results held across native/non-native speakers and different text lengths.
Conclusion: Existing readability approaches have fundamental limitations, highlighting the need for new cognitively-driven methods that better account for reading ease, and demonstrating the utility of psycholinguistics for readability research.
Abstract: Methods for scoring text readability have been studied for over a century, and are widely used in research and in user-facing applications in many domains. Thus far, the development and evaluation of such methods have primarily relied on two types of offline behavioral data, performance on reading comprehension tests and ratings of text readability levels. In this work, we instead focus on a fundamental and understudied aspect of readability, real-time reading ease, captured with online reading measures using eye tracking. We introduce an evaluation framework for readability scoring methods which quantifies their ability to account for reading ease, while controlling for content variation across texts. Applying this evaluation to prominent traditional readability formulas, modern machine learning systems, frontier Large Language Models and commercial systems used in education, suggests that they are all poor predictors of reading ease in English. This outcome holds across native and non-native speakers, reading regimes, and textual units of different lengths. The evaluation further reveals that existing methods are often outperformed by word properties commonly used in psycholinguistics for prediction of reading times. Our results highlight a fundamental limitation of existing approaches to readability scoring, the utility of psycholinguistics for readability research, and the need for new, cognitively driven readability scoring approaches that can better account for reading ease.
[35] ExpertLens: Activation steering features are highly interpretable
Masha Fedzechkina, Eleonora Gualdoni, Sinead Williamson, Katherine Metcalf, Skyler Seto, Barry-John Theobald
Main category: cs.CL
TL;DR: ExpertLens method analyzes neurons in LLMs to interpret concept representations, showing alignment with human understanding and outperforming traditional embeddings.
Details
Motivation: To determine if features discovered by activation steering methods in LLMs are interpretable and to provide insights into model representations.
Method: Used “finding experts” method from activation steering research to identify neurons responsible for specific concepts, then analyzed these neurons through ExpertLens approach.
Result: ExpertLens representations are stable across models/datasets, align closely with human representations (matching inter-human alignment), and significantly outperform word/sentence embeddings in capturing concept organization.
Conclusion: ExpertLens is a flexible, lightweight approach for capturing and analyzing model representations, enabling granular view of LLM concept organization.
Abstract: Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering methods are interpretable. We identify neurons responsible for specific concepts (e.g., “cat”) using the “finding experts” method from research on activation steering and show that ExpertLens, i.e., inspection of these neurons, provides insights about model representation. We find that ExpertLens representations are stable across models and datasets and closely align with human representations inferred from behavioral data, matching inter-human alignment levels. ExpertLens significantly outperforms the alignment captured by word/sentence embeddings. By reconstructing human concept organization through ExpertLens, we show that it enables a granular view of LLM concept representation. Our findings suggest that ExpertLens is a flexible and lightweight approach for capturing and analyzing model representations.
[36] Mixture of Routers
Jia-Chen Zhang, Yu-Jie Xiong, Xi-He Qiu, Chun-Ming Xia, Fei Dai, Zheng Zhou
Main category: cs.CL
TL;DR: Proposes Mixture of Routers (MoR), a parameter-efficient fine-tuning method that enhances LoRA by using multiple sub-routers with a learnable main router to improve expert selection in Mixture-of-Experts architectures.
Details
Motivation: To address limitations in LoRA's performance improvement and issues in MoE routing mechanisms like incorrect assignments and imbalanced expert allocation, inspired by Redundancy and Fault Tolerance Theory.
Method: Integrates Mixture of Experts into routing mechanism using multiple sub-routers for joint selection and a learnable main router to determine sub-router weights, creating a plug-and-play fine-tuning approach.
Result: Outperforms baseline models on most tasks with average 1% performance improvement, demonstrating effectiveness as parameter-efficient fine-tuning method.
Conclusion: MoR serves as an efficient plug-and-play fine-tuning method suitable for wide range of applications, addressing routing issues in MoE while maintaining parameter efficiency.
Abstract: Supervised fine-tuning (SFT) is a milestone in aligning large language models with human instructions and adapting them to downstream tasks. In particular, Low-Rank Adaptation (LoRA) has gained widespread attention due to its parameter efficiency. However, its impact on improving the performance of large models remains limited. Recent studies suggest that combining LoRA with Mixture-of-Experts (MoE) can significantly enhance fine-tuning performance. MoE adapts to the diversity and complexity of datasets by dynamically selecting the most suitable experts, thereby improving task accuracy and efficiency. Despite impressive results, recent studies reveal issues in the MoE routing mechanism, such as incorrect assignments and imbalanced expert allocation. Inspired by the principles of Redundancy and Fault Tolerance Theory, we innovatively integrate the concept of Mixture of Experts into the routing mechanism and propose an efficient fine-tuning method called Mixture of Routers (MoR). It employs multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers. The results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%. MoR can serve as a plug-and-play, parameter-efficient fine-tuning method suitable for a wide range of applications. Our code is available here: https://anonymous.4open.science/r/MoR-DFC6.
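The routing scheme described here (several sub-routers voting on experts, with a main router weighting the sub-routers) might look like the following minimal sketch. The abstract does not specify how sub-router outputs are combined, so the weighted average of softmax distributions, the top-k selection, and all names (`mixture_of_routers`, `sub_routers`, `main_router`) are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def matvec(weights: Matrix, x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mixture_of_routers(
    x: List[float],
    sub_routers: List[Matrix],   # each maps the input to one logit per expert
    main_router: Matrix,         # maps the input to one logit per sub-router
    top_k: int = 2,
) -> Tuple[List[float], List[float], List[int]]:
    """Return (combined expert scores, sub-router weights, top-k expert ids)."""
    sub_probs = [softmax(matvec(w, x)) for w in sub_routers]
    router_weights = softmax(matvec(main_router, x))
    n_experts = len(sub_probs[0])
    # Assumed combination rule: weighted average of the sub-routers' distributions.
    combined = [
        sum(rw * probs[e] for rw, probs in zip(router_weights, sub_probs))
        for e in range(n_experts)
    ]
    top = sorted(range(n_experts), key=lambda e: combined[e], reverse=True)[:top_k]
    return combined, router_weights, top


# Toy example: 2-dimensional input, 3 experts, 2 sub-routers.
x = [1.0, 0.0]
sub_routers = [
    [[5.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[4.0, 0.0], [1.0, 0.0], [0.0, 0.0]],
]
main_router = [[0.5, 0.0], [0.2, 0.0]]
scores, weights, chosen = mixture_of_routers(x, sub_routers, main_router)
```

The redundancy argument is visible in the toy: a single sub-router that misranks an expert is outvoted as long as the others, weighted by the main router, agree on a different ranking.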
[37] TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers’ Guidance
Jingxian Xu, Mengyu Zhou, Weichang Liu, Hanbing Liu, Shi Han, Dongmei Zhang
Main category: cs.CL
TL;DR: TwT reduces LLM inference costs by distilling explicit reasoning into habitual behavior using multi-teacher guidance, achieving up to a 13.6% accuracy improvement with fewer output tokens than other distillation methods.
Details
Motivation: LLMs' enhanced reasoning capability increases output tokens during inference, leading to higher computational costs that need to be addressed for efficient deployment.
Method: Habitual Reasoning Distillation with multi-teacher guidance and Dual-Criteria Rejection Sampling (DCRS) to generate high-quality distillation datasets for unsupervised scenarios.
Result: TwT effectively reduces inference costs while maintaining high performance, achieving up to 13.6% improvement in accuracy with fewer output tokens compared to other distillation methods.
Conclusion: TwT offers a practical solution for efficient LLM deployment by internalizing reasoning processes into habitual behavior without sacrificing performance.
Abstract: Large Language Models (LLMs) have made significant strides in problem-solving by incorporating reasoning processes. However, this enhanced reasoning capability results in an increased number of output tokens during inference, leading to higher computational costs. To address this challenge, we propose TwT (Thinking without Tokens), a method that reduces inference-time costs through habitual reasoning distillation with multi-teachers’ guidance, while maintaining high performance. Our approach introduces a Habitual Reasoning Distillation method, which internalizes explicit reasoning into the model’s habitual behavior through a Teacher-Guided compression strategy inspired by human cognition. Additionally, we propose Dual-Criteria Rejection Sampling (DCRS), a technique that generates a high-quality and diverse distillation dataset using multiple teacher models, making our method suitable for unsupervised scenarios. Experimental results demonstrate that TwT effectively reduces inference costs while preserving superior performance, achieving up to a 13.6% improvement in accuracy with fewer output tokens compared to other distillation methods, offering a highly practical solution for efficient LLM deployment.
[38] Repetitions are not all alike: distinct mechanisms sustain repetition in language models
Matéo Mahaut, Francesca Franzon
Main category: cs.CL
TL;DR: LLMs exhibit repetitive loops through two distinct mechanisms: ICL-induced repetition develops specialized attention networks during training, while natural repetition emerges early as a fallback when context retrieval fails.
Details
Motivation: To understand why LLMs frequently generate repetitive text, which is rare in human language, and whether similar repetition patterns stem from different underlying mechanisms.
Method: Contrast analysis of repetitions from natural text prompts vs. ICL setups requiring copying behavior, examining attention head specialization and training progression.
Result: ICL-induced repetition develops specialized attention networks progressively during training, while natural repetition emerges early without defined circuitry and focuses on low-information tokens.
Conclusion: Superficially similar repetition behaviors originate from qualitatively different internal processes, reflecting distinct failure modes and adaptations in language models.
Abstract: Large Language Models (LLMs) can sometimes degrade into repetitive loops, persistently generating identical word sequences. Because repetition is rare in natural human language, its frequent occurrence across diverse tasks and contexts in LLMs remains puzzling. Here we investigate whether behaviorally similar repetition patterns arise from distinct underlying mechanisms and how these mechanisms develop during model training. We contrast two conditions: repetitions elicited by natural text prompts with those induced by in-context learning (ICL) setups that explicitly require copying behavior. Our analyses reveal that ICL-induced repetition relies on a dedicated network of attention heads that progressively specialize over training, whereas naturally occurring repetition emerges early and lacks a defined circuitry. Attention inspection further shows that natural repetition focuses disproportionately on low-information tokens, suggesting a fallback behavior when relevant context cannot be retrieved. These results indicate that superficially similar repetition behaviors originate from qualitatively different internal processes, reflecting distinct modes of failure and adaptation in language models.
[39] Identifying Aspects in Peer Reviews
Sheng Lu, Ilia Kuznetsov, Iryna Gurevych
Main category: cs.CL
TL;DR: This paper proposes a data-driven approach to formalize and identify review aspects in peer review, addressing the gap in aspect formalization and enabling computational support for the reviewing process.
Details
Motivation: The growing volume of academic submissions is straining peer review, motivating computational support. While reviewers assess papers according to certain aspects (like Novelty), the concept of aspect remains poorly formalized, with data-driven methods underexplored.
Method: The authors take a bottom-up approach: propose an operational definition of aspect, develop a data-driven schema for deriving aspects from a peer review corpus, and introduce a dataset of peer reviews augmented with aspects.
Result: The work demonstrates how the derived aspects can be used for community-level review analysis and shows how aspect choice impacts downstream applications like LLM-generated review detection.
Conclusion: The results establish a foundation for principled, data-driven investigation of review aspects and enable new NLP applications to support peer review.
Abstract: Peer review is central to academic publishing, but the growing volume of submissions is straining the process. This motivates the development of computational approaches to support peer review. While each review is tailored to a specific paper, reviewers often make assessments according to certain aspects such as Novelty, which reflect the values of the research community. This alignment creates opportunities for standardizing the reviewing process, improving quality control, and enabling computational support. While prior work has demonstrated the potential of aspect analysis for peer review assistance, the notion of aspect remains poorly formalized. Existing approaches often derive aspects from review forms and guidelines, yet data-driven methods for aspect identification are underexplored. To address this gap, our work takes a bottom-up approach: we propose an operational definition of aspect and develop a data-driven schema for deriving aspects from a corpus of peer reviews. We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis. We further show how the choice of aspects can impact downstream applications, such as LLM-generated review detection. Our results lay a foundation for a principled and data-driven investigation of review aspects, and pave the path for new applications of NLP to support peer review.
[40] Rethinking the Relationship between the Power Law and Hierarchical Structures
Kai Nakaishi, Ryo Yoshida, Kohei Kajikawa, Koji Hukushima, Yohei Oseki
Main category: cs.CL
TL;DR: Statistical analysis challenges the interpretation that power-law decay in language corpora indicates hierarchical structures in syntax, showing that assumptions don’t hold for natural language parse trees.
Details
Motivation: To empirically test whether power-law correlations in language corpora actually reflect hierarchical structures in syntax, as commonly assumed but never properly verified.
Method: Analyzed English and Japanese corpora by examining mutual information, deviations from probabilistic context-free grammars (PCFGs), and other properties in natural language parse trees and their PCFG approximations.
Result: The assumptions about power-law decay indicating hierarchical structures do not hold for syntactic structures, making it difficult to apply this argument to child speech and animal signals.
Conclusion: The relationship between power law and hierarchical structures needs to be reconsidered, as empirical evidence contradicts the commonly accepted interpretation.
Abstract: Statistical analysis of corpora provides an approach to quantitatively investigate natural languages. This approach has revealed that several power laws consistently emerge across different corpora and languages, suggesting universal mechanisms underlying languages. Particularly, the power-law decay of correlation has been interpreted as evidence for underlying hierarchical structures in syntax, semantics, and discourse. This perspective has also been extended to child speech and animal signals. However, the argument supporting this interpretation has not been empirically tested in natural languages. To address this problem, the present study examines the validity of the argument for syntactic structures. Specifically, we test whether the statistical properties of parse trees align with the assumptions in the argument. Using English and Japanese corpora, we analyze the mutual information, deviations from probabilistic context-free grammars (PCFGs), and other properties in natural language parse trees, as well as in the PCFG that approximates these parse trees. Our results indicate that the assumptions do not hold for syntactic structures and that it is difficult to apply the proposed argument to child speech and animal signals, highlighting the need to reconsider the relationship between the power law and hierarchical structures.
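The correlation decay discussed above is usually measured as the mutual information between symbols a fixed distance apart. A minimal sketch of a plug-in estimator over a flat symbol sequence follows. It is not the authors' code: the function name `mutual_information` is hypothetical, and the paper works on parse trees and PCFG approximations rather than on raw sequences like this one.

```python
import math
from collections import Counter


def mutual_information(seq, d):
    """Plug-in estimate, in bits, of I(X_t; X_{t+d}) from one symbol sequence.

    Illustrative only: counts co-occurrences of symbols that are d positions
    apart and compares the joint frequencies with the product of the marginals.
    """
    pairs = list(zip(seq, seq[d:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        ratio = count * n / (left[a] * right[b])
        mi += p_ab * math.log2(ratio)
    return mi


# Estimate the decay curve on a toy sequence for a few distances.
toy = "ab" * 50
curve = {d: mutual_information(toy, d) for d in (1, 2, 3)}
```

Plotting such a curve on log-log axes is the usual way to check for power-law decay. The toy sequence here is perfectly periodic, so its curve stays flat near 1 bit.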
[41] Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
Liangliang Zhang, Zhuorui Jiang, Hongliang Chi, Haoyang Chen, Mohammed Elkoumy, Fali Wang, Qiong Wu, Zhengyi Zhou, Shirui Pan, Suhang Wang, Yao Ma
Main category: cs.CL
TL;DR: KGQAGen is an LLM-in-the-loop framework that generates high-quality KGQA benchmarks to address quality issues in existing datasets such as WebQSP and CWQ; a manual audit of 16 popular datasets found an average factual correctness rate of only 57%.
Details
Motivation: Popular KGQA benchmarks have critical quality issues including inaccurate annotations, ambiguous questions, outdated knowledge, and poor factual correctness (only 57% on average across 16 audited datasets), limiting reliable evaluation of multi-hop reasoning.
Method: KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to systematically produce challenging and verifiable QA instances. The framework constructs KGQAGen-10k, a 10k-scale benchmark grounded in Wikidata.
Result: Experimental evaluation shows that even state-of-the-art KG-RAG models struggle on KGQAGen-10k, demonstrating its ability to expose limitations of existing systems and provide more rigorous evaluation.
Conclusion: KGQAGen provides a scalable framework for advancing KGQA evaluation through more rigorous benchmark construction, addressing the critical quality issues in existing datasets.
Abstract: Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
[42] Beyond the Link: Assessing LLMs’ ability to Classify Political Content across Global Media
Alejandro De La Fuente-Cuesta, Alberto Martinez-Serra, Nienke Visscher, Laia Castro, Ana S. Cardenal
Main category: cs.CL
TL;DR: LLMs can effectively classify political content from URLs as a scalable alternative to full-text analysis, but exhibit systematic biases by overclassifying centrist news as political.
Details
Motivation: To evaluate whether LLMs can accurately distinguish political content from non-political content using URLs and text across multiple countries and languages, addressing an underexplored area in political science research.
Method: Used cutting-edge LLMs to analyze both text and URLs of news articles from five countries (France, Germany, Spain, UK, US) in different languages, benchmarking performance against human-coded data to compare URL-level analysis with full-text analysis.
Result: URLs embed relevant information and can serve as a scalable, cost-effective alternative for discerning political content. However, LLMs systematically overclassify centrist news as political, leading to false positives that may distort further analyses.
Conclusion: Provided methodological recommendations for using LLMs in political science research, highlighting both the potential of URL-based analysis and the need to address systematic classification biases.
Abstract: The use of large language models (LLMs) is becoming common in political science and digital media research. While LLMs have demonstrated ability in labelling tasks, their effectiveness to classify Political Content (PC) from URLs remains underexplored. This article evaluates whether LLMs can accurately distinguish PC from non-PC using both the text and the URLs of news articles across five countries (France, Germany, Spain, the UK, and the US) and their different languages. Using cutting-edge models, we benchmark their performance against human-coded data to assess whether URL-level analysis can approximate full-text analysis. Our findings show that URLs embed relevant information and can serve as a scalable, cost-effective alternative to discern PC. However, we also uncover systematic biases: LLMs seem to overclassify centrist news as political, leading to false positives that may distort further analyses. We conclude by outlining methodological recommendations on the use of LLMs in political science research.
[43] DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
Bernd J. Kröger
Main category: cs.CL
TL;DR: DYNARTmo is a dynamic articulatory model for visualizing speech articulation in a 2D midsagittal plane, built on the UK-DYNAMO framework and embedded in a web-based application for phonetics education and speech therapy.
Details
Motivation: To create a comprehensive tool for visualizing speech articulation processes that can be used in phonetics education and speech therapy, addressing the need for accessible articulatory modeling.
Method: Builds on the UK-DYNAMO framework and integrates articulatory underspecification, segmental/gestural control, and coarticulation principles. Simulates six key articulators using ten continuous and six discrete control parameters for vocalic and consonantal configurations.
Result: Developed DYNARTmo model embedded in web-based SpeechArticulationTrainer application with sagittal, glottal, and palatal views. Currently focuses on static modeling aspects with capability to generate articulatory configurations.
Conclusion: DYNARTmo provides a functional articulatory model suitable for educational and therapeutic applications, with planned future work on dynamic movement generation and articulatory-acoustic integration.
Abstract: We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.
[44] SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum
Main category: cs.CL
TL;DR: SAND-Math pipeline synthesizes high-quality math problems and systematically increases their complexity via Difficulty Hiking, significantly boosting LLM performance on mathematical reasoning benchmarks.
Details
Motivation: The development of performant mathematical LLMs is bottlenecked by scarcity of useful training data with complex problems, requiring better data generation methods.
Method: Pipeline that first synthesizes high-quality math problems from scratch, then elevates their complexity through a novel Difficulty Hiking step.
Result: Augmenting a strong baseline with the 500-sample SAND-Math dataset outperforms the next-best synthetic dataset by 17.85 absolute points on the AIME25 benchmark; Difficulty Hiking raises average problem difficulty from 5.02 to 5.98 and lifts AIME25 results from 46.38% to 49.23%.
Conclusion: SAND-Math provides a practical and scalable toolkit for building capable mathematical reasoning LLMs through synthetic data generation and complexity enhancement.
Abstract: The demand for Large Language Models (LLMs) at multiple scales, capable of sophisticated and sound mathematical reasoning, continues to grow. However, the development of performant mathematical LLMs is often bottlenecked by the scarcity of useful training data containing problems with significant complexity. We introduce SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first synthesizing high-quality problems from scratch and then systematically elevating their complexity via our newly proposed Difficulty Hiking step. We demonstrate the effectiveness of our approach through two key findings: (1) Augmenting a strong post-training baseline with a small 500-sample SAND-Math dataset significantly boosts performance, outperforming the next-best synthetic dataset by 17.85 absolute points on the AIME25 benchmark. (2) In a dedicated ablation study, we show the effectiveness of our Difficulty Hiking process in increasing average problem difficulty from 5.02 to 5.98. This step consequently lifts AIME25 results from 46.38% to 49.23%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building capable and efficient mathematical reasoning LLMs.
[45] Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx
Main category: cs.CL
TL;DR: This study investigates constructing stable, personalized lexical profiles for conversational agents to enable lexical alignment, finding that compact profiles with 5 items for adjectives/conjunctions and 10 items for other POS categories created from 10 minutes of speech data offer optimal balance.
Details
Motivation: To enable lexical alignment in human-agent dialogue by developing personalized lexical profiles, addressing the underexplored implementation of lexical alignment in conversational agents despite recent LLM advancements.
Method: Varied amounts of transcribed spoken data and number of items per POS category, then evaluated profile performance using recall, coverage, and cosine similarity metrics across time.
Result: Smaller, more compact profiles created from 10 minutes of transcribed speech with 5 items for adjectives/conjunctions and 10 items for adverbs/nouns/pronouns/verbs offered the best balance in performance and data efficiency.
Conclusion: Provides practical insights for constructing stable, personalized lexical profiles with minimal data requirements, serving as a foundational step toward implementing lexical alignment strategies in conversational agents.
Abstract: Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
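The profile construction described above can be sketched as picking the most frequent words per POS category from an early stretch of speech, then checking how many of them reappear later. This is a rough illustration under stated assumptions: the tokens are assumed to be POS-tagged already, the names `build_profile` and `recall` are hypothetical, and the recall definition used here (share of profile items seen again later) may differ from the one in the paper.

```python
from collections import Counter, defaultdict


def build_profile(tagged_tokens, items_per_pos):
    """Return {pos: [most frequent words]} with at most items_per_pos[pos] words.

    tagged_tokens is a list of (word, pos) pairs from the early speech sample.
    """
    counts = defaultdict(Counter)
    for word, pos in tagged_tokens:
        counts[pos][word.lower()] += 1
    return {
        pos: [word for word, _ in counts[pos].most_common(k)]
        for pos, k in items_per_pos.items()
        if pos in counts
    }


def recall(profile, later_tokens):
    """Fraction of profile items that occur again in later speech (assumed definition)."""
    later = {(word.lower(), pos) for word, pos in later_tokens}
    items = [(word, pos) for pos, words in profile.items() for word in words]
    if not items:
        return 0.0
    return sum(item in later for item in items) / len(items)


# Toy usage with hand-tagged tokens; real input would be transcribed speech.
early = [("really", "ADV"), ("really", "ADV"), ("nice", "ADJ"),
         ("very", "ADV"), ("nice", "ADJ"), ("big", "ADJ")]
later = [("nice", "ADJ"), ("quite", "ADV")]
profile = build_profile(early, {"ADJ": 1, "ADV": 1})
score = recall(profile, later)
```

Setting the per-category sizes to 5 for adjectives and conjunctions and 10 for the other categories would mirror the configuration the study reports as the best trade-off.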
[46] LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel
Main category: cs.CL
TL;DR: LAWCAT is a linear attention framework that efficiently converts pre-trained transformers into performant linear models for long-context applications, achieving competitive performance with minimal training resources.
Details
Motivation: Transformers have quadratic computational complexity that limits their use in long-context applications, while training linear-complexity alternatives from scratch is resource-intensive.
Method: LAWCAT integrates causal Conv1D layers for local dependency modeling and uses normalized gated linear attention to improve generalization across context lengths, enabling efficient knowledge distillation from pre-trained transformers.
Result: Distilling Mistral-7B with only 1K-length sequences achieves over 90% passkey retrieval accuracy up to 22K tokens. Llama3.2-1B variant performs competitively on various long-context benchmarks while requiring less than 0.1% pre-training tokens and shows faster prefill speeds than FlashAttention-2 for sequences over 8K tokens.
Conclusion: LAWCAT provides an efficient pathway to high-performance linear models for long-context applications, reducing reliance on extensive training data and computational resources, making it suitable for edge deployment.
Abstract: Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance on S-NIAH 1&2&3 tasks (1K-8K context length) and the BABILong benchmark (QA2&QA3, 0K-16K context length), requiring less than 0.1% of the pre-training tokens used by pre-trained models. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources. Code is released at: https://github.com/zeyuliu1037/LAWCAT
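To make the quadratic-versus-linear contrast concrete, here is a minimal pure-Python sketch of generic causal linear attention with a running state, checked against an equivalent masked quadratic form. It is not LAWCAT itself: the Conv1D layers, gating, and distillation are omitted, the feature map `elu(x) + 1` is a common choice assumed here rather than taken from the paper, and all function names are hypothetical.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Causal linear attention in O(T) time using a running state.

    state[i][j] accumulates phi(k)[i] * v[j]; norm[i] accumulates phi(k)[i].
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    norm = [0.0] * d_k
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = sum(fq[i] * norm[i] for i in range(d_k))
        outputs.append([
            sum(fq[i] * state[i][j] for i in range(d_k)) / denom
            for j in range(d_v)
        ])
    return outputs


def quadratic_reference(queries, keys, values):
    """The same computation written as masked O(T^2) attention, for checking."""
    outputs = []
    for t, q in enumerate(queries):
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(keys[s])))
                   for s in range(t + 1)]
        total = sum(weights)
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights)) / total
            for j in range(len(values[0]))
        ])
    return outputs
```

The recurrent form keeps a fixed-size state regardless of sequence length, which is the property that makes prefill and decoding cost grow linearly with the number of tokens.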
[47] Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation
Jahidul Arafat, Fariha Tasmin, Sanjaya Poudel
Main category: cs.CL
TL;DR: First comprehensive CSP formulation of Wordle with constraint-aware solving strategies, achieving 3.54 average guesses with 99.9% success rate and 46% faster runtime than baseline methods.
Details
Motivation: Existing Wordle solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, creating a need for principled CSP approaches.
Method: Introduced CSP-Aware Entropy (computing information gain after constraint propagation) and Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints.
Result: CSP-Aware Entropy achieved 3.54 average guesses with 99.9% success rate, 1.7% improvement over Forward Checking with 46% faster runtime. Maintained performance advantage under noise and achieved 100% success across all noise levels with Probabilistic CSP.
Conclusion: Principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains, with core CSP principles transferring across languages without language-specific tuning.
Abstract: Wordle presents an algorithmically rich testbed for constraint satisfaction problem (CSP) solving. While existing solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, we present the first comprehensive CSP formulation of Wordle with novel constraint-aware solving strategies. We introduce CSP-Aware Entropy, computing information gain after constraint propagation rather than on raw candidate sets, and a Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints. Through evaluation on 2,315 English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate, a statistically significant 1.7% improvement over Forward Checking (t=-4.82, p<0.001, Cohen’s d=0.07) with 46% faster runtime (12.9ms versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3 percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic CSP achieves 100% success across all noise levels (0-20%) through constraint recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates 88% success with zero language-specific tuning, validating that core CSP principles transfer across languages despite an 11.2 percentage point gap from linguistic differences (p<0.001, Fisher’s exact test). Our open-source implementation with 34 unit tests achieving 91% code coverage provides reproducible infrastructure for CSP research. The combination of formal CSP treatment, constraint-aware heuristics, probabilistic-logical integration, robustness analysis, and cross-lexicon validation establishes new performance benchmarks demonstrating that principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains.
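The idea of scoring a guess by entropy over the constraint-filtered candidate set can be sketched as follows. This is a simplified illustration, not the paper's implementation: constraint propagation is modeled here as plain filtering of candidates consistent with observed feedback, and the names `feedback`, `propagate`, and `entropy` are hypothetical.

```python
import math
from collections import Counter


def feedback(guess, answer):
    """Wordle feedback string: 'g' green, 'y' yellow, 'b' gray (duplicates handled)."""
    result = ["b"] * len(guess)
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            remaining[a] += 1  # answer letters still available for yellows
    for i, g in enumerate(guess):
        if result[i] == "b" and remaining[g] > 0:
            result[i] = "y"
            remaining[g] -= 1
    return "".join(result)


def propagate(candidates, guess, pattern):
    """Keep only candidates consistent with the observed feedback."""
    return [w for w in candidates if feedback(guess, w) == pattern]


def entropy(guess, candidates):
    """Expected information (bits) of a guess over the current candidate set."""
    buckets = Counter(feedback(guess, w) for w in candidates)
    n = len(candidates)
    return -sum(c / n * math.log2(c / n) for c in buckets.values())


words = ["crane", "crate", "brace", "trace"]
print(entropy("crane", words))  # 1.5
survivors = propagate(words, "crane", "yggbg")
```

Calling `entropy` on the list returned by `propagate`, rather than on the raw word list, is the sense in which the score is computed after constraints have been applied.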
[48] AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives
Khalid Mehtab Khan, Anagha Kulkarni
Main category: cs.CL
TL;DR: AWARE framework improves cultural capital theme detection in student reflections by enhancing transformer models’ domain, context, and class overlap awareness, outperforming a strong baseline by 2.1 percentage points in Macro-F1.
Details
Motivation: Cultural capital themes in student reflections are valuable for equitable learning environments but are difficult to detect with standard NLP models due to their narrative nature and domain-specific language.
Method: AWARE framework with three components: Domain Awareness (vocabulary adaptation), Context Awareness (essay-aware embeddings), and Class Overlap Awareness (multi-label classification for theme coexistence).
Result: AWARE outperforms strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all cultural capital themes.
Conclusion: Provides a robust and generalizable methodology for text classification tasks where meaning depends on narrative context, particularly for detecting nuanced cultural capital themes.
Abstract: Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core challenge stems from a lack of awareness, as standard models are pre-trained on general corpora, leaving them blind to the domain-specific language and narrative context inherent to the data. To address this, we introduce AWARE, a framework that systematically attempts to improve a transformer model’s awareness for this nuanced task. AWARE has three core components: 1) Domain Awareness, adapting the model’s vocabulary to the linguistic style of student reflections; 2) Context Awareness, generating sentence embeddings that are aware of the full essay context; and 3) Class Overlap Awareness, employing a multi-label strategy to recognize the coexistence of themes in a single sentence. Our results show that by making the model explicitly aware of the properties of the input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. This work provides a robust and generalizable methodology for any text classification task in which meaning depends on the context of the narrative.
[49] Revisiting Long-context Modeling from Context Denoising Perspective
Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang
Main category: cs.CL
TL;DR: The paper proposes Context Denoising Training (CDT), a method that uses Integrated Gradient scores to detect and mitigate contextual noise in long-context models, improving attention on critical tokens and boosting performance.
Details
Motivation: Long-context models are susceptible to contextual noise (irrelevant tokens) that mislead model attention, despite their potential for processing long sequences in real-world applications.
Method: Proposes Context Denoising Training (CDT) using Integrated Gradient scores to detect and quantify noise, then trains models to improve attention on critical tokens while reinforcing their influence on predictions.
Result: Extensive experiments show CDT substantially boosts model attention on critical tokens and improves performance across four tasks. An 8B model trained with CDT achieves performance (50.92) comparable to GPT-4o (51.00).
Conclusion: CDT is an effective training strategy that addresses context noise in long-context models, enabling smaller models to achieve performance comparable to much larger models like GPT-4o.
Abstract: Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent research reveals that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens, that can mislead model attention. In this paper, we conduct a fine-grained analysis of the context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify the noise information within the context. Our findings reveal that even simple mitigation of detected context noise can substantially boost the model’s attention on critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, when trained with CDT, an open-source 8B model can achieve performance (50.92) comparable to GPT-4o (51.00).
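Integrated gradients, the attribution technique behind the IG score mentioned above, can be illustrated on a toy function. The sketch below uses a midpoint Riemann approximation and an analytic gradient supplied by the caller. It shows only the generic technique: how the paper turns these attributions into a per-token noise score is not specified here, and the names `integrated_gradients` and `toy_grad` are hypothetical.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral over a in [0, 1] of
    df/dx_i evaluated at b + a * (x - b), using the midpoint rule."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_grad(p):
    """Gradient of the toy model f(x) = x0**2 + 3*x1; the third input is ignored."""
    return [2.0 * p[0], 3.0, 0.0]


attributions = integrated_gradients(toy_grad, [2.0, 1.0, 5.0], [0.0, 0.0, 0.0])
```

Here the third input receives zero attribution because the toy function never uses it, which is the kind of signal an attribution-based noise detector would look for. The attributions also sum to f(x) minus f(baseline), the completeness property of integrated gradients.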
[50] LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang
Main category: cs.CL
TL;DR: Long-RewardBench is a new benchmark for evaluating reward models on long-context scenarios, revealing that current models struggle with context-response consistency. The authors propose a multi-stage training approach that creates robust Long-context RMs (LongRMs) that outperform much larger models.
Details
Motivation: Current reward models are limited to short contexts and focus mainly on response-level attributes, neglecting long context-response consistency which is crucial for real-world applications like LLM agents.
Method: Proposed a general multi-stage training strategy to scale arbitrary models into robust Long-context RMs (LongRMs), using the Long-RewardBench benchmark with Pairwise Comparison and Best-of-N tasks.
Result: The 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of proprietary Gemini 2.5 Pro model, while preserving strong short-context capability.
Conclusion: The proposed multi-stage training approach effectively creates robust long-context reward models that address the critical gap in context-response consistency for long history trajectories.
Abstract: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., in LLM agents, it becomes indispensable to evaluate whether a model’s responses are not only high-quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust Long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
[51] Hey, wait a minute: on at-issue sensitivity in Language Models
Sanghee J. Kim, Kanishka Misra
Main category: cs.CL
TL;DR: DGRC method uses at-issueness to evaluate dialogue naturalness in LMs, finding they prefer continuing at-issue content, especially instruct-tuned models, and modulate this preference with contextual cues.
Details
Motivation: Evaluating dialogue naturalness in LMs is challenging due to varying notions of naturalness and limited scalable metrics.
Method: Divide, Generate, Recombine, and Compare (DGRC): divides dialogue prompts, generates continuations for subparts, recombines sequences, and compares likelihoods.
Result: LMs prefer continuing at-issue content (enhanced in instruct-tuned models) and reduce this preference when relevant cues are present.
Conclusion: Instruct-tuning does not further amplify the cue-driven reduction in at-issue preference, but that reduction itself reflects a hallmark of successful dialogue dynamics.
Abstract: Evaluating the naturalness of dialogue in language models (LMs) is not trivial: notions of ‘naturalness’ vary, and scalable quantitative metrics remain limited. This study leverages the linguistic notion of ‘at-issueness’ to assess dialogue naturalness and introduces a new method: Divide, Generate, Recombine, and Compare (DGRC). DGRC (i) divides a dialogue as a prompt, (ii) generates continuations for subparts using LMs, (iii) recombines the dialogue and continuations, and (iv) compares the likelihoods of the recombined sequences. This approach mitigates bias in linguistic analyses of LMs and enables systematic testing of discourse-sensitive behavior. Applying DGRC, we find that LMs prefer to continue dialogue on at-issue content, with this effect enhanced in instruct-tuned models. They also reduce their at-issue preference when relevant cues (e.g., “Hey, wait a minute”) are present. Although instruct-tuning does not further amplify this modulation, the pattern reflects a hallmark of successful dialogue dynamics.
[52] DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking
Lanni Bu, Lauren Levin, Amir Zeldes
Main category: cs.CL
TL;DR: DiscoTrack is a multilingual LLM benchmark for discourse tracking across 12 languages, testing four levels of discourse understanding that remain challenging for state-of-the-art models.
Details
Motivation: Existing LLM benchmarks focus too much on natural language understanding for explicit information extraction (QA, summarization) at sentence level, lacking challenging multilingual benchmarks for implicit information and pragmatic inferences across larger documents in discourse tracking.
Method: Created DiscoTrack benchmark with tasks across 12 languages targeting four levels of discourse understanding: salience recognition, entity tracking, discourse relations, and bridging inference.
Result: Evaluation shows these discourse tracking tasks remain challenging even for state-of-the-art LLMs, demonstrating the need for more sophisticated discourse understanding capabilities.
Conclusion: DiscoTrack addresses the gap in challenging multilingual benchmarks for discourse tracking and reveals current limitations in LLMs’ ability to handle implicit information and pragmatic inferences across larger documents.
Abstract: Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.
[53] Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
Haowei Hua, Hong Jiao, Xinyi Wang
Main category: cs.CL
TL;DR: Using generative language models with summarization and prompting improves automated essay scoring for long essays, increasing QWK from 0.822 to 0.8878.
Details
Motivation: BERT and its variants have a 512-token limit, which is insufficient for automated scoring of long essays.
Method: Employ generative language models for automated scoring via summarization and prompting.
Result: Scoring accuracy improved with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
Conclusion: Generative language models with summarization and prompting are effective for automated scoring of long essays.
Abstract: BERT and its variants have been extensively explored for automated scoring. However, the 512-token limit of these encoder-based models is a shortcoming for automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results show a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
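For readers unfamiliar with the metric, QWK (quadratic weighted kappa) is Cohen's kappa with disagreement weights w_ij = (i - j)^2 / (N - 1)^2 over N rating levels, computed as 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij), where O is the observed confusion matrix and E the matrix expected from the two raters' marginals. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function name and arguments are illustrative assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Standard QWK between two equal-length lists of integer ratings.

    Illustrative helper (name and signature are assumptions, not from the paper).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two rating levels")
    total = len(rater_a)

    # Observed confusion matrix O[i][j]: rater A gave level i, rater B gave level j.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms of each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0.0:
        # Both raters used a single identical level: treat as perfect agreement.
        return 1.0
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    human = [1, 2, 3, 4, 5, 6]
    model = [1, 2, 3, 4, 5, 5]
    print(round(quadratic_weighted_kappa(human, model, 1, 6), 4))
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.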
[54] A Survey on LLM Mid-Training
Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, Xunliang Cai
Main category: cs.CL
TL;DR: Mid-training is a crucial intermediate stage between pre-training and post-training that systematically enhances specific LLM capabilities like mathematics, coding, and reasoning while maintaining foundational competencies.
Details
Motivation: Recent advances show multi-stage training benefits, with mid-training emerging as vital for bridging pre-training and post-training to systematically develop targeted capabilities in LLMs.
Method: Investigates optimization frameworks including data curation, training strategies, and model architecture optimization for mid-training, analyzing mainstream implementations through objective-driven interventions.
Result: Provides a formal definition of mid-training for LLMs and demonstrates how it serves as a distinct critical stage in progressive LLM capability development.
Conclusion: Offers comprehensive taxonomy and actionable insights to support future research and innovation in LLM advancement through mid-training approaches.
Abstract: Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
[55] Leveraging Hierarchical Organization for Medical Multi-document Summarization
Yi-Li Hsu, Katelyn X. Mei, Lucy Lu Wang
Main category: cs.CL
TL;DR: Hierarchical structures in medical multi-document summarization improve model-generated summaries by enhancing clarity and human preference while maintaining content coverage and factuality.
Details
Motivation: To investigate whether hierarchical structures can better organize cross-document relationships in medical multi-document summarization compared to traditional flat methods.
Method: Tested two hierarchical organization approaches across three large language models, with comprehensive evaluation using automated metrics, model-based metrics, and domain expert assessment across multiple quality dimensions.
Result: Human experts preferred model-generated summaries over human-written ones; hierarchical approaches preserved factuality, coverage, and coherence while increasing human preference; GPT-4 judgments aligned well with human evaluations on objective criteria.
Conclusion: Hierarchical structures effectively improve medical summary clarity and human preference while maintaining content quality, offering a practical enhancement for medical multi-document summarization systems.
Abstract: Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model’s ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.
[56] Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices
Špela Vintar, Taja Kuzman Pungeršek, Mojca Brglez, Nikola Ljubešić
Main category: cs.CL
TL;DR: This paper proposes a new taxonomy and best practices for multilingual LLM benchmarking, focusing on European languages and advocating for greater language/culture sensitivity.
Details
Motivation: Current LLM benchmarks are predominantly English-focused, leaving non-English languages under-evaluated despite growing LLM capabilities.
Method: The authors provide an overview of recent LLM benchmarking developments and propose a new taxonomy specifically designed for multilingual scenarios, along with quality standards for coordinated benchmark development.
Result: A framework for categorizing multilingual benchmarks and a set of recommendations for improving evaluation methods that are more sensitive to language and cultural differences.
Conclusion: There is a need for more coordinated, language-sensitive benchmark development for European languages to properly evaluate LLMs in multilingual contexts.
Abstract: While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.
[57] Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models
Sriram Balasubramaniam, Samyadeep Basu, Koustava Goswami, Ryan Rossi, Varun Manjunatha, Roshan Santhosh, Ruiyi Zhang, Soheil Feizi, Nedim Lipka
Main category: cs.CL
TL;DR: DecompTune improves LLM attribution for complex QA by teaching models to decompose answers into constituent units tied to specific sources, outperforming prior methods.
Details
Motivation: Existing post-hoc attribution methods struggle with multi-hop, abstractive, and semi-extractive QA where answers synthesize information across multiple passages, creating a need for more reliable attribution.
Method: Reframe attribution as reasoning by decomposing answers into constituent units tied to context. Use DecompTune post-training with a two-stage SFT + GRPO pipeline on curated datasets annotated with decompositions by strong LLMs.
Result: DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models across extensive experiments.
Conclusion: Treating attribution as a reasoning problem through answer decomposition provides a more effective approach for reliable source attribution in complex QA scenarios.
Abstract: Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
[58] Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Qi Luo, Xiaonan Li, Tingshuo Fan, Xinchi Chen, Xipeng Qiu
Main category: cs.CL
TL;DR: Introduces GlobalQA, the first benchmark for evaluating global RAG capabilities, and proposes GlobalRAG framework that significantly outperforms existing methods on global information aggregation tasks.
Details
Motivation: Current RAG evaluation focuses on local retrieval from small document subsets, but real-world applications require global RAG capabilities that aggregate information across entire document collections to derive corpus-level insights.
Method: Proposes GlobalRAG framework with chunk-level retrieval to preserve structural coherence, LLM-driven intelligent filters to eliminate noisy documents, and aggregation modules for precise symbolic computation.
Result: Existing RAG methods perform poorly on global tasks (strongest baseline: 1.51 F1). GlobalRAG achieves 6.63 F1 on Qwen2.5-14B model, significantly outperforming baselines.
Conclusion: GlobalRAG effectively addresses the limitations of current RAG methods for global information aggregation tasks, demonstrating substantial performance improvements on the new GlobalQA benchmark.
Abstract: Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability – global RAG – which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, “What are the top 10 most cited papers in 2023?”). In this paper, we introduce GlobalQA – the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline’s 1.51 F1, validating the effectiveness of our method.
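The four GlobalQA task types (counting, extremum queries, sorting, and top-k extraction) are all aggregations over every relevant record in a corpus, which is why answering from a handful of retrieved chunks tends to fail. The sketch below illustrates that kind of symbolic aggregation in plain Python. It is not the GlobalRAG implementation: it assumes the documents have already been filtered and reduced to structured records, and the record fields, helper names, and toy data are all invented for illustration.

```python
import heapq

# Toy, invented records standing in for a filtered corpus (not real data).
TOY_CORPUS = [
    {"title": "A", "year": 2023, "citations": 12},
    {"title": "B", "year": 2022, "citations": 40},
    {"title": "C", "year": 2023, "citations": 7},
    {"title": "D", "year": 2023, "citations": 30},
]


def count_matching(records, predicate):
    """Counting: how many records satisfy the predicate."""
    return sum(1 for record in records if predicate(record))


def extremum(records, key):
    """Extremum query: the single record with the largest key value."""
    return max(records, key=key)


def sort_by(records, key, descending=True):
    """Sorting: all records ordered by the key."""
    return sorted(records, key=key, reverse=descending)


def top_k(records, key, k):
    """Top-k extraction: the k records with the largest key values."""
    return heapq.nlargest(k, records, key=key)


if __name__ == "__main__":
    papers_2023 = [r for r in TOY_CORPUS if r["year"] == 2023]
    print(count_matching(TOY_CORPUS, lambda r: r["year"] == 2023))
    print(extremum(TOY_CORPUS, key=lambda r: r["citations"])["title"])
    print([r["title"] for r in sort_by(papers_2023, key=lambda r: r["citations"])])
    print([r["title"] for r in top_k(papers_2023, key=lambda r: r["citations"], k=2)])
```

Each helper has to see every matching record to be correct, so a missed or noisy document changes the answer; this is the motivation the abstract gives for filtering before aggregating.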
[59] A Unified Representation Underlying the Judgment of Large Language Models
Yi-Long Lu, Jiajun Song, Wei Wang
Main category: cs.CL
TL;DR: LLMs use a unified Valence-Assent Axis (VAA) for evaluative judgments, combining subjective valence and factual assent, which can subordinate reasoning and cause hallucinations.
Details
Motivation: To determine whether judgment in AI systems relies on specialized modules or a unified resource, specifically investigating if decodable neural representations in LLMs are truly independent systems.
Method: Analyzed diverse evaluative judgments across multiple LLMs, identified the dominant Valence-Assent Axis, and conducted direct interventions to test its function.
Result: Found that evaluative judgments converge along a single VAA dimension that jointly encodes valence and factual assent, and demonstrated this axis subordinates reasoning by steering rationales to match evaluative state.
Conclusion: LLMs have a convergent architecture where the VAA promotes coherent judgment but systematically undermines faithful reasoning, providing a mechanistic explanation for response bias and hallucinations.
Abstract: A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture for evaluative judgment. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence (“what is good”) and the model’s assent to factual claims (“what is true”). Through direct interventions, we demonstrate this axis drives a critical mechanism, which is identified as the subordination of reasoning: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. Our discovery offers a mechanistic account for response bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.
[60] Zero-RAG: Towards Retrieval-Augmented Generation with Zero Redundant Knowledge
Qi Luo, Xiaonan Li, Junqi Dai, Shuang Cheng, Xipeng Qiu
Main category: cs.CL
TL;DR: Zero-RAG addresses knowledge redundancy in RAG systems by pruning redundant external corpus content and improving LLM’s utilization of internal knowledge, achieving 30% corpus reduction and 22% retrieval speedup.
Details
Motivation: Address knowledge redundancy between LLMs' internal knowledge and external RAG corpora, which increases retrieval costs and hurts performance on questions LLMs can answer themselves.
Method: Proposes Mastery-Score metric to identify redundant knowledge for pruning, Query Router to avoid irrelevant documents, and Noise-Tolerant Tuning to improve LLM’s internal knowledge utilization.
Result: Prunes Wikipedia corpus by 30%, accelerates retrieval stage by 22%, without compromising RAG performance.
Conclusion: Zero-RAG effectively reduces knowledge redundancy and improves RAG efficiency while maintaining performance.
Abstract: Retrieval-Augmented Generation (RAG), which usually uses a large external corpus to supplement the knowledge of LLMs, has shown remarkable results in addressing Large Language Models’ hallucinations. However, with the development of LLMs, the internal knowledge of LLMs has expanded significantly, thus causing significant knowledge redundancy between the external corpus and LLMs. On the one hand, the indexing cost of dense retrieval is highly related to the corpus size and thus significant redundant knowledge intensifies the dense retrieval’s workload. On the other hand, the redundant knowledge in the external corpus is not helpful to LLMs and our exploratory analysis shows that it instead hurts the RAG performance on those questions which the LLM can answer by itself. To address these issues, we propose Zero-RAG. Specifically, we first propose the Mastery-Score metric to identify redundant knowledge in the RAG corpus to prune it. After pruning, answers to “mastered” questions rely primarily on internal knowledge of the LLM. To better harness the internal capacity, we propose Query Router and Noise-Tolerant Tuning to avoid the irrelevant documents’ distraction and thus further improve the LLM’s utilization of internal knowledge with pruned corpus. Experimental results show that Zero-RAG prunes the Wikipedia corpus by 30% and accelerates the retrieval stage by 22%, without compromising RAG’s performance.
[61] Multi-refined Feature Enhanced Sentiment Analysis Using Contextual Instruction
Peter Atandoh, Jie Zou, Weikang Guo, Jiwei Wei, Zheng Wang
Main category: cs.CL
TL;DR: CISEA-MRFE is a novel PLM-based framework that improves sentiment analysis by integrating contextual instructions, semantic enhancement augmentation, and multi-refined feature extraction to address challenges like nuanced emotions, domain shifts, and imbalanced data.
Details
Motivation: Existing sentiment analysis approaches underperform with nuanced emotional cues, domain shifts, and imbalanced sentiment distributions due to inadequate semantic grounding, poor generalization, and biases toward dominant sentiment classes.
Method: Proposes CISEA-MRFE framework with three components: Contextual Instruction (CI) for domain-aware sentiment disambiguation, Semantic Enhancement Augmentation (SEA) for sentiment-consistent paraphrastic augmentation, and Multi-Refined Feature Extraction (MRFE) combining Scale-Adaptive Depthwise Encoder (SADE) for multi-scale features and Emotion Evaluator Context Encoder (EECE) for affect-aware modeling.
Result: Outperforms strong baselines on four benchmark datasets, with relative accuracy improvements of up to 4.6% on IMDb, 6.5% on Yelp, 30.3% on Twitter, and 4.1% on Amazon.
Conclusion: The framework demonstrates effectiveness and strong generalization ability for sentiment classification across varied domains, successfully addressing limitations of existing approaches.
Abstract: Sentiment analysis using deep learning and pre-trained language models (PLMs) has gained significant traction due to their ability to capture rich contextual representations. However, existing approaches often underperform in scenarios involving nuanced emotional cues, domain shifts, and imbalanced sentiment distributions. We argue that these limitations stem from inadequate semantic grounding, poor generalization to diverse linguistic patterns, and biases toward dominant sentiment classes. To overcome these challenges, we propose CISEA-MRFE, a novel PLM-based framework integrating Contextual Instruction (CI), Semantic Enhancement Augmentation (SEA), and Multi-Refined Feature Extraction (MRFE). CI injects domain-aware directives to guide sentiment disambiguation; SEA improves robustness through sentiment-consistent paraphrastic augmentation; and MRFE combines a Scale-Adaptive Depthwise Encoder (SADE) for multi-scale feature specialization with an Emotion Evaluator Context Encoder (EECE) for affect-aware sequence modeling. Experimental results on four benchmark datasets demonstrate that CISEA-MRFE consistently outperforms strong baselines, achieving relative improvements in accuracy of up to 4.6% on IMDb, 6.5% on Yelp, 30.3% on Twitter, and 4.1% on Amazon. These results validate the effectiveness and generalization ability of our approach for sentiment classification across varied domains.
[62] SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Ferdinando Fioretto
Main category: cs.CL
TL;DR: SpecDiff-2 is a novel speculative decoding framework that uses discrete diffusion as a non-autoregressive drafter to overcome parallelism limitations and misalignment issues in current approaches, achieving up to 5.5x speed-up over standard decoding.
Details
Motivation: Current speculative decoding approaches suffer from two bottlenecks: autoregressive dependency during drafting that limits parallelism, and frequent rejections of draft tokens due to misalignment between draft and verify models.
Method: Proposes SpecDiff-2 framework that leverages discrete diffusion as a non-autoregressive drafter to address parallelism limitations, and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers to address misalignment issues.
Result: Achieves state-of-the-art performance across reasoning, coding, and mathematical benchmarks with up to 55% improvement in tokens-per-second over previous baselines and up to 5.5x average speed-up over standard decoding, without accuracy loss.
Conclusion: SpecDiff-2 successfully addresses fundamental bottlenecks in speculative decoding through its novel discrete diffusion approach and calibration techniques, establishing a new state-of-the-art for LLM inference acceleration.
Abstract: Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that SpecDiff-2 achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to an average of +55% over previous baselines and obtaining up to 5.5x average speed-up over standard decoding, without any loss of accuracy.
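The "lossless draft-then-verify procedure" mentioned in the abstract can be made concrete with the generic speculative sampling acceptance rule: a drafted token x is accepted with probability min(1, p(x) / q(x)), where q is the drafter's distribution and p the verifier's, and on the first rejection a replacement is drawn from the normalized residual max(0, p - q). The sketch below shows only this generic verification step on toy categorical distributions. It does not model SpecDiff-2's diffusion drafter or its calibration techniques, and the function name and data layout are assumptions made for illustration.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Generic speculative-sampling verification over one drafted block.

    draft_probs[i] and target_probs[i] map token -> probability at position i
    (the drafter's q and the verifier's p). Returns the tokens kept for this
    step. Illustrative helper, not code from the paper.
    """
    kept = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept the drafted token with probability min(1, p(token) / q(token)).
        accept_prob = min(1.0, p.get(token, 0.0) / q[token])
        if rng.random() < accept_prob:
            kept.append(token)
            continue
        # First rejection: resample from the residual max(0, p - q), then stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        total = sum(residual.values())
        tokens = list(residual)
        weights = [residual[t] / total for t in tokens]
        kept.append(rng.choices(tokens, weights=weights)[0])
        break
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    q = [{"a": 0.6, "b": 0.4}, {"a": 0.5, "b": 0.5}]
    p = [{"a": 0.6, "b": 0.4}, {"a": 0.1, "b": 0.9}]
    print(verify_draft(["a", "a"], q, p, rng))
```

In the full procedure the verifier also samples one extra token when every drafted token is accepted; that step is omitted here for brevity. The sketch also shows why the second bottleneck in the abstract matters: the closer q is to p, the more drafted tokens survive verification.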
[63] Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?
Berk Atil, Rebecca J. Passonneau, Fred Morstatter
Main category: cs.CL
TL;DR: First systematic multilingual evaluation of jailbreak attacks and defenses across 10 languages shows safety vulnerabilities vary significantly by language, with high-resource languages being safer for standard queries but more vulnerable to adversarial attacks.
Details
Motivation: To address the underexplored cross-lingual generalization of jailbreak attacks and defenses in LLMs, as current safety evaluations are primarily English-centric.
Method: Evaluated two jailbreak types (logical-expression-based and adversarial-prompt-based) across 10 languages spanning high-, medium-, and low-resource categories using six LLMs on HarmBench and AdvBench benchmarks.
Result: Attack success and defense robustness vary significantly across languages; high-resource languages are safer under standard queries but more vulnerable to adversarial attacks; simple defenses are effective but language- and model-dependent.
Conclusion: Findings highlight the need for language-aware and cross-lingual safety benchmarks for LLMs to ensure robust safety alignment across different languages.
Abstract: Large language models (LLMs) undergo safety alignment after training and tuning, yet recent work shows that safety can be bypassed through jailbreak attacks. While many jailbreaks and defenses exist, their cross-lingual generalization remains underexplored. This paper presents the first systematic multilingual evaluation of jailbreaks and defenses across ten languages – spanning high-, medium-, and low-resource languages – using six LLMs on HarmBench and AdvBench. We assess two jailbreak types: logical-expression-based and adversarial-prompt-based. For both types, attack success and defense robustness vary across languages: high-resource languages are safer under standard queries but more vulnerable to adversarial ones. Simple defenses can be effective, but are language- and model-dependent. These findings call for language-aware and cross-lingual safety benchmarks for LLMs.
[64] The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles
Abhinav P M, Ojasva Saxena, Oswald C, Parameswari Krishnamurthy
Main category: cs.CL
TL;DR: LLMs struggle with culturally grounded reasoning across Indian languages, with top-performing models being overconfident while lower-performing ones show better self-awareness.
Details
Motivation: To examine LLMs' reasoning and self-assessment abilities across seven major Indian languages, as culturally grounded reasoning in non-English languages remains underexplored.
Method: Created multilingual riddle dataset with traditional and context-reconstructed variants, evaluated five LLMs under seven prompting strategies, and conducted self-evaluation experiments to measure reasoning consistency.
Result: Gemini 2.5 Pro performed best overall but showed poor self-awareness (4.34% True Negative Rate), while lower-performing models like LLaMA 4 Scout demonstrated better self-assessment (42.09% True Negative Rate). Accuracy varied notably across languages.
Conclusion: Clear gaps exist in multilingual reasoning, highlighting the need for models that not only reason effectively but also recognize their own limitations.
Abstract: The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants and evaluate five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick) under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model’s initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.
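The True Negative Rate quoted above is TN / (TN + FP). The sketch below shows one plausible way to compute it for a self-evaluation setup, under a labeled assumption: a "negative" is an initial answer that was actually wrong, a true negative is a wrong answer the model itself flags as wrong, and a false positive is a wrong answer the model endorses as correct. The abstract does not spell out the paper's exact protocol, so the function name and input format here are hypothetical.

```python
def true_negative_rate(answer_was_correct, judged_correct):
    """TNR = TN / (TN + FP) over the model's initially wrong answers.

    answer_was_correct[i]: whether the initial answer to item i was right.
    judged_correct[i]: whether the model, on self-evaluation, called it right.
    Hypothetical helper; the mapping to the paper's setup is an assumption.
    """
    true_negatives = 0   # wrong answers the model recognized as wrong
    false_positives = 0  # wrong answers the model endorsed as right
    for was_correct, judged in zip(answer_was_correct, judged_correct):
        if was_correct:
            continue  # only initially wrong answers count as negatives
        if judged:
            false_positives += 1
        else:
            true_negatives += 1
    negatives = true_negatives + false_positives
    if negatives == 0:
        raise ValueError("no incorrect answers, so TNR is undefined")
    return true_negatives / negatives


if __name__ == "__main__":
    was_correct = [True, False, False, False, False]
    self_judgment = [True, True, True, True, False]
    print(true_negative_rate(was_correct, self_judgment))
```

Under this reading, a low TNR means the model endorses most of its own wrong answers, which matches the abstract's description of top-performing models as overconfident.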
[65] Accumulating Context Changes the Beliefs of Language Models
Jiayi Geng, Howard Chen, Ryan Liu, Manoel Horta Ribeiro, Robb Willer, Graham Neubig, Thomas L. Griffiths
Main category: cs.CL
TL;DR: Language models’ beliefs can significantly shift through accumulated context from conversations and reading, making their responses unreliable.
Details
Motivation: To investigate how accumulating context during interactions and text processing changes language models' belief profiles and behaviors, revealing latent risks in autonomous systems.
Method: Tested belief shifts through moral dilemma discussions and political text exposure, and examined behavioral changes via tool selection tasks that reflect implicit beliefs.
Result: GPT-5 showed 54.7% belief shift after moral discussions, Grok 4 showed 27.2% shift after reading opposing political texts, with behavioral changes aligning with stated belief shifts.
Conclusion: Extended talking and reading sessions make language models’ opinions and actions unreliable due to significant belief shifts, posing risks in autonomous applications.
Abstract: Language model (LM) assistants are increasingly used in applications such as brainstorming and research. Improvements in memory and context size have allowed these models to become more autonomous, which has also resulted in more text accumulation in their context windows without explicit user intervention. This comes with a latent risk: the belief profiles of models – their understanding of the world as manifested in their responses or actions – may silently change as context accumulates. This can lead to subtly inconsistent user experiences, or shifts in behavior that deviate from the original alignment of the models. In this paper, we explore how accumulating context by engaging in interactions and processing text – talking and reading – can change the beliefs of language models, as manifested in their responses and behaviors. Our results reveal that models’ belief profiles are highly malleable: GPT-5 exhibits a 54.7% shift in its stated beliefs after 10 rounds of discussion about moral dilemmas and queries about safety, while Grok 4 shows a 27.2% shift on political issues after reading texts from the opposing position. We also examine models’ behavioral changes by designing tasks that require tool use, where each tool selection corresponds to an implicit belief. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems. Our analysis exposes the hidden risk of belief shift as models undergo extended sessions of talking or reading, rendering their opinions and actions unreliable.
[66] Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems
Elias Lumer, Faheem Nizar, Anmol Gulati, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah
Main category: cs.CL
TL;DR: Tool-to-Agent Retrieval improves agent selection in multi-agent systems by embedding tools and agents in a shared vector space and connecting them through metadata, enabling granular retrieval at both tool and agent levels.
Details
Motivation: Existing retrieval methods match queries against coarse agent-level descriptions, obscuring fine-grained tool functionality and leading to suboptimal agent selection in LLM multi-agent systems.
Method: A unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships, enabling granular tool-level or agent-level retrieval.
Result: Achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state-of-the-art agent retrievers on the LiveMCPBench benchmark across eight embedding models.
Conclusion: Tool-to-Agent Retrieval effectively addresses context dilution from chunking many tools together and ensures agents and their underlying tools are equally represented for better retrieval performance.
Abstract: Recent advances in LLM Multi-Agent Systems enable scalable orchestration of sub-agents, each coordinating hundreds or thousands of tools or Model Context Protocol (MCP) servers. However, existing retrieval methods typically match queries against coarse agent-level descriptions before routing, which obscures fine-grained tool functionality and often results in suboptimal agent selection. We introduce Tool-to-Agent Retrieval, a unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships. By explicitly representing tool capabilities and traversing metadata to the agent level, Tool-to-Agent Retrieval enables granular tool-level or agent-level retrieval, ensuring that agents and their underlying tools or MCP servers are equally represented without the context dilution that arises from chunking many tools together. Evaluating Tool-to-Agent Retrieval across eight embedding models, our approach achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state-of-the-art agent retrievers on the LiveMCPBench benchmark.
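Note: the retrieval idea described above can be illustrated with a toy sketch. It is not the authors' implementation: the index layout, the hand-written 3-dimensional vectors, and the names `INDEX`, `retrieve_agents`, and `recall_at_k` are all assumptions made for illustration. Tools and agents sit in one index, each tool carries its parent agent as metadata, and a tool hit is mapped back to its agent.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Shared index (toy vectors): agents and tools side by side.
# The "agent" field is the metadata link from a tool to its parent agent.
INDEX = [
    {"id": "agent:files", "kind": "agent", "agent": "files", "vec": [1.0, 0.0, 0.0]},
    {"id": "tool:read_csv", "kind": "tool", "agent": "files", "vec": [0.7, 0.0, 0.7]},
    {"id": "agent:web", "kind": "agent", "agent": "web", "vec": [0.0, 0.9, 0.2]},
    {"id": "tool:fetch_url", "kind": "tool", "agent": "web", "vec": [0.0, 1.0, 0.1]},
]


def retrieve_agents(query_vec, k, kinds=("agent", "tool")):
    """Rank index entries by cosine similarity, then map each hit to its agent."""
    candidates = [e for e in INDEX if e["kind"] in kinds]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    agents = []
    for entry in ranked:
        if entry["agent"] not in agents:  # de-duplicate agents, keep best rank
            agents.append(entry["agent"])
        if len(agents) == k:
            break
    return agents


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant agents that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


query = [0.1, 0.0, 1.0]  # toy query that is close to tool:read_csv
joint = retrieve_agents(query, k=1)                         # ['files']
agent_only = retrieve_agents(query, k=1, kinds=("agent",))  # ['web']
print(joint, agent_only, recall_at_k(joint, ["files"], k=1))
```

In this toy setup the agent-only baseline routes to the wrong agent because the coarse agent vector hides the matching tool, while the joint index recovers it through the tool's metadata link.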
cs.CV
[67] iFlyBot-VLA Technical Report
Yuan Zhang, Chenyu Xue, Wenjie Xu, Chao Ji, Jiajia wu, Jia Pan
Main category: cs.CV
TL;DR: iFlyBot-VLA is a Vision-Language-Action model with a dual-level action representation framework that combines latent actions from cross-embodiment data and structured discrete action tokens, achieving competitive performance on manipulation tasks.
Details
Motivation: To develop a comprehensive VLA model that can effectively bridge language, vision, and action representations for robotic manipulation tasks, addressing the challenge of aligning these different modalities.
Method: Uses a dual-level action representation framework with latent actions (high-level intentions) and structured discrete action tokens (low-level dynamics), trained on large-scale human/robotic videos with mixed training combining robot trajectory data with QA datasets.
Result: Demonstrates superiority on LIBERO Franka benchmark and achieves competitive success rates in real-world manipulation tasks across diverse and challenging scenarios.
Conclusion: The proposed framework effectively aligns language, vision, and action representations, enabling the VLM to directly contribute to action generation, with plans to open-source part of the dataset for community research.
Abstract: We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
[68] Challenging DINOv3 Foundation Model under Low Inter-Class Variability: A Case Study on Fetal Brain Ultrasound
Edoardo Conti, Riccardo Rosati, Lorenzo Federici, Adriano Mancini, Maria Chiara Fiorentin
Main category: cs.CV
TL;DR: First comprehensive evaluation of foundation models in fetal ultrasound imaging under low inter-class variability, showing domain-specific pretraining is essential for distinguishing anatomically similar fetal brain planes.
Details
Motivation: To address the gap in evaluating foundation models' ability to discriminate anatomically similar structures in fetal ultrasound, particularly fetal brain standard planes (TT, TV, TC) which have highly overlapping features and pose challenges for biometric assessment.
Method: Curated and aggregated all publicly available fetal ultrasound datasets into FetalUS-188K benchmark (188K+ images). Evaluated DINOv3 through self-supervised pretraining on fetal ultrasound data and natural images, using linear probing and full fine-tuning with standardized adaptation protocols.
Result: Models pretrained on fetal ultrasound data consistently outperformed natural-image initialized models, with up to 20% F1-score improvement. Domain-adaptive pretraining preserved subtle echogenic and structural cues crucial for distinguishing intermediate planes like TV.
Conclusion: Generic foundation models fail to generalize under low inter-class variability, while domain-specific pretraining is essential for achieving robust and clinically reliable representations in fetal brain ultrasound imaging.
Abstract: Purpose: This study provides the first comprehensive evaluation of foundation models in fetal ultrasound (US) imaging under low inter-class variability conditions. While recent vision foundation models such as DINOv3 have shown remarkable transferability across medical domains, their ability to discriminate anatomically similar structures has not been systematically investigated. We address this gap by focusing on fetal brain standard planes–transthalamic (TT), transventricular (TV), and transcerebellar (TC)–which exhibit highly overlapping anatomical features and pose a critical challenge for reliable biometric assessment. Methods: To ensure a fair and reproducible evaluation, all publicly available fetal ultrasound datasets were curated and aggregated into a unified multicenter benchmark, FetalUS-188K, comprising more than 188,000 annotated images from heterogeneous acquisition settings. DINOv3 was pretrained in a self-supervised manner to learn ultrasound-aware representations. The learned features were then evaluated through standardized adaptation protocols, including linear probing with frozen backbone and full fine-tuning, under two initialization schemes: (i) pretraining on FetalUS-188K and (ii) initialization from natural-image DINOv3 weights. Results: Models pretrained on fetal ultrasound data consistently outperformed those initialized on natural images, with weighted F1-score improvements of up to 20 percent. Domain-adaptive pretraining enabled the network to preserve subtle echogenic and structural cues crucial for distinguishing intermediate planes such as TV. Conclusion: Results demonstrate that generic foundation models fail to generalize under low inter-class variability, whereas domain-specific pretraining is essential to achieve robust and clinically reliable representations in fetal brain ultrasound imaging.
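Note: the weighted F1-score mentioned above is commonly computed as the per-class F1 averaged with weights proportional to class support. The sketch below follows that common definition; whether the paper uses exactly this variant is an assumption, and the plane labels in the toy data are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Toy labels for the three standard planes (made up for illustration).
y_true = ["TT", "TT", "TV", "TV", "TC", "TC"]
y_pred = ["TT", "TV", "TV", "TV", "TC", "TT"]
score = weighted_f1(y_true, y_pred)
print(f"weighted F1 = {score:.3f}")
```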
[69] Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users
Saurabh Kaushik, Lalit Maurya, Elizabeth Tellman, ZhiJie Zhang
Main category: cs.CV
TL;DR: Geo-Foundational Models (GFMs) show competitive performance for flood inundation mapping across different satellite sensors, with Clay emerging as the best overall performer due to better accuracy, computational efficiency, and few-shot learning capabilities compared to traditional U-Net models.
Details
Motivation: There is a lack of systematic comparison between GFMs and traditional models like U-Net for flood mapping across different sensors and data availability scenarios, which is essential to guide end-users in model selection.
Method: Evaluated three GFMs (Prithvi 2.0, Clay V1.5, and DOFA) plus UViT (a Prithvi variant) against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2 data. Conducted leave-one-region-out cross-validation across five regions and 19 sites, plus few-shot experiments with limited training data.
Result: Clay outperformed other models on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi led on Sentinel-1 (0.57). In leave-one-region-out cross-validation, Clay performed slightly better across all sensors. In few-shot experiments, Clay achieved 0.64 mIoU on PlanetScope with just five training images, versus 0.24 for Prithvi and 0.35 for DOFA. With 26M parameters, Clay is about 3x faster than Prithvi (650M) and 2x faster than DOFA (410M).
Conclusion: GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net, with Clay being the most practical choice due to its balanced performance, efficiency, and few-shot learning capabilities.
Abstract: Geo-Foundational Models (GFMs) enable fast and reliable extraction of spatiotemporal information from satellite imagery, improving flood inundation mapping by leveraging location and time embeddings. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. A systematic comparison across sensors and data availability scenarios is still lacking, which is an essential step to guide end-users in model selection. To address this, we evaluate three GFMs, Prithvi 2.0, Clay V1.5, DOFA, and UViT (a Prithvi variant), against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance among all GFMs, with only 2-5% variation between the best and worst models across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In leave-one-region-out cross-validation across five regions, Clay shows slightly better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07), 0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA (0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and Sentinel-1, respectively. Across all 19 sites, leave-one-region-out cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual inspection highlights Clay’s superior ability to retain fine details. Few-shot experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational time, Clay is a better choice due to its smaller model size (26M parameters), making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M). Contrary to previous findings, our results suggest GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net.
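Note: the mIoU values above can be read with the following minimal sketch of mean intersection-over-union on a binary flood mask. It assumes the mean is taken over the two classes (flood and non-flood); the abstract does not state the exact averaging, and the flattened toy masks are made up.

```python
def mean_iou(pred, truth, classes=(0, 1)):
    """Mean of per-class intersection-over-union for flattened label masks."""
    ious = []
    for c in classes:
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:  # skip classes that appear in neither mask
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy flattened masks (1 = flooded pixel, 0 = not flooded).
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 1]
miou = mean_iou(pred, truth)
print(f"mIoU = {miou:.3f}")
```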
[70] Locally-Supervised Global Image Restoration
Benjamin Walder, Daniel Toader, Robert Nuster, Günther Paltauf, Peter Burgholzer, Gregor Langer, Lukas Krainer, Markus Haltmeier
Main category: cs.CV
TL;DR: A learning-based method for image reconstruction from incomplete measurements that exploits multiple invariances to achieve fully supervised performance with less ground truth data, validated on photoacoustic microscopy upsampling.
Details
Motivation: Conventional supervised methods need fully sampled ground truth, while self-supervised methods rely on random sampling patterns. This work addresses fixed, deterministic sampling patterns with inherently incomplete coverage.
Method: Exploits multiple invariances of the underlying image distribution to overcome limitations of incomplete coverage in fixed sampling patterns, enabling reconstruction performance comparable to fully supervised approaches.
Result: Validated on optical-resolution image upsampling in photoacoustic microscopy, demonstrating competitive or superior reconstruction results while requiring substantially less ground truth data.
Conclusion: The proposed method successfully addresses image reconstruction from incomplete measurements with fixed sampling patterns, achieving fully supervised performance with reduced ground truth requirements.
Abstract: We address the problem of image reconstruction from incomplete measurements, encompassing both upsampling and inpainting, within a learning-based framework. Conventional supervised approaches require fully sampled ground truth data, while self-supervised methods allow incomplete ground truth but typically rely on random sampling that, in expectation, covers the entire image. In contrast, we consider fixed, deterministic sampling patterns with inherently incomplete coverage, even in expectation. To overcome this limitation, we exploit multiple invariances of the underlying image distribution, which theoretically allows us to achieve the same reconstruction performance as fully supervised approaches. We validate our method on optical-resolution image upsampling in photoacoustic microscopy (PAM), demonstrating competitive or superior results while requiring substantially less ground truth data.
[71] Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images
Tuan Truong, Guillermo Jimenez Perez, Pedro Osorio, Matthias Lenga
Main category: cs.CV
TL;DR: LMMs outperform traditional OCR for text extraction but don’t consistently improve PHI detection accuracy. Best gains are seen with complex imprint patterns, while simpler cases show similar performance across pipeline configurations.
Details
Motivation: To evaluate Large Multimodal Models (LMMs) for PHI detection in medical imaging, addressing privacy compliance needs and exploring alternatives to traditional OCR-based approaches.
Method: Systematic benchmarking of GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B using two pipeline configurations: text-only analysis and integrated OCR+semantic analysis, compared against traditional OCR models like EasyOCR.
Result: LMMs show superior OCR performance (WER: 0.03-0.05, CER: 0.02-0.03) over EasyOCR, but this doesn’t consistently translate to better PHI detection. Best improvements occur with complex imprint patterns.
Conclusion: LMMs offer enhanced OCR capabilities but limited overall PHI detection improvements. Recommendations provided for LMM selection and deployment strategies based on operational constraints.
Abstract: The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Models (LMMs) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed and open-source LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMMs exhibit superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are easily readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
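Note: WER and CER, as reported above, are conventionally the word-level and character-level edit distance divided by the reference length. The sketch below follows that standard definition; the helper names and the placeholder strings are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder imprint text: the OCR output confuses the letter O with a zero.
reference = "JOHN DOE 1970-01-01"
hypothesis = "JOHN D0E 1970-01-01"
print(f"WER = {wer(reference, hypothesis):.3f}, CER = {cer(reference, hypothesis):.3f}")
```

A single misread character counts as one wrong word out of three but only one wrong character out of nineteen, which is why WER is usually the larger of the two numbers.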
[72] StrengthSense: A Dataset of IMU Signals Capturing Everyday Strength-Demanding Activities
Zeyu Yang, Clayton Souza Leite, Yu Xiao
Main category: cs.CV
TL;DR: StrengthSense is an open IMU dataset capturing 11 strength-demanding activities and 2 non-strength activities from 29 subjects using 10 body-worn sensors, with video-validated joint angle accuracy.
Details
Motivation: There is a lack of comprehensive datasets capturing strength-demanding activities using wearable sensors like IMUs, which are crucial for monitoring muscular strength, endurance, and power.
Method: Collected data from 29 healthy subjects using 10 IMUs placed on limbs and torso, annotated using video recordings as references. Conducted comparative analysis between IMU-estimated joint angles and video-extracted angles for validation.
Result: Created a comprehensive open dataset with verified accuracy and reliability of sensor data through joint angle comparison between IMU and video measurements.
Conclusion: StrengthSense enables researchers and developers to advance human activity recognition algorithms and create fitness/health monitoring applications by providing validated strength-activity data.
Abstract: Tracking strength-demanding activities with wearable sensors like IMUs is crucial for monitoring muscular strength, endurance, and power. However, there is a lack of comprehensive datasets capturing these activities. To fill this gap, we introduce StrengthSense, an open dataset that encompasses IMU signals capturing 11 strength-demanding activities, such as sit-to-stand, climbing stairs, and mopping. For comparative purposes, the dataset also includes 2 non-strength demanding activities. The dataset was collected from 29 healthy subjects utilizing 10 IMUs placed on limbs and the torso, and was annotated using video recordings as references. This paper provides a comprehensive overview of the data collection, pre-processing, and technical validation. We conducted a comparative analysis between the joint angles estimated by IMUs and those directly extracted from video to verify the accuracy and reliability of the sensor data. Researchers and developers can utilize StrengthSense to advance the development of human activity recognition algorithms, create fitness and health monitoring applications, and more.
[73] Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis
Soham Joshi, Shwet Kamal Mishra, Viswanath Gopalakrishnan
Main category: cs.CV
TL;DR: Proposes an automated pipeline for synthesizing large-scale text-VQA datasets using OCR, ROI detection, caption generation, and question generation models.
Details
Motivation: Manual annotation for text-VQA datasets is tedious and challenging, so there is a need for an automated pipeline that leverages foundation models and mature OCR systems.
Method: Streamlined pipeline combining OCR detection/recognition, ROI detection, caption generation, and question generation to automatically synthesize and validate QA pairs.
Result: Created the first automated pipeline for text-VQA dataset synthesis, producing ~72K QA pairs from ~44K images.
Conclusion: Established scalable end-to-end pipeline for automated text-VQA dataset synthesis that can handle large-scale scene text data.
Abstract: Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.
[74] Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning
Anders Austlid Taskén, Thierry Judge, Erik Andreas Rye Berg, Jinyang Yu, Bjørnar Grenne, Frank Lindseth, Svend Aakhus, Pierre-Marc Jodoin, Nicolas Duchateau, Olivier Bernard, Gabriel Kiss
Main category: cs.CV
TL;DR: This study introduces autoStrain, the first automated pipeline for segmental longitudinal strain estimation in transesophageal echocardiography using deep learning methods for motion estimation, comparing RAFT-based TeeFlow and CoTracker-based TeeTracker approaches.
Details
Motivation: Current techniques for strain estimation require significant manual intervention and expertise, limiting efficiency and making them too resource-intensive for monitoring purposes. There is a need for automated solutions to enhance cardiac function assessment.
Method: Used a simulation pipeline (SIMUS) to generate realistic synthetic TEE dataset with ground truth myocardial motion. Compared two deep learning approaches: TeeFlow (RAFT optical flow model) for dense frame-to-frame predictions and TeeTracker (CoTracker point trajectory model) for sparse long-sequence predictions.
Result: TeeTracker outperformed TeeFlow with mean distance error of 0.65 mm on synthetic test data. Clinical validation on 16 patients showed SLS estimation aligned with clinical references (mean difference 1.09% with -8.90% to 11.09% limits of agreement). Simulated ischemia improved model accuracy for abnormal deformation.
Conclusion: AI-driven motion estimation integrated with TEE can significantly enhance precision and efficiency of cardiac function assessment in clinical settings, providing automated strain estimation that aligns with clinical standards.
Abstract: Segmental longitudinal strain (SLS) of the left ventricle (LV) is an important prognostic indicator for evaluating regional LV dysfunction, in particular for diagnosing and managing myocardial ischemia. Current techniques for strain estimation require significant manual intervention and expertise, limiting their efficiency and making them too resource-intensive for monitoring purposes. This study introduces the first automated pipeline, autoStrain, for SLS estimation in transesophageal echocardiography (TEE) using deep learning (DL) methods for motion estimation. We present a comparative analysis of two DL approaches: TeeFlow, based on the RAFT optical flow model for dense frame-to-frame predictions, and TeeTracker, based on the CoTracker point trajectory model for sparse long-sequence predictions. As ground truth motion data from real echocardiographic sequences are hardly accessible, we took advantage of a unique simulation pipeline (SIMUS) to generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground truth myocardial motion to train and evaluate both models. Our evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a mean distance error in motion estimation of 0.65 mm on a synTEE test dataset. Clinical validation on 16 patients further demonstrated that SLS estimation with our autoStrain pipeline aligned with clinical references, achieving a mean difference (95% limits of agreement) of 1.09% (-8.90% to 11.09%). Incorporation of simulated ischemia in the synTEE data improved the accuracy of the models in quantifying abnormal deformation. Our findings indicate that integrating AI-driven motion estimation with TEE can significantly enhance the precision and efficiency of cardiac function assessment in clinical settings.
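Note: two quantities in the abstract above lend themselves to a short sketch. Longitudinal strain is conventionally the relative length change of a segment, and 95% limits of agreement are conventionally the mean difference plus or minus 1.96 standard deviations (Bland-Altman). Both conventions are assumed here rather than confirmed by the abstract, and the function names and toy differences are hypothetical.

```python
import statistics


def longitudinal_strain(length, reference_length):
    """Strain in percent: relative change from the reference segment length."""
    return 100.0 * (length - reference_length) / reference_length


def limits_of_agreement(differences, z=1.96):
    """Mean difference and the lower/upper 95% limits of agreement."""
    mean_diff = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation
    return mean_diff, mean_diff - z * sd, mean_diff + z * sd


# A segment shortening from 10.0 mm to 9.0 mm gives a strain of -10%.
print(longitudinal_strain(9.0, 10.0))

# Toy paired differences (method minus clinical reference), in strain %.
diffs = [1.0, -1.0, 3.0, 1.0]
mean_diff, lower, upper = limits_of_agreement(diffs)
print(f"{mean_diff:.2f} ({lower:.2f} to {upper:.2f})")

# Under the same convention, the reported limits (-8.90% to 11.09%) would
# imply a standard deviation of the differences of roughly 5.1 points.
implied_sd = (11.09 - (-8.90)) / (2 * 1.96)
print(f"implied SD = {implied_sd:.2f}")
```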
[75] Light Future: Multimodal Action Frame Prediction via InstructPix2Pix
Zesen Zhong, Duomin Zhang, Yijia Li
Main category: cs.CV
TL;DR: Proposes a lightweight robot action prediction method using InstructPix2Pix for future visual frame forecasting with single image and text input, achieving superior performance with reduced computational cost.
Details
Motivation: Need for efficient motion trajectory prediction in robotics and autonomous systems to enable safer decision-making, with reduced computational requirements compared to conventional video prediction models.
Method: Adapts and fine-tunes InstructPix2Pix model for multimodal future frame prediction, accepting both visual (single image) and textual inputs to forecast 100 frames (10 seconds) into the future.
Result: Achieves superior SSIM and PSNR scores on RoboTWin dataset compared to state-of-the-art baselines, with significantly reduced computational cost and inference latency.
Conclusion: The approach enables efficient robot action prediction with flexible multimodal control, prioritizing motion trajectory precision over visual fidelity, making it valuable for robotics and sports analytics applications.
Abstract: Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
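Note: of the two image-quality metrics named above, PSNR is simple enough to sketch directly: it is 10 times the base-10 logarithm of the squared peak value divided by the mean squared error. The sketch assumes 8-bit pixel values (peak 255) and uses made-up flattened pixel lists; SSIM is omitted because it needs windowed statistics.

```python
import math


def psnr(reference, predicted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "frames" (made up): ground-truth future frame vs. prediction.
reference = [0, 255, 128, 64]
predicted = [0, 250, 130, 64]
value = psnr(reference, predicted)
print(f"PSNR = {value:.2f} dB")
```

Higher values indicate a predicted frame closer to the ground-truth frame.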
[76] Markerless Augmented Reality Registration for Surgical Guidance: A Multi-Anatomy Clinical Accuracy Study
Yue Yang, Fabian Necker, Christoph Leuze, Michelle Chen, Andrey Finegersh, Jake Lee, Vasu Divi, Bruce Daniel, Brian Hargreaves, Jie Ying Wu, Fred M Baik
Main category: cs.CV
TL;DR: A depth-only, markerless AR registration pipeline on HoloLens 2 achieved ~3-4 mm median error in live surgical settings for small anatomies like feet, ears, and lower legs without requiring fiducial markers.
Details
Motivation: To develop and clinically evaluate a markerless AR registration method that works on small or low-curvature anatomies in real surgical settings, eliminating the need for fiducial markers.
Method: Used HoloLens 2 with AHAT depth tracking aligned to CT-derived skin meshes via depth-bias correction, human-in-the-loop initialization, and global+local registration. Validated with AR-tracked tools and performed 7 intraoperative trials.
Result: Preclinical validation showed tight agreement (0.78-1.20 mm RMSE). Clinical median errors: 3.2 mm (feet), 4.3 mm (ear), 5.3 mm (lower leg), with 5 mm coverage of 72-95% depending on anatomy.
Conclusion: The markerless AR pipeline achieved clinically relevant accuracy (~3-4 mm median error) for moderate-risk surgical tasks on small anatomies, improving clinical readiness of markerless AR guidance.
Abstract: Purpose: In this paper, we develop and clinically evaluate a depth-only, markerless augmented reality (AR) registration pipeline on a head-mounted display, and assess accuracy across small or low-curvature anatomies in real-life operative settings. Methods: On HoloLens 2, we align Articulated HAnd Tracking (AHAT) depth to Computed Tomography (CT)-derived skin meshes via (i) depth-bias correction, (ii) brief human-in-the-loop initialization, (iii) global and local registration. We validated the surface-tracing error metric by comparing “skin-to-bone” relative distances to CT ground truth on leg and foot models, using an AR-tracked tool. We then performed seven intraoperative target trials (feet x2, ear x3, leg x2) during the initial stage of fibula free-flap harvest and mandibular reconstruction surgery, and collected 500+ data points per trial. Results: Preclinical validation showed tight agreement between AR-traced and CT distances (leg: median |Delta d| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm, 1.20 mm). Clinically, per-point error had a median of 3.9 mm. Median errors by anatomy were 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with 5 mm coverage 92-95%, 84-90%, and 72-86%, respectively. Feet vs. lower leg differed significantly (Delta median ~1.1 mm; p < 0.001). Conclusion: A depth-only, markerless AR pipeline on HMDs achieved ~3-4 mm median error across feet, ear, and lower leg in live surgical settings without fiducials, approaching typical clinical error thresholds for moderate-risk tasks. Human-guided initialization plus global-to-local registration enabled accurate alignment on small or low-curvature targets, improving the clinical readiness of markerless AR guidance.
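Note: the per-anatomy summary statistics above can be sketched as follows, assuming that "5 mm coverage" means the fraction of measured points whose error is at most 5 mm (the abstract does not define the term). The function name and the toy errors are hypothetical.

```python
import statistics


def summarize_errors(errors_mm, threshold_mm=5.0):
    """Median error and the share of points within the given threshold."""
    median = statistics.median(errors_mm)
    coverage = sum(1 for e in errors_mm if e <= threshold_mm) / len(errors_mm)
    return median, coverage


# Toy per-point registration errors in millimetres (made up).
errors = [1.2, 2.8, 3.1, 3.9, 4.4, 5.6, 7.0, 2.0]
median, coverage = summarize_errors(errors)
print(f"median = {median:.1f} mm, 5 mm coverage = {coverage:.0%}")
```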
[77] From Instance Segmentation to 3D Growth Trajectory Reconstruction in Planktonic Foraminifera
Huahua Lin, Xiaohao Cai, Mark Nixon, James M. Mulqueeney, Thomas H. G. Ezard
Main category: cs.CV
TL;DR: Automated pipeline for reconstructing 3D growth trajectories of planktonic foraminifera from CT scans using instance segmentation and chamber ordering algorithms.
Details
Motivation: Manual segmentation of foraminifera chambers is time-consuming and subjective, limiting large-scale ecological studies. Automated methods are needed to analyze organismal development under changing environments.
Method: End-to-end pipeline combining instance segmentation (computer vision technique) with dedicated chamber ordering algorithm to reconstruct 3D growth trajectories from high-resolution CT scans.
Result: Pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. Chamber-ordering algorithm remains robust even under partial segmentation, achieving consistent reconstruction of developmental trajectories.
Conclusion: First fully automated and reproducible pipeline for digital foraminiferal growth analysis, establishing foundation for large-scale, data-driven ecological studies.
Abstract: Planktonic foraminifera, marine protists characterized by their intricate chambered shells, serve as valuable indicators of past and present environmental conditions. Understanding their chamber growth trajectory provides crucial insights into organismal development and ecological adaptation under changing environments. However, automated tracing of chamber growth from imaging data remains largely unexplored, with existing approaches relying heavily on manual segmentation of each chamber, which is time-consuming and subjective. In this study, we propose an end-to-end pipeline that integrates instance segmentation, a computer vision technique not extensively explored in foraminifera, with a dedicated chamber ordering algorithm to automatically reconstruct three-dimensional growth trajectories from high-resolution computed tomography scans. We quantitatively and qualitatively evaluate multiple instance segmentation methods, each optimized for distinct spatial features of the chambers, and examine their downstream influence on growth-order reconstruction accuracy. Experimental results on expert-annotated datasets demonstrate that the proposed pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. Although segmentation models exhibit under-segmentation in smaller chambers due to reduced voxel fidelity and subtle inter-chamber connectivity, the chamber-ordering algorithm remains robust, achieving consistent reconstruction of developmental trajectories even under partial segmentation. This work provides the first fully automated and reproducible pipeline for digital foraminiferal growth analysis, establishing a foundation for large-scale, data-driven ecological studies.
[78] Fast Measuring Pavement Crack Width by Cascading Principal Component Analysis
Zhicheng Wang, Junbiao Pang
Main category: cs.CV
TL;DR: A cascaded PCA-RPCA framework for efficient crack width measurement from pavement images, addressing complex crack morphology and rapid measurement needs.
Details
Motivation: Precise crack width quantification is crucial for pavement structural assessment but challenging due to complex crack boundaries and need for rapid measurements from arbitrary locations.
Method: Three-stage framework: 1) crack segmentation using detection algorithms, 2) PCA for orientation axis of quasi-parallel cracks, 3) RPCA for Main Propagation Axis of irregular cracks.
Result: Superior performance in computational efficiency and measurement accuracy compared to state-of-the-art methods across three public datasets.
Conclusion: The proposed PCA-RPCA framework effectively addresses crack width measurement challenges and outperforms existing techniques.
Abstract: Accurate quantification of pavement crack width plays a pivotal role in assessing structural integrity and guiding maintenance interventions. However, achieving precise crack width measurements presents significant challenges due to: (1) the complex, non-uniform morphology of crack boundaries, which limits the efficacy of conventional approaches, and (2) the demand for rapid measurement capabilities from arbitrary pixel locations to facilitate comprehensive pavement condition evaluation. To overcome these limitations, this study introduces a cascaded framework integrating Principal Component Analysis (PCA) and Robust PCA (RPCA) for efficient crack width extraction from digital images. The proposed methodology comprises three sequential stages: (1) initial crack segmentation using established detection algorithms to generate a binary representation, (2) determination of the primary orientation axis for quasi-parallel cracks through PCA, and (3) extraction of the Main Propagation Axis (MPA) for irregular crack geometries using RPCA. Comprehensive evaluations were conducted across three publicly available datasets, demonstrating that the proposed approach achieves superior performance in both computational efficiency and measurement accuracy compared to existing state-of-the-art techniques.
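The second stage described above (finding an orientation axis with PCA) can be sketched for the simplest case of a roughly straight crack. This is an illustrative assumption about how such a step might look, not the authors' implementation: it uses the closed-form principal axis of a 2D point set and measures the extent of the crack pixels perpendicular to that axis. The RPCA stage for irregular cracks is not shown, and all names are hypothetical.

```python
import math

def principal_axis_angle(points):
    """Angle (radians) of the first principal axis of 2D points.

    Uses the closed form for a 2x2 covariance matrix:
    theta = 0.5 * atan2(2 * s_xy, s_xx - s_yy).
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

def width_across_axis(points, angle):
    """Centre-to-centre extent of the points perpendicular to `angle`."""
    nx, ny = -math.sin(angle), math.cos(angle)  # unit normal to the axis
    proj = [x * nx + y * ny for x, y in points]
    return max(proj) - min(proj)

# Hypothetical binary mask: a horizontal crack 20 pixels long, 3 pixels thick.
crack_pixels = [(x, y) for x in range(20) for y in range(3)]
angle = principal_axis_angle(crack_pixels)
width = width_across_axis(crack_pixels, angle)
print(f"axis angle = {math.degrees(angle):.1f} deg, width = {width:.1f} px")
```

Note that the extent is measured between pixel centres, so a band three pixels thick yields a centre-to-centre width of two.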
[79] Autobiasing Event Cameras for Flickering Mitigation
Mehdi Sefidgar Dilmaghani, Waseem Shariff, Cian Ryan, Joe Lemley, Peter Corcoran
Main category: cs.CV
TL;DR: Autonomous bias tuning mechanism for event cameras that reduces flicker effects across 25-500 Hz frequency range using CNN-based detection and dynamic bias adjustment, improving face detection performance in various lighting conditions.
Details
Motivation: Address flicker effects caused by rapid light intensity variations that degrade event camera performance in diverse environments, without requiring additional hardware or software filters.
Method: Uses CNN to detect flicker in spatial space and dynamically adjusts event camera bias settings to minimize flicker impact, tested with face detector framework under well-lit and low-light conditions.
Result: Significant improvements: enhanced YOLO confidence for face detection, increased percentage of frames with detected faces, and 38.2% (well-lit) to 53.6% (low-light) reduction in average gradient indicating flicker presence.
Conclusion: The autobiasing system effectively improves event camera functionality in adverse lighting scenarios by leveraging inherent bias settings rather than external filters.
Abstract: Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range, from 25 Hz to 500 Hz. Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event camera's inherent bias settings. Utilizing a simple Convolutional Neural Network (CNN), the system identifies instances of flicker in a spatial space and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2 percent in well-lit conditions and by 53.6 percent in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.
[80] Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models
Jinhwan Seo, Yoonki Cho, Junhyug Noh, Sung-eui Yoon
Main category: cs.CV
TL;DR: A framework for Grounded Video Question Answering that decomposes the task into video reasoning, spatio-temporal grounding, and tracking stages, using a trigger moment concept to significantly improve performance.
Details
Motivation: The GVQA task requires robust multimodal models capable of complex reasoning over video content, visual grounding of answers, and temporal tracking of referenced objects.
Method: A three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking, with a key innovation of using a trigger moment derived from CORTEX prompt to identify the most visible frame of target objects.
Result: Achieved a HOTA score of 0.4968, a significant improvement over the previous year’s winning score of 0.2704 on the GVQA task.
Conclusion: The proposed framework with trigger moment concept effectively addresses the GVQA task by providing robust anchors for grounding and tracking, demonstrating substantial performance gains.
Abstract: In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, which marks a significant improvement over the previous year’s winning score of 0.2704 on the GVQA task.
[81] MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation
Jiawen Liu, Yuanbo Zeng, Jiaming Liang, Yizhen Yang, Yiheng Zhang, Enhui Cai, Xiaoqi Sheng, Hongmin Cai
Main category: cs.CV
TL;DR: MM-UNet is a novel deep learning architecture for retinal vessel segmentation that uses Morph Mamba Convolution layers and Reverse Selective State Guidance modules to improve segmentation accuracy of thin, branching vascular structures.
Details
Motivation: Retinal vessel segmentation is crucial for clinical diagnosis but challenging due to the extremely thin and branching nature of retinal vasculature, which differs significantly from conventional segmentation targets and poses challenges to precision and robustness.
Method: Proposes MM-UNet with Morph Mamba Convolution layers that replace pointwise convolutions to enhance branching topological perception, and Reverse Selective State Guidance modules that integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency.
Result: Achieves F1-score gains of 1.64% on DRIVE and 1.25% on STARE datasets compared to existing approaches, demonstrating superior segmentation accuracy.
Conclusion: MM-UNet effectively addresses the challenges of retinal vessel segmentation and shows advancement in segmentation accuracy for thin, branching vascular structures.
Abstract: Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph, state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64% on DRIVE and 1.25% on STARE, demonstrating its effectiveness and advancement. The project code is public via https://github.com/liujiawen-jpg/MM-UNet.
[82] Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers
Zhengjie Zhang, Xiaoxie Mao, Qihao Guo, Shaoting Zhang, Qi Huang, Mu Zhou, Fang Xie, Mianxin Liu
Main category: cs.CV
TL;DR: A language-enhanced generative model synthesizes realistic amyloid-beta PET images from blood biomarkers and MRI, enabling accurate Alzheimer’s diagnosis without expensive PET scans.
Details
Motivation: Alzheimer's disease diagnosis relies on expensive and inaccessible amyloid-beta PET imaging. This study aims to predict PET spatial patterns from more accessible blood biomarkers and MRI scans.
Method: Collected data from 566 participants (PET, MRI, blood biomarkers). Developed LLM-driven generative model with multimodal fusion to synthesize PET images. Evaluated synthetic images for quality and diagnostic consistency in automated pipeline.
Result: Synthetic PET closely resembles real PET (SSIM=0.920, r=0.955). Diagnostic accuracy=0.80 vs real PET. Synthetic PET-based model (AUC=0.78) outperforms MRI-only (0.68) and blood biomarker-only (0.73) models. Combination with blood biomarkers further improved performance (AUC=0.79).
Conclusion: Language-enhanced generative model successfully synthesizes realistic PET images, enhancing utility of MRI and blood biomarkers for Alzheimer’s assessment and improving diagnostic workflow.
Abstract: Background: Alzheimer’s disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson’s r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer’s disease.
[83] Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping
Jiajia Li, Keyi Zhu, Qianwen Zhang, Dong Chen, Qi Sun, Zhaojian Li
Main category: cs.CV
TL;DR: A novel object-centric 3D reconstruction framework using SAM-2 and alpha channel masking for clean strawberry plant reconstructions, enabling automatic plant trait estimation with improved accuracy and efficiency.
Details
Motivation: Traditional plant phenotyping methods are time-consuming, labor-intensive, and destructive. Current 3DGS applications in agriculture reconstruct entire scenes including background elements, which introduces noise, increases computational costs, and complicates trait analysis.
Method: Proposed object-centric 3D reconstruction framework with preprocessing pipeline using Segment Anything Model v2 (SAM-2) and alpha channel background masking. Uses DBSCAN clustering and Principal Component Analysis (PCA) for automatic plant trait estimation.
Result: Method outperforms conventional pipelines in both accuracy and efficiency, producing more accurate geometric representations while substantially reducing computational time. Enables automatic estimation of important plant traits like plant height and canopy width.
Conclusion: Offers a scalable and non-destructive solution for strawberry plant phenotyping, addressing limitations of current 3DGS applications in agricultural domains.
Abstract: Strawberries are among the most economically significant fruits in the United States, generating over $2 billion in annual farm-gate sales and accounting for approximately 13% of the total fruit production value. Plant phenotyping plays a vital role in selecting superior cultivars by characterizing plant traits such as morphology, canopy structure, and growth dynamics. However, traditional plant phenotyping methods are time-consuming, labor-intensive, and often destructive. Recently, neural rendering techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful frameworks for high-fidelity 3D reconstruction. By capturing a sequence of multi-view images or videos around a target plant, these methods enable non-destructive reconstruction of complex plant architectures. Despite their promise, most current applications of 3DGS in agricultural domains reconstruct the entire scene, including background elements, which introduces noise, increases computational costs, and complicates downstream trait analysis. To address this limitation, we propose a novel object-centric 3D reconstruction framework incorporating a preprocessing pipeline that leverages the Segment Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean strawberry plant reconstructions. This approach produces more accurate geometric representations while substantially reducing computational time. With a background-free reconstruction, our algorithm can automatically estimate important plant traits, such as plant height and canopy width, using DBSCAN clustering and Principal Component Analysis (PCA). Experimental results show that our method outperforms conventional pipelines in both accuracy and efficiency, offering a scalable and non-destructive solution for strawberry plant phenotyping.
[84] Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment
Zhihao Zhan, Wang Pang, Xiang Zhu, Yechao Bai
Main category: cs.CV
TL;DR: A diffusion-based video super-resolution method using a diffusion transformer in latent space that eliminates the need for explicit motion estimation or optical flow alignment.
Details
Motivation: To overcome limitations of traditional video super-resolution methods that require explicit optical flow estimation and motion parameter alignment, by leveraging a powerful diffusion model that learns real-world physics as prior knowledge.
Method: Uses Diffusion Posterior Sampling framework with an unconditional video diffusion transformer operating in latent space as a space-time model, enabling alignment-free processing without explicit motion estimation.
Result: Empirical results on synthetic and real-world datasets demonstrate the feasibility of diffusion-based, alignment-free video super-resolution, with the model adapting to different sampling conditions without retraining.
Conclusion: A single video diffusion transformer model can effectively handle various motion patterns as prior knowledge, eliminating the need for explicit optical flow estimation while maintaining adaptability to different sampling conditions.
Abstract: In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.
[85] Can Foundation Models Revolutionize Mobile AR Sparse Sensing?
Yiqin Zhao, Tian Guo
Main category: cs.CV
TL;DR: Foundation models significantly improve mobile sparse sensing by enhancing geometry-aware image warping and 3D scene reconstruction, overcoming traditional trade-offs between sensing quality and efficiency.
Details
Motivation: Mobile sensing systems face fundamental trade-offs between quality and efficiency due to computation and power constraints. Existing sparse sensing methods often suffer from reduced accuracy due to missing information across space and time.
Method: Investigate foundation models for mobile sparse sensing using real-world mobile AR data, focusing on geometry-aware image warping to enable accurate reuse of cross-frame information.
Result: Foundation models offer significant improvements in geometry-aware image warping and demonstrate leading performance in 3D scene reconstruction with scalability.
Conclusion: Foundation models show promise for transforming mobile sparse sensing but also reveal open challenges for integration into mobile systems.
Abstract: Mobile sensing systems have long faced a fundamental trade-off between sensing quality and efficiency due to constraints in computation, power, and other limitations. Sparse sensing, which aims to acquire and process only a subset of sensor data, has been a key strategy for maintaining performance under such constraints. However, existing sparse sensing methods often suffer from reduced accuracy, as missing information across space and time introduces uncertainty into many sensing systems. In this work, we investigate whether foundation models can change the landscape of mobile sparse sensing. Using real-world mobile AR data, our evaluations demonstrate that foundation models offer significant improvements in geometry-aware image warping, a central technique for enabling accurate reuse of cross-frame information. Furthermore, our study demonstrates the scalability of foundation model-based sparse sensing and shows its leading performance in 3D scene reconstruction. Collectively, our study reveals critical aspects of the promises and the open challenges of integrating foundation models into mobile sparse sensing systems.
[86] Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer’s Disease Diagnosis
Delin Ma, Menghui Zhou, Jun Qi, Yun Yang, Po Yang
Main category: cs.CV
TL;DR: A novel multimodal fusion framework for Alzheimer’s disease diagnosis using MRI and PET that addresses modality-specific feature importance and distribution alignment issues.
Details
Motivation: Early AD diagnosis is crucial, and while multimodal fusion shows promise, existing methods overlook modality-specific features and suffer from distributional differences between MRI and PET data.
Method: Collaborative Attention and Consistent-Guided Fusion framework with learnable parameter representation block, shared encoder, modality-independent encoders, and consistency-guided mechanism for distribution alignment.
Result: Superior diagnostic performance on ADNI dataset compared to existing fusion strategies.
Conclusion: The proposed framework effectively addresses modality-specific feature preservation and distribution alignment challenges in multimodal AD diagnosis.
Abstract: Alzheimer’s disease (AD) is the most prevalent form of dementia, and its early diagnosis is essential for slowing disease progression. Recent studies on multimodal neuroimaging fusion using MRI and PET have achieved promising results by integrating multi-scale complementary features. However, most existing approaches primarily emphasize cross-modal complementarity while overlooking the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often lead to biased and noisy representations, degrading classification performance. To address these challenges, we propose a Collaborative Attention and Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The proposed model introduces a learnable parameter representation (LPR) block to compensate for missing modality information, followed by a shared encoder and modality-independent encoders to preserve both shared and specific representations. Furthermore, a consistency-guided mechanism is employed to explicitly align the latent distributions across modalities. Experimental results on the ADNI dataset demonstrate that our method achieves superior diagnostic performance compared with existing fusion strategies.
[87] Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency
Hao Li, Daiwei Lu, Jesse d’Almeida, Dilara Isik, Ehsan Khodapanah Aghdam, Nick DiSanto, Ayberk Acar, Susheela Sharma, Jie Ying Wu, Robert J. Webster III, Ipek Oguz
Main category: cs.CV
TL;DR: Latent feature alignment method improves absolute depth estimation in endoscopic videos by reducing domain gap between synthetic and real images using adversarial learning and directional feature consistency.
Details
Motivation: Obtaining absolute depth from endoscopy cameras in surgical scenes is difficult, limiting supervised learning on real endoscopic images. Current domain adaptation methods still leave a domain gap between real and translated synthetic images.
Method: Uses latent feature alignment with adversarial learning and directional feature consistency to learn domain-invariant features. The depth network takes both translated synthetic and real endoscopic frames as input.
Result: Achieves superior performance on both absolute and relative depth metrics compared to state-of-the-art methods, with consistent improvements across various backbones and pretrained weights.
Conclusion: The proposed latent feature alignment method effectively reduces domain gap and improves absolute depth estimation for endoscopic videos, being agnostic to the image translation process.
Abstract: Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However, a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our methods are agnostic to the image translation process and focus on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at https://github.com/MedICL-VU/MDE.
[88] Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework
Yucheng Song, Yifan Ge, Junhao Li, Zhining Liao, Zhifang Liao
Main category: cs.CV
TL;DR: HTSC-CIF framework addresses three key challenges in Medical Report Generation through hierarchical task decomposition: low-level medical entity alignment, mid-level cross-modal alignment, and high-level causal intervention to reduce biases.
Details
Motivation: Current MRG models face three main challenges: insufficient domain knowledge understanding, poor text-visual entity alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges.
Method: Hierarchical task decomposition with three levels: 1) Low-level: align medical entity features with spatial locations; 2) Mid-level: use Prefix Language Modeling and Masked Image Modeling for cross-modal alignment; 3) High-level: cross-modal causal intervention via front-door intervention.
Result: Extensive experiments confirm HTSC-CIF significantly outperforms state-of-the-art MRG methods.
Conclusion: HTSC-CIF effectively addresses all three key challenges in MRG through its hierarchical framework, demonstrating superior performance over existing methods.
Abstract: Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists’ burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF’s effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.
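For readers unfamiliar with the front-door intervention mentioned in the abstract, the textbook front-door adjustment from causal inference can be written as below. Here X (input), Z (mediator), and Y (output) are generic symbols chosen for illustration; the abstract does not state how HTSC-CIF instantiates or approximates this formula, so no mapping to the paper's modules is implied.

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P(z \mid x) \sum_{x'} P(Y \mid x', z)\, P(x')
```

The inner sum averages over all values x' of the input, which is what removes the influence of an unobserved confounder acting on both X and Y.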
[89] Are Euler angles a useful rotation parameterisation for pose estimation with Normalizing Flows?
Giorgos Sfikas, Konstantina Nikolaidou, Foteini Papadopoulou, George Retsinas, Anastasios L. Kesidis
Main category: cs.CV
TL;DR: The paper explores using Euler angles parameterization for Normalizing Flows models in probabilistic object pose estimation, arguing that despite their shortcomings, Euler angles can lead to useful models compared to more complex parameterizations.
Details
Motivation: Object pose estimation often requires probabilistic outputs when pose is ambiguous due to sensor constraints, projection limitations, or object symmetries. A single point estimate may not be sufficient in these cases.
Method: The authors propose using Euler angles parameterization as a basis for Normalizing Flows models for pose estimation, comparing this approach to models built on more complex parameterizations.
Result: The paper explores the usefulness of Euler angles in probabilistic pose estimation models, examining whether this simpler parameterization can produce effective results despite its known limitations.
Conclusion: Euler angles, despite their shortcomings, may lead to useful probabilistic pose estimation models that are competitive with more complex parameterizations in certain aspects.
Abstract: Object pose estimation is a task of central importance in 3D Computer Vision. Given a target image and a canonical pose, a single point estimate may very often be sufficient; however, a probabilistic pose output offers a number of benefits when pose is ambiguous due to sensor and projection constraints or inherent object symmetries. With this paper, we explore the usefulness of the well-known Euler angles parameterisation as a basis for a Normalizing Flows model for pose estimation. Isomorphic to spatial rotation, 3D pose has been parameterized in a number of ways, either in or out of the context of parameter estimation. We explore the idea that Euler angles, despite their shortcomings, may lead to useful models in a number of aspects, compared to a model built on a more complex parameterisation.
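To make the "shortcomings" of Euler angles concrete, the sketch below converts between Euler angles and a rotation matrix and shows the gimbal-lock ambiguity. It assumes the intrinsic Z-Y-X (yaw, pitch, roll) convention, which the abstract does not specify; the function names are illustrative and not taken from the paper.

```python
import math

def euler_zyx_to_matrix(yaw, pitch, roll):
    """Rotation matrix R = Rz(yaw) * Ry(pitch) * Rx(roll); angles in radians.
    The Z-Y-X convention is an assumption made for this sketch."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]

def matrix_to_euler_zyx(R):
    """Inverse map; only well defined away from gimbal lock (|pitch| < pi/2)."""
    sp = max(-1.0, min(1.0, -R[2][0]))  # clamp against rounding error
    pitch = math.asin(sp)
    yaw = math.atan2(R[1][0], R[0][0])
    roll = math.atan2(R[2][1], R[2][2])
    return yaw, pitch, roll

# Away from gimbal lock the parameterisation round-trips.
angles = matrix_to_euler_zyx(euler_zyx_to_matrix(0.3, -0.5, 1.2))

# At pitch = pi/2 only the difference (roll - yaw) matters, so two different
# angle triples describe the same rotation.
A = euler_zyx_to_matrix(0.4, math.pi / 2, 0.1)
B = euler_zyx_to_matrix(0.7, math.pi / 2, 0.4)
```

This many-to-one behaviour at gimbal lock is one of the known limitations that any density model defined on Euler angles has to cope with.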
[90] SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng
Main category: cs.CV
TL;DR: SAIL-RL is a reinforcement learning framework that teaches multimodal LLMs when and how to think using dual rewards for reasoning quality and adaptive thinking strategies.
Details
Motivation: Existing approaches suffer from outcome-only supervision (rewarding answers without ensuring sound reasoning) and uniform thinking strategies (overthinking simple tasks, underthinking complex ones).
Method: Uses dual reward system: Thinking Reward evaluates reasoning quality (factual grounding, logical coherence, answer consistency) and Judging Reward adaptively determines when to use deep reasoning vs direct answering.
Result: Improves reasoning and multimodal understanding benchmarks at 4B and 8B scales, achieves competitive performance against GPT-4o, and substantially reduces hallucinations.
Conclusion: SAIL-RL establishes a principled framework for building more reliable and adaptive multimodal large language models.
Abstract: We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
[91] Link prediction Graph Neural Networks for structure recognition of Handwritten Mathematical Expressions
Cuong Tuan Nguyen, Ngoc Tuan Nguyen, Triet Hoang Minh Dao, Huy Minh Nhat, Huy Truong Dinh
Main category: cs.CV
TL;DR: A GNN-based method for HME recognition that models expressions as graphs, using BLSTM for symbol processing and GNN for refining spatial relations to build Symbol Label Graphs.
Details
Motivation: To improve handwritten mathematical expression recognition by effectively capturing spatial dependencies between symbols through graph-based modeling.
Method: Uses deep BLSTM for symbol segmentation/recognition, 2D-CFG parser for spatial relations, and GNN link prediction to refine graph structure by removing unnecessary connections.
Result: Experimental results show promising performance in HME structure recognition, demonstrating the effectiveness of the approach.
Conclusion: The proposed GNN-based framework successfully recognizes handwritten mathematical expressions by modeling them as graphs and refining spatial dependencies.
Abstract: We propose a Graph Neural Network (GNN)-based approach for Handwritten Mathematical Expression (HME) recognition by modeling HMEs as graphs, where nodes represent symbols and edges capture spatial dependencies. A deep BLSTM network is used for symbol segmentation, recognition, and spatial relation classification, forming an initial primitive graph. A 2D-CFG parser then generates all possible spatial relations, while the GNN-based link prediction model refines the structure by removing unnecessary connections, ultimately forming the Symbol Label Graph. Experimental results demonstrate the effectiveness of our approach, showing promising performance in HME structure recognition.
[92] Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization
Shaohan Li, Yunpeng Shi, Gilad Lerman
Main category: cs.CV
TL;DR: Cycle-Sync is a robust global framework for camera pose estimation that uses modified message-passing least squares with cycle consistency and robust loss functions, achieving state-of-the-art performance without bundle adjustment.
Details
Motivation: To develop a robust and global camera pose estimation framework that avoids the computational complexity of bundle adjustment while maintaining high accuracy and providing strong theoretical guarantees.
Method: Adapts message-passing least squares (MPLS) for camera location estimation, emphasizes cycle-consistent information, redefines cycle consistencies using estimated distances, incorporates Welsch-type robust loss, and adds outlier rejection via robust subspace recovery. Fully integrates cycle consistency into rotation synchronization.
Result: Establishes strongest known deterministic exact-recovery guarantee for camera location estimation. Outperforms leading pose estimators including full structure-from-motion pipelines with bundle adjustment on synthetic and real datasets.
Conclusion: Cycle-Sync provides a robust, global camera pose estimation framework that achieves state-of-the-art performance through cycle consistency and modified MPLS, eliminating the need for bundle adjustment while offering strong theoretical guarantees.
Abstract: We introduce Cycle-Sync, a robust and global framework for estimating camera poses (both rotations and locations). Our core innovation is a location solver that adapts message-passing least squares (MPLS) – originally developed for group synchronization – to camera location estimation. We modify MPLS to emphasize cycle-consistent information, redefine cycle consistencies using estimated distances from previous iterations, and incorporate a Welsch-type robust loss. We establish the strongest known deterministic exact-recovery guarantee for camera location estimation, showing that cycle consistency alone – without access to inter-camera distances – suffices to achieve the lowest sample complexity currently known. To further enhance robustness, we introduce a plug-and-play outlier rejection module inspired by robust subspace recovery, and we fully integrate cycle consistency into MPLS for rotation synchronization. Our global approach avoids the need for bundle adjustment. Experiments on synthetic and real datasets show that Cycle-Sync consistently outperforms leading pose estimators, including full structure-from-motion pipelines with bundle adjustment.
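Two ingredients named in the abstract, cycle consistency and a Welsch-type robust loss, can be illustrated in a much simpler setting. The sketch below uses planar rotations (single angles) instead of 3D camera poses and the standard Welsch reweighting function w(r) = exp(-(r/c)^2); the scale `c` and all function names are assumptions for illustration, and this is not the MPLS-based solver of the paper.

```python
import math

def wrap_angle(a):
    """Map an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))

def cycle_error(t_ij, t_jk, t_ki):
    """Inconsistency of a 3-cycle of relative planar rotations.
    Noise-free measurements compose to the identity, i.e. sum to 0 mod 2*pi."""
    return abs(wrap_angle(t_ij + t_jk + t_ki))

def welsch_weight(residual, c=0.5):
    """Standard Welsch reweighting: near 1 for small residuals,
    decaying smoothly to 0 for outliers. c is a hypothetical scale."""
    return math.exp(-(residual / c) ** 2)

# Absolute angles 0.0, 0.5, 1.2 give relative angles 0.5, 0.7, -1.2.
clean = cycle_error(0.5, 0.7, -1.2)
# Corrupting the middle measurement makes the cycle inconsistent.
corrupted = cycle_error(0.5, 2.0, -1.2)

w_clean = welsch_weight(clean)          # close to 1: cycle is trusted
w_corrupted = welsch_weight(corrupted)  # close to 0: cycle is down-weighted
```

The point of the example is only that cycle inconsistency can flag a corrupted measurement without knowing the ground-truth poses, which is the kind of information the abstract says Cycle-Sync emphasizes.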
[93] GAFD-CC: Global-Aware Feature Decoupling with Confidence Calibration for OOD Detection
Kun Zou, Yongheng Xu, Jianxing Yu, Yan Pan, Jian Yin, Hanjiang Lai
Main category: cs.CV
TL;DR: GAFD-CC is a novel post-hoc OOD detection method that performs global-aware feature decoupling guided by classification weights and fuses decoupled features with multi-scale logit-based confidence for robust OOD detection.
Details
Motivation: Existing post-hoc OOD detection methods overlook the inherent correlation between features and logits, which is crucial for effective OOD detection. This limitation hinders their ability to refine decision boundaries and achieve discriminative performance.
Method: GAFD-CC performs global-aware feature decoupling by aligning features with global classification weights to extract positively correlated features (for ID/OOD boundary refinement) and negatively correlated features (to suppress false positives). It then adaptively fuses these decoupled features with multi-scale logit-based confidence.
Result: Extensive experiments on large-scale benchmarks demonstrate GAFD-CC’s competitive performance and strong generalization ability compared to state-of-the-art methods.
Conclusion: GAFD-CC effectively addresses the limitation of existing methods by leveraging the correlation between features and logits, achieving superior OOD detection performance through feature decoupling and confidence calibration.
Abstract: Out-of-distribution (OOD) detection is paramount to ensuring the reliability and robustness of learning models in real-world applications. Existing post-hoc OOD detection methods detect OOD samples by leveraging their features and logits information without retraining. However, they often overlook the inherent correlation between features and logits, which is crucial for effective OOD detection. To address this limitation, we propose Global-Aware Feature Decoupling with Confidence Calibration (GAFD-CC). GAFD-CC aims to refine decision boundaries and increase discriminative performance. Firstly, it performs global-aware feature decoupling guided by classification weights. This involves aligning features with the direction of global classification weights to decouple them. From this, GAFD-CC extracts two types of critical information: positively correlated features that promote in-distribution (ID)/OOD boundary refinement and negatively correlated features that suppress false positives and tighten these boundaries. Secondly, it adaptively fuses these decoupled features with multi-scale logit-based confidence for comprehensive and robust OOD detection. Extensive experiments on large-scale benchmarks demonstrate GAFD-CC’s competitive performance and strong generalization ability compared to those of state-of-the-art methods.
[94] M3PD Dataset: Dual-view Photoplethysmography (PPG) Using Front-and-rear Cameras of Smartphones in Lab and Clinical Settings
Jiankai Tang, Tao Zhang, Jia Li, Yiru Zhang, Mingyu Zhang, Kegang Wang, Yuming Hao, Bolin Wang, Haiyang Li, Xingyao Wang, Yuanchun Shi, Yuntao Wang, Sichong Qian
Main category: cs.CV
TL;DR: M3PD is the first dual-view mobile photoplethysmography dataset with synchronized facial and fingertip videos from 60 participants (47 cardiovascular patients), enabling F3Mamba model that reduces heart-rate error by 21.9-30.2% over single-view methods.
Details
Motivation: Current portable physiological monitoring methods have accessibility limitations and reliability issues with motion artifacts and lighting variations. Video-based smartphone photoplethysmography faces challenges with single-view constraints and lacks datasets for cross-device accuracy validation.
Method: Created M3PD dataset with synchronized dual-view videos (facial and fingertip) captured via front and rear smartphone cameras. Proposed F3Mamba model that fuses facial and fingertip views using Mamba-based temporal modeling.
Result: F3Mamba reduces heart-rate error by 21.9% to 30.2% compared to existing single-view baselines, demonstrating improved robustness in challenging real-world scenarios.
Conclusion: The dual-view approach with synchronized facial and fingertip videos significantly improves the reliability and accuracy of mobile photoplethysmography for cardiovascular monitoring, making it more practical for real-world applications.
Abstract: Portable physiological monitoring is essential for early detection and management of cardiovascular disease, but current methods often require specialized equipment that limits accessibility or impose impractical postures that patients cannot maintain. Video-based photoplethysmography on smartphones offers a convenient noninvasive alternative, yet it still faces reliability challenges caused by motion artifacts, lighting variations, and single-view constraints. Few studies have demonstrated reliable application to cardiovascular patients, and no widely used open datasets exist for cross-device accuracy. To address these limitations, we introduce the M3PD dataset, the first publicly available dual-view mobile photoplethysmography dataset, comprising synchronized facial and fingertip videos captured simultaneously via front and rear smartphone cameras from 60 participants (including 47 cardiovascular patients). Building on this dual-view setting, we further propose F3Mamba, which fuses the facial and fingertip views through Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to 30.2 percent over existing single-view baselines while improving robustness in challenging real-world scenarios. Data and code: https://github.com/Health-HCI-Group/F3Mamba.
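For readers unfamiliar with how a heart rate is read off a PPG trace, the sketch below shows a classical single-view baseline: pick the strongest spectral peak inside a plausible heart-rate band. It is not F3Mamba; the band limits (0.7 to 4.0 Hz), the frequency step, and the synthetic test signal are assumptions made for illustration.

```python
import math

def estimate_heart_rate_bpm(signal, fs, lo_hz=0.7, hi_hz=4.0, step_hz=0.01):
    """Estimate heart rate as the dominant frequency of a PPG trace.

    signal: list of samples (e.g. mean green-channel intensity per frame)
    fs: sampling rate in Hz (video frame rate)
    The band [lo_hz, hi_hz] is a hypothetical choice (42 to 240 bpm).
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    best_f, best_power = lo_hz, -1.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for k in range(steps + 1):
        f = lo_hz + k * step_hz
        # Power of the signal at frequency f (direct Fourier sum).
        re = sum(x * math.cos(2 * math.pi * f * i / fs) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * f * i / fs) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return 60.0 * best_f

# Synthetic 10 s trace at 30 fps: a 1.2 Hz pulse (72 bpm) plus a weaker harmonic.
fs = 30.0
trace = [math.sin(2 * math.pi * 1.2 * i / fs) + 0.3 * math.sin(2 * math.pi * 2.4 * i / fs)
         for i in range(300)]
bpm = estimate_heart_rate_bpm(trace, fs)
```

Real recordings add motion and lighting artifacts that can move or bury this peak, which is the failure mode the abstract attributes to single-view methods.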
[95] CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
Main category: cs.CV
TL;DR: CoCoVa introduces continuous cross-modal reasoning for VLMs using latent thought vectors, improving accuracy and efficiency over discrete token-based approaches.
Details
Motivation: Current VLMs are constrained by discrete linguistic tokens, which bottleneck the rich, high-dimensional nature of visual perception and cannot capture tacit thought processes beyond verbal expression.
Method: Uses iterative reasoning cycle with Latent Q-Former (LQ-Former) to refine latent thought vectors through cross-modal fusion, plus token selection for salient regions and multi-task training with contrastive learning and diffusion-based reconstruction.
Result: Improves accuracy and token efficiency over baselines; 1.5B model competes with 7B-9B models, and 7B version remains competitive with SOTA. Latent space captures interpretable reasoning patterns.
Conclusion: CoCoVa successfully bridges the gap between discrete language processing and continuous visual understanding, demonstrating the potential of continuous cross-modal reasoning in VLMs.
Abstract: In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
[96] RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng, Xingjian Wei, Chengjin Liu, Junyuan Gao, Yubin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He
Main category: cs.CV
TL;DR: RxnCaption converts chemical reaction diagram parsing into an image captioning task using Large Vision-Language Models, achieving state-of-the-art performance with a novel BBox and Index as Visual Prompt strategy.
Details
Motivation: Existing chemical reaction data in papers exist as non-machine-readable images, preventing their use for training machine learning models in chemistry AI research.
Method: Reformulates reaction diagram parsing as image captioning using LVLMs, introduces BIVP strategy with MolYOLO detector to pre-draw molecular bounding boxes and indices on input images, simplifying parsing to natural-language description.
Result: Achieves state-of-the-art performance on multiple metrics, constructs RxnCaption-11k dataset (10x larger than prior benchmarks) with balanced test subset across layout archetypes.
Conclusion: The method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry.
Abstract: Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed “BBox and Index as Visual Prompt” (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
[97] Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds
Leon Schwarzer, Matthias Zeller, Daniel Casado Herraez, Simon Dierl, Michael Heidingsfeld, Cyrill Stachniss
Main category: cs.CV
TL;DR: Self-supervised moving object segmentation for sparse radar point clouds using contrastive learning with cluster refinement to reduce annotation requirements.
Details
Motivation: Radar sensors provide direct Doppler velocity measurements for single-scan moving object segmentation, but radar point clouds are sparse and noisy, making supervised annotation tedious and costly.
Method: Two-step approach: contrastive self-supervised representation learning with cluster refinement using dynamic points removal, followed by supervised fine-tuning with limited annotated data.
Result: Method improves label efficiency after fine-tuning and boosts state-of-the-art performance through self-supervised pretraining.
Conclusion: Self-supervised pretraining enables effective moving object segmentation from sparse radar data with reduced annotation requirements.
Abstract: Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point’s Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.
[98] DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding
Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang
Main category: cs.CV
TL;DR: Introduces DetectiumFire, a large-scale multi-modal dataset with 22.5k fire images and 2.5k fire videos to address the lack of fire domain data for AI models.
Details
Motivation: Current multi-modal models struggle with fire domain applications due to limited publicly available datasets with high-quality fire annotations.
Method: Created a comprehensive dataset with high-resolution fire images and videos annotated with computer vision labels and detailed textual prompts.
Result: Dataset enables improved performance in object detection, diffusion-based image generation, and vision-language reasoning tasks for fire-related applications.
Conclusion: DetectiumFire advances fire-related AI research and supports intelligent safety system development, with the dataset publicly released to the community.
Abstract: Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890
[99] A Novel Grouping-Based Hybrid Color Correction Algorithm for Color Point Clouds
Kuo-Liang Chung, Ting-Chung Tang
Main category: cs.CV
TL;DR: A grouping-based hybrid color correction algorithm for color point clouds that adaptively partitions points into proximity groups and applies different correction methods (KBI, JKHE, HE) based on overlapping rate estimation.
Details
Motivation: Color consistency correction is fundamental for 3D rendering and compression, but most existing methods focus on color images rather than point clouds, creating a need for specialized point cloud color correction.
Method: Estimates overlapping rate between source and target point clouds, adaptively partitions target points into 2-3 proximity groups (close, moderate, distant), and applies K-nearest neighbors bilateral interpolation (KBI), joint KBI with histogram equalization (JKHE), or histogram equalization (HE) respectively for each group.
Result: Tested on 1086 color point cloud pairs and demonstrated superior color consistency correction compared to state-of-the-art methods.
Conclusion: The proposed grouping-based hybrid approach effectively addresses color consistency in point clouds through adaptive grouping and specialized correction methods for different proximity levels.
Abstract: Color consistency correction for color point clouds is a fundamental task in 3D rendering and compression applications. Most previous color correction methods were designed for color images. The purpose of this paper is to propose a grouping-based hybrid color correction algorithm for color point clouds. Our algorithm begins by estimating the overlapping rate between the aligned source and target point clouds, and then adaptively partitions the target points into two groups, namely the close proximity group Gcl and the moderate proximity group Gmod, or three groups, namely Gcl, Gmod, and the distant proximity group Gdist, when the estimated overlapping rate is low or high, respectively. To correct color for target points in Gcl, a K-nearest neighbors based bilateral interpolation (KBI) method is proposed. To correct color for target points in Gmod, a joint KBI and histogram equalization (JKHE) method is proposed. For target points in Gdist, a histogram equalization (HE) method is proposed for color correction. Finally, we discuss the grouping-effect free property and the ablation study of our algorithm. The color consistency correction benefit of our algorithm is demonstrated on 1086 testing color point cloud pairs against state-of-the-art methods. The C++ source code of our algorithm can be accessed from the website: https://github.com/ivpml84079/Point-cloud-color-correction.
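As background for the HE component, the sketch below implements textbook histogram equalization on a single 8-bit color channel of a list of points. How the paper adapts HE to align target colors with the source cloud, and how KBI and JKHE are defined, is not stated in the abstract, so this shows only the generic technique; the function name is illustrative.

```python
def equalize_channel(values, levels=256):
    """Textbook histogram equalization for one integer color channel.

    values: per-point intensities in [0, levels - 1]
    Returns remapped intensities spread over the full range.
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    n = len(values)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # All points share one intensity; there is nothing to spread out.
        return list(values)
    return [round((cdf[v] - cdf_min) * (levels - 1) / (n - cdf_min)) for v in values]

# A narrow band of red-channel values is stretched across the full range.
red = [50, 50, 51, 52, 53]
equalized_red = equalize_channel(red)
```

Applying the same mapping independently to the R, G, and B channels is the simplest variant; it preserves the rank order of intensities within each channel.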
[100] UniChange: Unifying Change Detection with Multimodal Large Language Model
Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li
Main category: cs.CV
TL;DR: UniChange is the first MLLM-based unified change detection model that integrates both binary change detection (BCD) and semantic change detection (SCD) tasks using special tokens and text prompts, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: Current change detection models are limited to single-type annotated data and cannot leverage diverse datasets, leading to poor generalization and limited versatility.
Method: Leverages Multimodal Large Language Models (MLLMs) with language priors, introduces three special tokens ([T1], [T2], [CHANGE]), and uses text prompts to guide change category identification without predefined classification heads.
Result: Achieves SOTA performance on four benchmarks: WHU-CD (90.41 IoU), S2Looking (53.04 IoU), LEVIR-CD+ (78.87 IoU), and SECOND (57.62 IoU), surpassing all previous methods.
Conclusion: UniChange successfully unifies BCD and SCD tasks, enables knowledge acquisition from multi-source datasets with conflicting class definitions, and demonstrates superior generalization capabilities through MLLM integration.
Abstract: Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.
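The results above are reported as IoU scores. As a reminder of what that metric measures, the sketch below computes intersection over union for a pair of binary change masks. The exact evaluation protocol used in the paper (for example, how classes are averaged on SECOND) is not given in the abstract, so this covers only the basic binary case; the function name is illustrative.

```python
def binary_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists.

    A pixel counts toward the intersection if both masks mark it as changed,
    and toward the union if at least one does.
    """
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Convention chosen here: two empty masks agree perfectly.
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou_percent = 100.0 * binary_iou(pred, target)  # 2 shared pixels out of 4 marked
```

Scores such as 90.41 on WHU-CD are most naturally read as this ratio expressed in percent.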
[101] Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs
Arya Shah, Vaibhav Tripathi
Main category: cs.CV
TL;DR: Self-supervised Vision Transformers (DINO) achieve the highest representational alignment between feline and human visual systems compared to CNNs and other ViT variants, with alignment peaking at early network layers.
Details
Motivation: To understand how species-specific ocular anatomy differences (like cats' vertically elongated pupils) manifest in visual representations and identify models that best bridge feline-human visual system differences.
Method: Used a frozen-encoder benchmark with layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis across various models including CNNs, supervised ViTs, windowed transformers, and self-supervised ViTs (DINO).
Result: DINO ViT-B/16 achieved the highest alignment (mean CKA-RBF ≈0.814, mean CKA-linear ≈0.745, mean RSA ≈0.698), with alignment peaking at early blocks. Supervised ViTs showed weaker geometric correspondence than DINO despite competitive CKA scores.
Conclusion: Self-supervision combined with ViT inductive biases produces representational geometries that better align feline and human visual systems than CNNs and windowed Transformers, providing neuroscientific insights about cross-species visual computation convergence.
Abstract: Cats and humans differ in ocular anatomy. Most notably, Felis catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF $\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.
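The CKA-linear scores quoted above come from a standard similarity measure between two sets of activations. The sketch below computes linear CKA from centered Gram matrices in pure Python; it follows the usual definition and is not the authors' released code. The toy matrices and function names are made up for illustration.

```python
import math

def center_columns(X):
    """Subtract the per-feature mean from every row of X (n samples x d features)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def gram(X):
    """Linear-kernel Gram matrix X X^T (n x n)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]

def frob_inner(A, B):
    """Frobenius inner product: sum of elementwise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def linear_cka(X, Y):
    """Linear CKA between two representations of the same n inputs.
    Feature dimensions of X and Y may differ."""
    K = gram(center_columns(X))
    L = gram(center_columns(Y))
    return frob_inner(K, L) / math.sqrt(frob_inner(K, K) * frob_inner(L, L))

# Four inputs, two features each.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [5.0, 5.0]]
# Y is X rotated by 90 degrees and scaled by 3: the same geometry in another basis.
Y = [[-3.0 * y, 3.0 * x] for x, y in X]
# Z is a different, one-dimensional representation of the same inputs.
Z = [[1.0], [-1.0], [1.0], [-1.0]]

same_geometry = linear_cka(X, Y)       # equals 1 up to rounding
different_geometry = linear_cka(X, Z)  # strictly between 0 and 1
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling of the features, which is why it can compare layers of different networks whose units have no natural correspondence.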
[102] VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang
Main category: cs.CV
TL;DR: VCode introduces SVG code as a visual representation for multimodal understanding, creating a benchmark that reframes visual tasks as code generation and proposing VCoder framework to improve VLM performance on visual-centric coding.
Details
Motivation: Current progress in code-based reasoning focuses on language-centric tasks, leaving visual-centric coding underexplored. The paper advocates for SVG code as a compact, interpretable visual representation inspired by human sketch reasoning.
Method: VCode benchmark covers three domains (commonsense, professional disciplines, visual perception) and proposes CodeVQA evaluation protocol. VCoder framework augments VLMs with iterative revision thinking and visual tools like detectors and parsers.
Result: Frontier VLMs struggle with faithful SVG generation, revealing gaps in visual-centric coding. VCoder achieves 12.3-point overall gain over Claude-4-Opus. Both humans and VLMs perform worse on rendered SVGs but show consistent performance.
Conclusion: SVG code shows promise as symbolic visual representation. The gap between language-centric and visual-centric coding persists, but VCoder’s agentic framework effectively bridges this gap through iterative refinement and visual tool integration.
Abstract: Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model’s intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
[103] IllumFlow: Illumination-Adaptive Low-Light Enhancement via Conditional Rectified Flow and Retinex Decomposition
Wenyang Wei, Yang yang, Xixi Jia, Xiangchu Feng, Weiwei Wang, Renzhen Wang
Main category: cs.CV
TL;DR: IllumFlow combines conditional Rectified Flow with Retinex theory for low-light image enhancement, separately optimizing illumination and reflectance components to handle lighting variations and noise.
Details
Motivation: To address the challenges of low-light image enhancement, including wide dynamic range of illumination variations and complex noise in reflectance components, while preserving color fidelity.
Method: Decomposes images into reflectance/illumination components using Retinex theory, uses conditional rectified flow for illumination modeling, and employs a denoising network with flow-derived data augmentation for reflectance noise removal.
Result: Superior quantitative and qualitative performance in low-light enhancement and exposure correction compared to existing methods, with precise illumination adaptation and customizable brightness enhancement.
Conclusion: IllumFlow effectively handles both lighting variations and noise in low-light images through synergistic integration of conditional rectified flow and Retinex theory, achieving state-of-the-art enhancement results.
Abstract: We present IllumFlow, a novel framework that synergizes conditional Rectified Flow (CRF) with Retinex theory for low-light image enhancement (LLIE). Our model addresses low-light enhancement through separate optimization of illumination and reflectance components, effectively handling both lighting variations and noise. Specifically, we first decompose an input image into reflectance and illumination components following Retinex theory. To model the wide dynamic range of illumination variations in low-light images, we propose a conditional rectified flow framework that represents illumination changes as a continuous flow field. While complex noise primarily resides in the reflectance component, we introduce a denoising network, enhanced by flow-derived data augmentation, to remove reflectance noise and chromatic aberration while preserving color fidelity. IllumFlow enables precise illumination adaptation across lighting conditions while naturally supporting customizable brightness enhancement. Extensive experiments on low-light enhancement and exposure correction demonstrate superior quantitative and qualitative performance over existing methods.
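The two ingredients named in the abstract above can be illustrated with a small sketch, assuming the textbook forms of each: Retinex models an image as reflectance times illumination, and rectified flow trains a velocity field along straight lines between paired samples. The max-over-channels illumination estimate, the 4x brightening target, and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def retinex_decompose(image: np.ndarray, eps: float = 1e-6):
    """Split an HxWx3 image into reflectance and illumination (image ~= R * L)."""
    # Assumption: illumination is estimated as the per-pixel max over channels,
    # a common simple heuristic; the paper's own decomposition may differ.
    illumination = image.max(axis=-1, keepdims=True)
    reflectance = image / (illumination + eps)
    return reflectance, illumination


def rectified_flow_pair(x0: np.ndarray, x1: np.ndarray, t: float):
    """Point on the straight path from x0 to x1 and its constant target velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0  # regression target for a velocity network v(x_t, t, condition)
    return x_t, velocity


rng = np.random.default_rng(0)
low = rng.uniform(0.01, 0.2, size=(4, 4, 3))       # toy low-light image
refl, illum_low = retinex_decompose(low)
illum_target = np.clip(illum_low * 4.0, 0.0, 1.0)  # hypothetical well-lit illumination
illum_mid, v = rectified_flow_pair(illum_low, illum_target, t=0.5)
relit = refl * illum_mid                           # recombine at an intermediate brightness
```

Stopping the path at an intermediate `t` is one way a flow-based formulation could expose adjustable brightness; whether IllumFlow does exactly this is not stated in the abstract.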
[104] ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension
Duo Xu, Hao Cheng, Xin Lin, Zhen Xie, Hao Wang
Main category: cs.CV
TL;DR: Proposes an automated code-driven pipeline to generate the ChartM³ dataset for complex chart understanding, improving MLLMs’ reasoning capabilities through SFT and RL.
Details
Motivation: Current MLLMs have limited coverage of complex chart scenarios and computation-intensive reasoning tasks found in real-world applications.
Method: Multi-stage code-driven pipeline using RAG for chart templates and CoT strategies for reasoning codes, generating diverse charts and statistical computations.
Result: Created ChartM³ dataset with 38K charts, 142K Q&A pairs for training, and 2,871 evaluation samples. SFT and RL experiments show improved reasoning and cross-domain generalization.
Conclusion: The dataset enables smaller models to achieve performance comparable to larger models in complex chart comprehension, addressing limitations in current chart understanding research.
Abstract: Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
[105] Synthetic Crop-Weed Image Generation and its Impact on Model Generalization
Garen Boyadjian, Cyrille Pierre, Johann Laconte, Riccardo Bertoglio
Main category: cs.CV
TL;DR: A pipeline for generating synthetic crop-weed images using Blender reduces annotation costs for agricultural weeding robots, reaching a sim-to-real gap of 10% and surpassing previous state-of-the-art methods.
Details
Motivation: Training deep learning models for crop-weed segmentation requires large annotated datasets that are costly to obtain in real agricultural fields, creating a need for synthetic data solutions.
Method: Procedural generation of synthetic crop-weed images using Blender, producing annotated datasets under diverse conditions including plant growth stages, weed density, lighting variations, and camera angles.
Result: Training on synthetic images leads to a sim-to-real gap of 10%, surpassing previous state-of-the-art methods, and synthetic data outperforms real datasets in cross-domain generalization.
Conclusion: Synthetic agricultural datasets show strong potential and support hybrid training strategies for more efficient model development in agricultural robotics applications.
Abstract: Precise semantic segmentation of crops and weeds is necessary for agricultural weeding robots. However, training deep learning models requires large annotated datasets, which are costly to obtain in real fields. Synthetic data can reduce this burden, but the gap between simulated and real images remains a challenge. In this paper, we present a pipeline for procedural generation of synthetic crop-weed images using Blender, producing annotated datasets under diverse conditions of plant growth, weed density, lighting, and camera angle. We benchmark several state-of-the-art segmentation models on synthetic and real datasets and analyze their cross-domain generalization. Our results show that training on synthetic images leads to a sim-to-real gap of 10%, surpassing previous state-of-the-art methods. Moreover, synthetic data demonstrates good generalization properties, outperforming real datasets in cross-domain scenarios. These findings highlight the potential of synthetic agricultural datasets and support hybrid strategies for more efficient model training.
[106] From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
Nicolas Schuler, Lea Dewald, Nick Baldig, Jürgen Graf
Main category: cs.CV
TL;DR: Evaluation of small Visual Language Models for scene interpretation and action recognition on edge devices in mobile robotics, focusing on computational efficiency vs. accuracy trade-offs.
Details
Motivation: Large VLMs show strong video understanding capabilities but are computationally intensive, making them unsuitable for edge devices and mobile robotics where low inference time is critical.
Method: Proposed pipeline using state-of-the-art small VLMs evaluated on diverse real-world datasets including cityscapes, campus, and indoor scenarios.
Result: Experimental evaluation reveals the potential of small VLMs for edge deployment, while identifying challenges, weaknesses, model biases, and practical application considerations.
Conclusion: Small VLMs show promise for scene interpretation on edge devices in mobile robotics, though trade-offs between accuracy and inference time remain key challenges that need addressing.
Abstract: Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in their environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
[107] KAO: Kernel-Adaptive Optimization in Diffusion for Satellite Image
Teerapong Panboonyuen
Main category: cs.CV
TL;DR: KAO is a novel framework using Kernel-Adaptive Optimization in diffusion models for satellite image inpainting, achieving state-of-the-art performance on VHR datasets like DeepGlobe and Massachusetts Roads Dataset.
Details
Motivation: To address the challenges of satellite image inpainting for very high-resolution datasets, overcoming limitations of existing preconditioned models (requiring extensive retraining) and postconditioned models (high computational overhead).
Method: Proposes Kernel-Adaptive Optimization with Latent Space Conditioning to optimize a compact latent space, and incorporates Explicit Propagation for forward-backward fusion in the diffusion process.
Result: KAO sets a new benchmark for VHR satellite image restoration, providing a scalable, high-performance solution that balances the efficiency of preconditioned models with the flexibility of postconditioned models.
Conclusion: The proposed KAO framework offers an efficient and accurate approach for satellite image inpainting, demonstrating superior performance on challenging VHR datasets.
Abstract: Satellite image inpainting is a crucial task in remote sensing, where accurately restoring missing or occluded regions is essential for robust image analysis. In this paper, we propose KAO, a novel framework that utilizes Kernel-Adaptive Optimization within diffusion models for satellite image inpainting. KAO is specifically designed to address the challenges posed by very high-resolution (VHR) satellite datasets, such as DeepGlobe and the Massachusetts Roads Dataset. Unlike existing methods that rely on preconditioned models requiring extensive retraining or postconditioned models with significant computational overhead, KAO introduces a Latent Space Conditioning approach, optimizing a compact latent space to achieve efficient and accurate inpainting. Furthermore, we incorporate Explicit Propagation into the diffusion process, facilitating forward-backward fusion, which improves the stability and precision of the method. Experimental results demonstrate that KAO sets a new benchmark for VHR satellite image restoration, providing a scalable, high-performance solution that balances the efficiency of preconditioned models with the flexibility of postconditioned models.
[108] MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer
Taiga Yamane, Satoshi Suzuki, Ryo Masumura, Shotaro Tora
Main category: cs.CV
TL;DR: MVAFormer is a multi-view action recognition method for spatio-temporal action recognition that uses transformer-based cooperation between views while preserving spatial information in feature maps.
Details
Motivation: Previous multi-view action recognition methods focus on single action recognition from entire videos and are not applicable to spatio-temporal action recognition (STAR) where actions are recognized sequentially for each person.
Method: Proposes MVAFormer with a novel transformer-based cooperation module that uses feature maps instead of embedding vectors to preserve spatial information, and divides self-attention for same and different views.
Result: Outperforms comparison baselines by approximately 4.4 points on F-measure using a newly collected dataset.
Conclusion: MVAFormer effectively addresses multi-view action recognition in the STAR setting through spatial-preserving feature map cooperation and view-aware attention mechanisms.
Abstract: Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition (STAR) setting, in which each person’s action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with lost spatial information, our module utilizes the feature map for effective cooperation in the STAR setting, which preserves the spatial information. Furthermore, in our module, we divide the self-attention for the same and different views to model the relationship between multiple views effectively. The results of experiments using a newly collected dataset demonstrate that MVAFormer outperforms the comparison baselines by approximately $4.4$ points on the F-measure.
[109] OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
Xilong Zhou, Jianchun Chen, Pramod Rao, Timo Teufel, Linjie Lyu, Tigran Minasian, Oleksandr Sotnychenko, Xiaoxiao Long, Marc Habermann, Christian Theobalt
Main category: cs.CV
TL;DR: OLATverse is a large-scale dataset with 9M images of 765 real-world objects captured under controlled lighting, providing comprehensive resources for inverse rendering and relighting research.
Details
Motivation: To address the limitation of existing methods that rely on synthetic datasets and small-scale real-world data, which restricts realism and generalization in object-centric inverse rendering and relighting.
Method: Captured 765 real-world objects using 35 DSLR cameras and 331 individually controlled light sources, providing well-calibrated camera parameters, object masks, photometric surface normals, and diffuse albedo.
Result: Created the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation with extensive evaluation set and auxiliary resources.
Conclusion: OLATverse represents a pivotal step toward integrating next-generation inverse rendering and relighting methods with real-world data, bridging the gap between synthetic training and real-world application.
Abstract: We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data. The full dataset, along with all post-processing workflows, will be publicly released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.
[110] Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization
Tao Liu, Kan Ren, Qian Chen
Main category: cs.CV
TL;DR: A cross-view UAV localization framework using object detection and graph neural networks for accurate map matching in GNSS-denied environments.
Details
Motivation: To address UAV localization challenges in GNSS-denied areas where satellite-based methods fail, particularly handling cross-temporal, cross-view, and heterogeneous aerial image matching.
Method: Leverages modern object detection to extract salient instances from UAV and satellite images, integrates graph neural network to reason about inter-image and intra-image node relationships, and uses fine-grained graph-based node-similarity metric.
Result: Achieves strong retrieval and localization performance on public and real-world datasets, effectively handles heterogeneous appearance differences and generalizes well to scenarios with larger modality gaps like infrared-visible image matching.
Conclusion: The proposed object detection and graph neural network approach provides an effective solution for cross-view UAV localization that outperforms existing methods and generalizes to various modality gaps.
Abstract: With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, in GNSS-denied areas, satellite-based localization methods are prone to failure. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aimed at effectively addressing cross-temporal, cross-view, heterogeneous aerial image matching. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it to a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences using polar-coordinate reprojection, perspective transformations, or generative adversarial networks; however, they can suffer from misalignment, content loss, and limited realism. In contrast, we leverage modern object detection to accurately extract salient instances from UAV and satellite images, and integrate a graph neural network to reason about inter-image and intra-image node relationships. Using a fine-grained, graph-based node-similarity metric, our method achieves strong retrieval and localization performance. Extensive experiments on public and real-world datasets show that our approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. Our dataset will be publicly available at the following URL: https://github.com/liutao23/ODGNNLoc.git.
[111] Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes
Robinson Umeike, Neil Getty, Yin Xiangyu, Yi Jiang
Main category: cs.CV
TL;DR: PtychoBench benchmark compares SFT vs ICL for adapting foundation models to microscopy tasks, finding optimal strategy depends on task modality - SFT+ICL works best for visual tasks, while ICL alone excels for textual tasks.
Details
Motivation: To determine optimal domain adaptation strategies for foundation models in specialized scientific microscopy workflows, as general-purpose models need effective specialization for scientific tasks.
Method: Created PtychoBench multi-modal benchmark to systematically compare Supervised Fine-Tuning (SFT) and In-Context Learning (ICL) strategies for visual artifact detection (VLMs) and textual parameter recommendation (LLMs) in data-scarce conditions.
Result: For visual tasks: SFT+ICL achieved best performance (Micro-F1 0.728). For textual tasks: ICL on large base model was superior (Micro-F1 0.847), outperforming SFT models. Context-aware prompting was consistently better, and fine-tuned models showed contextual interference.
Conclusion: Optimal specialization path depends on task modality - offering a framework for developing effective AI science agents, with ICL excelling for textual tasks and combined SFT+ICL for visual tasks.
Abstract: The automation of workflows in advanced microscopy is a key goal where foundation models like Large Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful “super-expert” SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.
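The Micro-F1 scores quoted above pool true positives, false positives, and false negatives over all labels before computing F1. The sketch below shows that standard calculation for a multi-label setting. Whether PtychoBench is scored exactly this way is an assumption, and the label names are made up for illustration.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over per-sample label sets."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        true_set, pred_set = set(true_labels), set(pred_labels)
        tp += len(true_set & pred_set)  # labels predicted and present
        fp += len(pred_set - true_set)  # labels predicted but absent
        fn += len(true_set - pred_set)  # labels present but missed
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


# Hypothetical artifact labels for three samples.
gold = [{"blur", "drift"}, {"noise"}, set()]
pred = [{"blur"}, {"noise", "drift"}, set()]
score = micro_f1(gold, pred)  # tp=2, fp=1, fn=1, so 4/6
```

Because counts are pooled before averaging, frequent labels weigh more than rare ones, which is worth keeping in mind when comparing the 0.728 and 0.847 figures across two different tasks.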
[112] ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing
Yaosen Chen, Wei Wang, Xuming Wen, Han Yang, Yanru Zhang
Main category: cs.CV
TL;DR: An energy-based optimization method for video shot assembly that learns from reference videos to automatically arrange shots according to specific narrative and artistic styles.
Details
Motivation: Traditional shot assembly is manually done by experienced editors, and current automated video editing technologies fail to capture creators' unique artistic expression in shot arrangement.
Method: Uses visual-semantic matching between LLM-generated scripts and video library, extracts shot attributes (size, motion, semantics), employs energy-based models to learn from reference videos, and combines syntax rules for optimization.
Result: The system can automatically arrange independent shots into coherent visual sequences that align with reference video styles, enabling even inexperienced users to create visually compelling videos.
Conclusion: The proposed energy-based shot assembly method successfully automates video editing while preserving artistic expression and assembly styles from reference videos.
Abstract: Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator’s unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: https://sobeymil.github.io/esa.com
[113] Keeping it Local, Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems
Nicolas Schuler, Lea Dewald, Jürgen Graf
Main category: cs.CV
TL;DR: A pipeline for automated report generation in mobile robotics using local multi-modal sensors and edge computing to preserve privacy.
Details
Motivation: The need to evaluate large amounts of heterogeneous data from cognitive systems in critical applications like autonomous driving and service robotics, requiring automated reporting to facilitate system evaluation and acceptance.
Method: Proposes a pipeline using local models deployed on edge computing devices with multi-modal sensors, eliminating need for external services while preserving privacy.
Result: Evaluated implementation on diverse dataset spanning indoor, outdoor and urban environments with both quantitative and qualitative results; example reports available in public repository.
Conclusion: The proposed local model approach enables privacy-preserving automated report generation for mobile robotics across various domains without relying on external services.
Abstract: Recent advancements in Deep Learning enable hardware-based cognitive systems, that is, mechatronic systems in general and robotics in particular with integrated Artificial Intelligence, to interact with dynamic and unstructured environments. While the results are impressive, the application of such systems to critical tasks like autonomous driving as well as service and care robotics necessitates the evaluation of large amounts of heterogeneous data. Automated report generation for Mobile Robotics can play a crucial role in facilitating the evaluation and acceptance of such systems in various domains. In this paper, we propose a pipeline for generating automated reports in natural language utilizing various multi-modal sensors that solely relies on local models capable of being deployed on edge computing devices, thus preserving the privacy of all actors involved and eliminating the need for external services. In particular, we evaluate our implementation on a diverse dataset spanning multiple domains including indoor, outdoor and urban environments, providing quantitative as well as qualitative evaluation results. Various generated example reports and other supplementary materials are available via a public repository.
[114] LiteVoxel: Low-memory Intelligent Thresholding for Efficient Voxel Rasterization
Jee Won Lee, Jongseong Brad Choi
Main category: cs.CV
TL;DR: LiteVoxel is a self-tuning training pipeline for sparse-voxel rasterization that improves stability, reduces VRAM usage by 40-60%, and preserves low-frequency details while maintaining comparable performance metrics.
Details
Motivation: Address limitations of sparse-voxel rasterization including underfitting low-frequency content, dependency on brittle pruning heuristics, and VRAM overgrowth issues.
Method: Uses inverse-Sobel reweighting with mid-training gamma-ramp for low-frequency awareness, depth-quantile pruning logic with EMA-hysteresis guards, and ray-footprint-based priority-driven subdivision under explicit growth budget.
Result: Reduces peak VRAM by 40-60%, mitigates errors in low-frequency regions and boundary instability while maintaining comparable PSNR/SSIM, training time, and FPS to SVRaster pipeline.
Conclusion: LiteVoxel enables more predictable, memory-efficient training without sacrificing perceptual quality by preserving low-frequency detail that prior methods miss.
Abstract: Sparse-voxel rasterization is a fast, differentiable alternative for optimization-based scene reconstruction, but it tends to underfit low-frequency content, depends on brittle pruning heuristics, and can overgrow in ways that inflate VRAM. We introduce LiteVoxel, a self-tuning training pipeline that makes SV rasterization both steadier and lighter. Our loss is made low-frequency aware via an inverse-Sobel reweighting with a mid-training gamma-ramp, shifting gradient budget to flat regions only after geometry stabilizes. Adaptation replaces fixed thresholds with a depth-quantile pruning logic on maximum blending weight, stabilized by EMA-hysteresis guards, and refines structure through ray-footprint-based, priority-driven subdivision under an explicit growth budget. Ablations and full-system results across Mip-NeRF 360 (6 scenes) and Tanks & Temples (3 scenes) datasets show mitigation of errors in low-frequency regions and boundary instability while keeping PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline. Crucially, LiteVoxel reduces peak VRAM by ~40%-60% and preserves low-frequency detail that prior setups miss, enabling more predictable, memory-efficient training without sacrificing perceptual quality.
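The abstract does not give the formula behind the inverse-Sobel reweighting or the gamma-ramp, so the following is only a sketch of one plausible reading: per-pixel loss weights grow where the Sobel gradient magnitude is small (flat regions), and a linear ramp switches that emphasis on partway through training. The weight formula `1 + gamma * (1 - normalized_magnitude)`, the linear ramp, and every function and parameter name are assumptions for illustration, not LiteVoxel's implementation.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of floats

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(img: Image) -> Image:
    """Sobel gradient magnitude, replicating border pixels at the edges."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    gx += SOBEL_X[dy + 1][dx + 1] * img[yy][xx]
                    gy += SOBEL_Y[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def gamma_ramp(step: int, start: int, end: int, gamma_max: float = 1.0) -> float:
    """Hypothetical linear ramp: 0 before `start`, `gamma_max` after `end`."""
    if step <= start:
        return 0.0
    if step >= end:
        return gamma_max
    return gamma_max * (step - start) / (end - start)


def inverse_sobel_weights(img: Image, gamma: float) -> Image:
    """Per-pixel loss weights that favor flat (low-gradient) regions.

    Assumed form: weight = 1 + gamma * (1 - magnitude / max_magnitude).
    With gamma = 0 every pixel has weight 1 (plain, unweighted loss).
    """
    mag = sobel_magnitude(img)
    peak = max(max(row) for row in mag)
    return [
        [1.0 + gamma * (1.0 - (m / peak if peak > 0 else 0.0)) for m in row]
        for row in mag
    ]
```

Under these assumptions, calling `inverse_sobel_weights(img, gamma_ramp(step, start, end))` leaves the loss unweighted early in training and gradually shifts weight toward flat regions once `step` passes `start`.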
[115] Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data
Jessica Plassmann, Nicolas Schuler, Georg von Freymann, Michael Schuth
Main category: cs.CV
TL;DR: This study explores unsupervised learning methods for automated anomaly detection in shearographic images, evaluating three architectures trained solely on defect-free data to reduce reliance on expert interpretation and labeled data.
Details
Motivation: Shearography's industrial adoption is limited by the need for expert interpretation. This research aims to reduce reliance on labeled data and manual evaluation through unsupervised learning methods for automated anomaly detection.
Method: Three unsupervised architectures were evaluated: fully connected autoencoder, convolutional autoencoder, and student-teacher feature matching model. All models were trained on defect-free data, with a controlled dataset developed using a custom specimen with reproducible defect patterns. Two training subsets were used: one with only undistorted defect-free samples, and one additionally including globally deformed defect-free data.
Result: The student-teacher approach achieved superior classification robustness and enabled precise defect localization. It demonstrated improved separability of feature representations compared to autoencoder-based models, as shown through t-SNE embeddings. A YOLOv8 model trained on labeled data served as a reference benchmark.
Conclusion: This study demonstrates the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments, with the student-teacher model showing particular promise for robust anomaly detection and localization.
Abstract: Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality. This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.
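Autoencoder-based anomaly detection commonly flags a sample when a model trained only on defect-free data reconstructs it poorly. The sketch below illustrates that scoring step only; it is not the study's implementation. The `reconstruct` callable stands in for a trained autoencoder, and the mean-squared-error score, the 95% quantile threshold, and all names are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]  # a flattened image


def reconstruction_error(x: Vector, x_hat: Vector) -> float:
    """Mean squared error between an input and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def calibrate_threshold(defect_free_scores: Sequence[float],
                        quantile: float = 0.95) -> float:
    """Pick a decision threshold from scores of held-out defect-free samples."""
    if not defect_free_scores:
        raise ValueError("need at least one defect-free score")
    ordered = sorted(defect_free_scores)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]


def is_anomalous(x: Vector,
                 reconstruct: Callable[[Vector], List[float]],
                 threshold: float) -> bool:
    """Flag a sample whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x, reconstruct(x)) > threshold
```

Calibrating the threshold on defect-free data only keeps the procedure label-free, which matches the unsupervised setting described in the abstract; how the study actually chose its thresholds is not stated here.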
[116] Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction
Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka
Main category: cs.CV
TL;DR: Deep learning models can predict future brain MRI scans from baseline images with high fidelity, enabling participant-specific neurodegenerative disease prognosis.
Details
Motivation: To predict future brain states from baseline MRI scans for studying neurodegenerative diseases like Alzheimer's, moving beyond traditional cognitive score prediction to full image-to-image forecasting.
Method: Implemented and evaluated five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, ODE-UNet) on two longitudinal cohorts (ADNI and AIBL) for MRI image-to-image prediction.
Result: Best performing models achieved high-fidelity predictions, with all models generalizing well to an independent external dataset and demonstrating robust cross-cohort performance.
Conclusion: Deep learning can reliably predict participant-specific brain MRI at voxel level, offering new opportunities for individualized prognosis in neurodegenerative diseases.
Abstract: Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer’s disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant’s entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.
[117] The Urban Vision Hackathon Dataset and Models: Towards Image Annotations and Accurate Vision Models for Indian Traffic
Akash Sharma, Chinmay Mhatre, Sankalp Gawali, Ruthvik Bokkasam, Brij Kishore, Vishwajeet Pattanaik, Tarun Rambha, Abdul R. Pinjari, Vijay Kovvali, Anirban Chakraborty, Punit Rathore, Raghu Krishnapuram, Yogesh Simmhan
Main category: cs.CV
TL;DR: UVH-26 is the first public release by AIM@IISc of a large-scale Indian traffic dataset, with 26,646 images from Bengaluru CCTV cameras and 1.8M bounding boxes across 14 vehicle classes; models trained on it show 8.4-31.5% improvements in mAP50:95 over COCO-trained models.
Details
Motivation: To address the critical gap in existing global benchmarks for Indian traffic scenarios by creating domain-specific training data that captures the heterogeneity of Indian urban mobility.
Method: Collected 26,646 high-resolution images from 2800 Bengaluru CCTV cameras over 4 weeks, annotated through a crowdsourced hackathon with 565 students, using Majority Voting and STAPLE algorithms to derive consensus ground truth.
Result: Models trained on UVH-26 achieved 8.4-31.5% improvements in mAP50:95 over COCO-trained baselines, with RT-DETR-X performing best at 0.67 mAP50:95 compared to 0.40 for COCO on common classes.
Conclusion: Domain-specific training data significantly improves detection performance for Indian traffic scenarios, providing a foundation for advancing intelligent transportation systems in emerging nations with complex traffic conditions.
Abstract: This report describes the UVH-26 dataset, the first public release by AIM@IISc of a large-scale dataset of annotated traffic-camera images from India. The dataset comprises 26,646 high-resolution (1080p) images sampled from 2800 Safe-City CCTV cameras in Bengaluru over a 4-week period, and subsequently annotated through a crowdsourced hackathon involving 565 college students from across India. In total, 1.8 million bounding boxes were labeled across 14 vehicle classes specific to India: Cycle, 2-Wheeler (Motorcycle), 3-Wheeler (Auto-rickshaw), LCV (Light Commercial Vehicles), Van, Tempo-traveller, Hatchback, Sedan, SUV, MUV, Mini-bus, Bus, Truck and Other. Of these, 283k-316k consensus ground truth bounding boxes and labels were derived for distinct objects in the 26k images using Majority Voting and STAPLE algorithms. Further, we train multiple contemporary detectors, including YOLO11-S/X, RT-DETR-S/X, and DAMO-YOLO-T/L using these datasets, and report accuracy based on mAP50, mAP75 and mAP50:95. Models trained on UVH-26 achieve 8.4-31.5% improvements in mAP50:95 over equivalent baseline models trained on the COCO dataset, with RT-DETR-X showing the best performance at 0.67 (mAP50:95) as compared to 0.40 for COCO-trained weights for common classes (Car, Bus, and Truck). This demonstrates the benefits of domain-specific training data for Indian traffic scenarios. The release package provides the 26k images with consensus annotations based on Majority Voting (UVH-26-MV) and STAPLE (UVH-26-ST) and the 6 fine-tuned YOLO and DETR models on each of these datasets. By capturing the heterogeneity of Indian urban mobility directly from operational traffic-camera streams, UVH-26 addresses a critical gap in existing global benchmarks, and offers a foundation for advancing detection, classification, and deployment of intelligent transportation systems in emerging nations with complex traffic conditions.
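Majority voting over crowdsourced bounding boxes can be sketched as follows. This is one plausible reading, not the procedure used to build UVH-26-MV: the greedy IoU clustering, the 0.5 IoU threshold, the averaging of box coordinates, and all names are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Annotation = Tuple[Box, str]             # (box, class label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def majority_vote(per_annotator: List[List[Annotation]],
                  iou_thr: float = 0.5) -> List[Annotation]:
    """Keep objects that more than half of the annotators marked."""
    clusters: List[List[Tuple[int, Annotation]]] = []
    for annotator_id, annotations in enumerate(per_annotator):
        for ann in annotations:
            for cluster in clusters:
                seed_box = cluster[0][1][0]
                already_voted = any(a == annotator_id for a, _ in cluster)
                # Join the first cluster whose seed box overlaps enough,
                # allowing at most one box per annotator in each cluster.
                if not already_voted and iou(seed_box, ann[0]) >= iou_thr:
                    cluster.append((annotator_id, ann))
                    break
            else:
                clusters.append([(annotator_id, ann)])

    consensus: List[Annotation] = []
    for cluster in clusters:
        if 2 * len(cluster) > len(per_annotator):  # strict majority
            label = Counter(lbl for _, (_, lbl) in cluster).most_common(1)[0][0]
            box = tuple(
                sum(b[i] for _, (b, _) in cluster) / len(cluster)
                for i in range(4)
            )
            consensus.append((box, label))
    return consensus
```

In this sketch, a box drawn by only one of three annotators is dropped, while a box drawn by all three keeps the most common class label.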
[118] Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification
Md Rashidunnabi, Kailash A. Hambarde, Vasco Lopes, Joao C. Neves, Hugo Proenca
Main category: cs.CV
TL;DR: MTF-CVReID is a parameter-efficient framework for cross-view video person re-identification that introduces seven complementary modules over a ViT-B/16 backbone to address viewpoint shifts, scale disparities, and temporal inconsistencies.
Details
Motivation: Video-based person re-identification in cross-view domains (aerial-ground surveillance) faces challenges from extreme viewpoint shifts, scale disparities, and temporal inconsistencies, which remain unsolved problems.
Method: Proposes MTF-CVReID with seven modules: Cross-Stream Feature Normalization, Multi-Resolution Feature Harmonization, Identity-Aware Memory Module, Temporal Dynamics Modeling, Inter-View Feature Alignment, Hierarchical Temporal Pattern Learning, and Multi-View Identity Consistency Learning using contrastive learning.
Result: Achieves state-of-the-art performance on AG-VPReID benchmark across all altitude levels with strong cross-dataset generalization to G2A-VReID and MARS datasets, while maintaining real-time efficiency (189 FPS) with only ~2M additional parameters and 0.7 GFLOPs.
Conclusion: Carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency, demonstrating the effectiveness of parameter-efficient approaches for cross-view video ReID.
Abstract: Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID
[119] A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
Jingyu Lu, Haonan Wang, Qixiang Zhang, Xiaomeng Li
Main category: cs.CV
TL;DR: VCFlow is a subject-agnostic brain decoding framework that reconstructs visual experiences from fMRI using hierarchical modeling of the visual cortex, achieving fast reconstruction without per-subject training.
Details
Motivation: To enable clinical applications by overcoming challenges in cross-subject generalization and complex brain signal processing for visual reconstruction from fMRI.
Method: Hierarchical decoding framework modeling ventral-dorsal visual system architecture with feature-level contrastive learning for subject-invariant semantic representations.
Result: Sacrifices only 7% accuracy on average relative to conventional pipelines, which need more than 12 hours of per-subject data, while generating each reconstructed video in 10 seconds without any retraining.
Conclusion: VCFlow provides a fast, clinically scalable solution for subject-agnostic brain decoding with minimal accuracy sacrifice and no retraining requirements.
Abstract: Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.
[120] TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi
Main category: cs.CV
TL;DR: TAUE is a training-free diffusion model framework that enables zero-shot, layer-wise image generation through Noise Transplantation and Cultivation (NTC), producing coherent multi-layered scenes without fine-tuning or datasets.
Details
Motivation: Current text-to-image diffusion models only output flattened images, lacking layer-wise control needed for professional applications. Existing solutions either require inaccessible datasets or can only generate isolated elements, not complete coherent scenes.
Method: Uses Noise Transplantation and Cultivation (NTC) to extract intermediate latent representations from foreground and composite generation processes, then transplants them into initial noise for subsequent layers to ensure semantic and structural coherence.
Result: Achieves performance comparable to fine-tuned methods while maintaining high image quality and fidelity, enabling consistent multi-layered outputs without training requirements.
Conclusion: TAUE eliminates costly training and dataset needs, unlocks novel compositional editing applications, and paves the way for more accessible and controllable generative workflows.
Abstract: Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
[121] Zero-Shot Multi-Animal Tracking in the Wild
Jan Frederik Meier, Timo Lüddecke
Main category: cs.CV
TL;DR: Zero-shot multi-animal tracking using vision foundation models (Grounding DINO + SAM 2) without retraining, achieving strong performance across diverse species and environments.
Details
Motivation: Multi-animal tracking is challenging due to habitat variations, motion patterns, and species appearance differences. Traditional methods require extensive fine-tuning and heuristic design for each scenario.
Method: Combines Grounding DINO object detector with Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics for zero-shot tracking without retraining or hyperparameter adaptation.
Result: Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and GMOT-40 subset demonstrate strong and consistent performance across diverse species and environments.
Conclusion: Vision foundation models enable effective zero-shot multi-animal tracking, eliminating the need for dataset-specific retraining while maintaining robust performance across varied scenarios.
Abstract: Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding DINO object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.
[122] Robust Face Liveness Detection for Biometric Authentication using Single Image
Poulami Raha, Yeongnam Chae
Main category: cs.CV
TL;DR: A lightweight CNN framework for detecting face spoofing attacks (print/display, video, wrap attacks) with fast authentication (1-2 seconds on CPU) and a new dataset of 500+ videos.
Details
Motivation: Face recognition systems are vulnerable to presentation attacks, allowing malicious actors to gain illegitimate access to secure systems.
Method: Proposes a novel lightweight CNN framework for liveness detection that can identify various spoofing attacks including print/display, video, and wrap attacks.
Result: The architecture provides seamless liveness detection with fast biometric authentication (1-2 seconds on CPU) and includes a newly created 2D spoof attack dataset with 500+ videos from 60 subjects.
Conclusion: The proposed robust architecture effectively detects presentation attacks and ensures secure biometric authentication, validated through demonstration video.
Abstract: Biometric technologies are widely adopted in security, legal, and financial systems. Face recognition can authenticate a person based on unique facial features such as shape and texture. However, recent works have demonstrated the vulnerability of Face Recognition Systems (FRS) to presentation attacks. Using spoofing (a.k.a. presentation attacks), a malicious actor can gain illegitimate access to secure systems. This paper proposes a novel lightweight CNN framework to identify print/display, video and wrap attacks. The proposed robust architecture provides seamless liveness detection, ensuring faster biometric authentication (1-2 seconds on CPU). Further, this paper also presents a newly created 2D spoof attack dataset consisting of more than 500 videos collected from 60 subjects. To validate the effectiveness of this architecture, we provide a demonstration video depicting print/display, video and wrap attack detection approaches. The demo can be viewed at the following link: https://rak.box.com/s/m1uf31fn5amtjp4mkgf1huh4ykfeibaa
[123] Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, Feng Xiao, Lizhen Cui
Main category: cs.CV
TL;DR: UniPruneBench is a unified benchmark for evaluating visual token pruning methods in large multimodal models, revealing that random pruning is surprisingly effective and no single method consistently outperforms others.
Details
Motivation: Large multimodal models suffer from inference inefficiency due to excessive visual tokens, and existing token compression methods lack consistent evaluation standards.
Method: Developed UniPruneBench with standardized protocols across 6 ability dimensions and 10 datasets, evaluating 10 compression algorithms on 3 LMM families (LLaVA-v1.5, Intern-VL3, Qwen2.5-VL) with both accuracy and system-level metrics.
Result: Key findings: random pruning is a strong baseline, no method consistently outperforms others, OCR tasks are the most pruning-sensitive, and pruning ratio is the dominant performance factor.
Conclusion: UniPruneBench provides a reliable foundation for future efficient multimodal modeling research, highlighting the need for scenario-specific pruning strategies.
Abstract: Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
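The random-pruning baseline from finding (1) is simple enough to sketch in a few lines. The function below is an illustration only and not part of UniPruneBench: the name `random_prune`, the `keep_ratio` parameter, and the choice to preserve the original token order are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def random_prune(tokens: Sequence[T], keep_ratio: float, seed: int = 0) -> List[T]:
    """Keep a uniformly random subset of visual tokens, in their original order.

    keep_ratio is the fraction of tokens retained, so the pruning ratio
    is 1 - keep_ratio. At least one token is always kept.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []
    n_keep = max(1, round(len(tokens) * keep_ratio))
    rng = random.Random(seed)  # local generator keeps the choice reproducible
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx]
```

For example, `random_prune(list(range(576)), keep_ratio=0.25)` keeps 144 of 576 tokens. Because the selection ignores token content entirely, any learned pruning method has to beat it to justify its extra cost.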
[124] Differentiable Hierarchical Visual Tokenization
Marius Aasan, Martine Hjelkrem-Tan, Nico Catalano, Changkyu Choi, Adín Ramírez Rivera
Main category: cs.CV
TL;DR: An end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while maintaining compatibility with existing Vision Transformer architectures.
Details
Motivation: Vision Transformers use fixed patch tokens that ignore spatial and semantic structure of images, limiting their ability to adapt to image content.
Method: Hierarchical model selection with information criteria to create content-adaptive tokens at pixel-level granularity in a differentiable manner.
Result: Competitive performance in both image-level classification and dense-prediction tasks, with additional capability for raster-to-vector conversion.
Conclusion: The proposed differentiable tokenizer successfully addresses limitations of fixed patch tokens while remaining backward-compatible with pretrained models.
Abstract: Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
[125] Modality-Transition Representation Learning for Visible-Infrared Person Re-Identification
Chao Yuan, Zanwu Liu, Guiwei Zhang, Haoxuan Xu, Yujian Zhao, Guanglin Niu, Bo Li
Main category: cs.CV
TL;DR: Proposes MTRL framework for VI-ReID using modality-transition representation learning with generated intermediate images as transmitters, achieving state-of-the-art performance without additional parameters.
Details
Motivation: Existing VI-ReID methods have limitations: they rely on intermediate representations through image generation or feature fusion, which don't effectively utilize intermediate features and lack interpretability. There's a substantial gap between visible and infrared modalities that needs better alignment.
Method: Uses modality-transition representation learning with middle generated images as transmitters from visible to infrared modalities. Employs modality-transition contrastive loss and modality-query regularization loss for training. The framework requires no additional parameters and maintains same inference speed as backbone.
Result: Significantly and consistently outperforms existing state-of-the-art methods on three typical VI-ReID datasets.
Conclusion: The proposed MTRL framework effectively aligns cross-modal features through modality-transition representation learning, achieving superior VI-ReID performance without increasing model complexity or reducing inference speed.
Abstract: Visible-infrared person re-identification (VI-ReID) can associate pedestrian images across visible and infrared modalities in practical scenarios where background illumination changes. However, a substantial gap inherently exists between these two modalities. Moreover, existing methods primarily rely on intermediate representations to align cross-modal features of the same person. These intermediate representations are usually created either by generating intermediate images (a kind of data enhancement) or by fusing intermediate features (which adds parameters and lacks interpretability), and they do not make good use of the intermediate features. Thus, we propose a novel VI-ReID framework via Modality-Transition Representation Learning (MTRL), which uses a generated middle image as a transmitter from the visible to the infrared modality; this image is fully aligned with the original visible image and similar to the infrared modality. The model is then trained with a modality-transition contrastive loss and a modality-query regularization loss, which align the cross-modal features more effectively. Notably, our proposed framework does not need any additional parameters, so it achieves the same inference speed as the backbone while improving its performance on the VI-ReID task. Extensive experimental results illustrate that our model significantly and consistently outperforms existing state-of-the-art methods on three typical VI-ReID datasets.
[126] VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
Zhicheng Zhang, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
Main category: cs.CV
TL;DR: Proposes a novel affective cues-guided reasoning framework for video emotion analysis using video emotion foundation models (VidEmo) with two-stage tuning and a large emotion-centric dataset (Emo-CFG).
Details
Motivation: Emotions are dynamic and cues-dependent, making it challenging to understand complex emotional states with reasonable rationale in video analysis.
Method: Uses affective cues-guided reasoning framework with video emotion foundation models (VidEmo) that undergo curriculum emotion learning and affective-tree reinforcement learning, supported by a 2.1M instruction-based dataset (Emo-CFG).
Result: Achieves competitive performance and sets new milestones across 15 face perception tasks.
Conclusion: The proposed framework effectively addresses the challenges in video emotion analysis through staged reasoning and comprehensive data infrastructure.
Abstract: Understanding and predicting emotion from videos has garnered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
[127] LLEXICORP: End-user Explainability of Convolutional Neural Networks
Vojtěch Kůr, Adam Bajger, Adam Kukučka, Marek Hradil, Vít Musil, Tomáš Brázdil
Main category: cs.CV
TL;DR: LLEXICORP automates concept naming and explanation generation in CNNs by combining concept relevance propagation with multimodal LLMs, making model interpretations more accessible.
Details
Motivation: Current concept relevance propagation methods require manual inspection and explanation synthesis, limiting scalability and accessibility of CNN interpretations.
Method: A modular pipeline that couples CRP with a multimodal LLM, using crafted prompts to teach CRP semantics and separate naming from explanation tasks.
Result: Automatically generates descriptive concept names and natural-language explanations that can be tailored to different audiences (experts vs. non-technical).
Conclusion: Integrating concept-based attribution with LLMs significantly lowers the barrier to interpreting deep neural networks, enabling more transparent AI systems.
Abstract: Convolutional neural networks (CNNs) underpin many modern computer vision systems. With applications ranging from common to critical areas, a need to explain and understand the model and its decisions (XAI) emerged. Prior works suggest that in the top layers of CNNs, the individual channels can be attributed to classifying human-understandable concepts. Concept relevance propagation (CRP) methods can backtrack predictions to these channels and find images that most activate these channels. However, current CRP workflows are largely manual: experts must inspect activation images to name the discovered concepts and must synthesize verbose explanations from relevance maps, limiting the accessibility of the explanations and their scalability. To address these issues, we introduce Large Language model EXplaIns COncept Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a multimodal large language model. Our approach automatically assigns descriptive names to concept prototypes and generates natural-language explanations that translate quantitative relevance distributions into intuitive narratives. To ensure faithfulness, we craft prompts that teach the language model the semantics of CRP through examples and enforce a separation between naming and explanation tasks. The resulting text can be tailored to different audiences, offering low-level technical descriptions for experts and high-level summaries for non-technical stakeholders. We qualitatively evaluate our method on various images from ImageNet on a VGG16 model. Our findings suggest that integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks, paving the way for more transparent AI systems.
[128] Dynamic Reflections: Probing Video Representations with Text Alignment
Tyler Zhu, Tengda Han, Leonidas Guibas, Viorica Pătrăucean, Maks Ovsjanikov
Main category: cs.CV
TL;DR: This paper presents the first comprehensive study of video-text representation alignment, exploring how modern video and language encoders align across modalities and how this alignment relates to downstream task performance.
Details
Motivation: While image-text alignment has been well-studied, the temporal nature of video data remains largely unexplored in cross-modal representation alignment research. The authors aim to fill this gap by systematically investigating video-text alignment.
Method: The study probes capabilities of modern video and language encoders through cross-modal alignment analysis. They propose parametric test-time scaling laws to capture alignment behavior and investigate correlations between semantic alignment and downstream task performance.
Result: Key findings show that cross-modal alignment highly depends on the richness of both visual (static vs. multi-frame) and text (single caption vs. collection) data. The proposed scaling laws demonstrate remarkable predictive power. Strong alignment correlates with general-purpose video representation and understanding, and temporal reasoning is linked to cross-modal alignment.
Conclusion: Video-text alignment serves as an informative zero-shot way to probe representation power of encoders for spatio-temporal data, providing insights into structural similarities and downstream capabilities across modalities.
Abstract: The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. Project page can be found at https://video-prh.github.io/
[129] PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
Antonio Oroz, Matthias Nießner, Tobias Kirschstein
Main category: cs.CV
TL;DR: PercHead is a method for single-image 3D head reconstruction and semantic 3D editing using perceptual supervision from DINOv2 and SAM2.1, achieving state-of-the-art novel-view synthesis and enabling intuitive geometry and appearance editing.
Details
Motivation: Address challenges in single-image 3D head reconstruction including view occlusions, weak perceptual supervision, and ambiguity in 3D editing.
Method: Uses dual-branch encoder with ViT-based decoder for 3D lifting via cross-attention, Gaussian Splatting rendering, and perceptual supervision from DINOv2/SAM2.1. For editing, swaps encoder and uses segmentation maps for geometry control with text/image prompts for appearance.
Result: Achieves state-of-the-art novel-view synthesis with exceptional robustness to extreme viewing angles. Enables intuitive semantic 3D editing through interactive GUI.
Conclusion: PercHead provides a unified framework for high-quality 3D head reconstruction and semantic editing, demonstrating strong performance and user-friendly editing capabilities.
Abstract: We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing - two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles compared to established baselines. Furthermore, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can effortlessly sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts. Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE
[130] When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye
Main category: cs.CV
TL;DR: MIRA is a new benchmark that evaluates models on tasks requiring generation of intermediate visual images (sketches, diagrams) for reasoning, mimicking human “drawing to think” processes. It contains 546 multimodal problems and shows current models struggle with text-only reasoning but improve significantly when provided visual cues.
Details
Motivation: Traditional Chain-of-Thought methods rely solely on text, but many complex problems require visual reasoning through intermediate images like sketches and diagrams, similar to how humans solve problems by “drawing to think”.
Method: Created MIRA benchmark with 546 multimodal problems requiring intermediate visual image generation. Proposed unified evaluation protocol with three input levels: direct input, text-only CoT, and Visual-CoT with annotated image clues and thinking prompts.
Result: Existing multimodal models perform poorly with text-only prompts but show 33.7% average relative improvement when intermediate visual cues are provided. Textual prompt optimization and expanded search space yield only limited improvements compared to Visual-CoT.
Conclusion: Intermediate visual information is critical for successful reasoning on complex multimodal tasks, highlighting the limitations of text-only reasoning approaches and the importance of visual imagination in problem-solving.
Abstract: We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through “drawing to think”. To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
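For readers unfamiliar with the pass@k and majority-voting metrics mentioned in the MIRA abstract, the following is a minimal Python sketch. It assumes the widely used unbiased pass@k estimator (n sampled answers per problem, c of them correct); the abstract does not say which estimator the authors use, so the formula and the function names `pass_at_k` and `majority_vote` are illustrative assumptions, not the benchmark's code.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k answers,
    drawn without replacement from n samples with c correct, is correct.
    (Assumed estimator; not confirmed by the abstract.)"""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical problem: 5 samples, 2 of them correct.
print(pass_at_k(n=5, c=2, k=1))
print(pass_at_k(n=5, c=2, k=2))
print(majority_vote(["B", "A", "B", "C", "B"]))
```

pass@k rewards a model if any of its k attempts is right, while majority voting rewards it only if the right answer is also the most common one, so the two can diverge sharply on hard problems.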
[131] AI-Generated Image Detection: An Empirical Study and Future Research Directions
Nusrat Tasnim, Kutub Uddin, Khalid Mahmood Malik
Main category: cs.CV
TL;DR: This paper introduces a unified benchmarking framework to systematically evaluate AI-generated media forensic methods, addressing gaps in non-standardized benchmarks, inconsistent training protocols, and limited evaluation metrics.
Details
Motivation: AI-generated media threats like deepfakes are eroding public trust and increasing fraud, while current forensic methods suffer from inconsistent evaluation standards that hinder fair comparison and deployment in security-critical applications.
Method: The authors developed a unified benchmarking framework that evaluates ten state-of-the-art forensic methods (scratch, frozen, and fine-tuned) across seven publicly available datasets (GAN and diffusion) using multiple metrics including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity.
Result: Evaluations revealed substantial variability in generalization, with some methods showing strong in-distribution performance but poor cross-model transferability. Model interpretability was analyzed using confidence curves and Grad-CAM heatmaps.
Conclusion: This study provides guidance for developing more robust, generalizable, and explainable forensic solutions by establishing standardized evaluation protocols and revealing the limitations of current approaches.
Abstract: The threats posed by AI-generated media, particularly deepfakes, are now raising significant challenges for multimedia forensics, misinformation detection, and biometric systems, resulting in erosion of public trust in the legal system, a significant increase in fraud, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) and seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
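Two of the metrics named in this abstract, ROC-AUC and class-wise sensitivity, can be sketched in a few lines of Python. This is a textbook illustration using the pairwise-ranking definition of AUC, not the authors' benchmarking code; the label convention (1 = AI-generated, 0 = real) and the toy inputs are assumptions.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """ROC-AUC as the probability that a random positive example is scored
    above a random negative one (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classwise_sensitivity(labels: list[int], preds: list[int]) -> dict[int, float]:
    """Per-class recall: the fraction of each true class predicted correctly."""
    result = {}
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        result[cls] = sum(preds[i] == cls for i in idx) / len(idx)
    return result


# Toy data, assuming 1 = AI-generated and 0 = real.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
preds = [0, 1, 1, 1]
print(roc_auc(labels, scores))
print(classwise_sensitivity(labels, preds))
```

Reporting sensitivity per class matters here because a detector can reach high overall accuracy while missing most fakes from an unseen generator, which is the cross-model failure the study highlights.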
[132] PLUTO-4: Frontier Pathology Foundation Models
Harshith Padigela, Shima Nofallah, Atchuth Naveen Chilaparasetti, Ryun Han, Andrew Walker, Judy Shen, Chintan Shah, Blake Martin, Aashish Sood, Elliot Miller, Ben Glass, Andy Beck, Harsha Pokkalla, Syed Ashar Javed
Main category: cs.CV
TL;DR: PLUTO-4 introduces two advanced pathology foundation models - a compact PLUTO-4S for efficient deployment and a frontier-scale PLUTO-4G for maximum performance - achieving state-of-the-art results across various pathology tasks.
Details
Motivation: To build on the progress of foundation models in pathology by creating next-generation models that can handle diverse histopathology tasks with improved efficiency and performance.
Method: Developed two Vision Transformer architectures: PLUTO-4S (compact, multi-scale with FlexiViT and 2D-RoPE) and PLUTO-4G (frontier-scale, single patch size). Pretrained using DINOv2 self-supervised objective on a large multi-institutional corpus of 551,164 WSIs from 137,144 patients across 50+ institutions.
Result: Achieved state-of-the-art performance across patch-level classification, segmentation, and slide-level diagnosis. PLUTO-4S provides high-throughput deployment, while PLUTO-4G shows 11% improvement in dermatopathology diagnosis and establishes new performance frontiers.
Conclusion: PLUTO-4 demonstrates strong potential to transform real-world pathology applications as a backbone for translational research and diagnostic use cases, with diverse improvements across multiple benchmarks.
Abstract: Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4’s potential to transform real-world applications as a backbone for translational research and diagnostic use cases.
[133] Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks
Dmitrii Pozdeev, Alexey Artemov, Ananta R. Bhattarai, Artem Sevastopolsky
Main category: cs.CV
TL;DR: DenseMarks is a learned representation for human heads that enables high-quality dense correspondences by predicting 3D embeddings for each pixel in a canonical unit cube, trained using contrastive learning with point matches from talking head videos.
Details
Motivation: To create a robust representation for human heads that can handle pose variations and cover the entire head including hair, enabling applications like semantic part matching, face/head tracking, and stereo reconstruction.
Method: Uses Vision Transformer to predict 3D pixel embeddings in canonical space, trained with contrastive loss on pairwise point matches from talking head videos, plus multi-task learning with face landmarks, segmentation constraints, and spatial continuity through latent cube features.
Result: Achieves state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models, with robustness to pose variations and full head coverage including hair.
Conclusion: DenseMarks provides an interpretable and queryable canonical space representation that ensures consistency across poses and individuals, enabling various applications in human head analysis and reconstruction.
Abstract: We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
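The idea of a contrastive loss that pulls the embeddings of matched points together can be illustrated with a small pure-Python sketch. The abstract does not specify the loss, so this uses an InfoNCE-style objective over negative squared distances as a stand-in; the function names, the temperature value, and the toy 3D embeddings are all hypothetical.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def info_nce(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss: emb_a[i] and emb_b[i] are embeddings of the same
    tracked point in two frames; every other emb_b[j] acts as a negative.
    (Assumed form; the paper's actual loss may differ.)"""
    n = len(emb_a)
    total = 0.0
    for i in range(n):
        logits = [-sq_dist(emb_a[i], emb_b[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / n


# Toy 3D embeddings inside the unit cube.
frame_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
aligned = list(frame_a)                    # matches land on the same location
shuffled = frame_a[1:] + frame_a[:1]       # matches land on the wrong location
print(info_nce(frame_a, aligned))
print(info_nce(frame_a, shuffled))
```

The loss is near zero when matched points share a canonical location and large when they do not, which is the behavior the abstract describes as encouraging matched points to have close embeddings.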
[134] Robust Identity Perceptual Watermark Against Deepfake Face Swapping
Tianyi Wang, Mengxiao Huang, Harry Cheng, Bin Ma, Yinglong Wang
Main category: cs.CV
TL;DR: Proposes a robust identity perceptual watermarking framework for proactive defense against Deepfake face swapping, enabling both detection and source tracing without requiring ground-truth images.
Details
Motivation: Deepfake face swapping causes critical privacy issues, and existing proactive defense approaches have unsatisfactory results in visual quality, detection accuracy, and source tracing ability.
Method: Assigns identity semantics to watermarks using chaotic encryption for confidentiality, trains encoder-decoder framework with adversarial manipulations for robust watermark encoding/recovery, and justifies consistency between content-matched identity watermark and recovered watermark.
Result: Extensive experiments show state-of-the-art detection and source tracing performance against Deepfake face swapping with promising watermark robustness for cross-dataset and cross-manipulation settings.
Conclusion: The proposed framework effectively addresses the research gap by providing robust proactive defense against Deepfake face swapping with both detection and source tracing capabilities.
Abstract: Notwithstanding offering convenience and entertainment to society, Deepfake face swapping has caused critical privacy issues with the rapid development of deep generative models. Due to imperceptible artifacts in high-quality synthetic images, passive detection models against face swapping in recent years usually suffer performance damping regarding the generalizability issue in cross-domain scenarios. Therefore, several studies have attempted to proactively protect the original images against malicious manipulations by inserting invisible signals in advance. However, existing proactive defense approaches demonstrate unsatisfactory results with respect to visual quality, detection accuracy, and source tracing ability. In this study, to fill the research gap, we propose a robust identity perceptual watermarking framework that concurrently performs detection and source tracing against Deepfake face swapping proactively. We innovatively assign identity semantics regarding the image contents to the watermarks and devise an unpredictable and nonreversible chaotic encryption system to ensure watermark confidentiality. The watermarks are robustly encoded and recovered by jointly training an encoder-decoder framework along with adversarial image manipulations. For a suspect image, falsification is accomplished by justifying the consistency between the content-matched identity perceptual watermark and the recovered robust watermark, without requiring the ground-truth. Moreover, source tracing can be accomplished based on the identity semantics that the recovered watermark carries. Extensive experiments demonstrate state-of-the-art detection and source tracing performance against Deepfake face swapping with promising watermark robustness for both cross-dataset and cross-manipulation settings.
[135] Training Convolutional Neural Networks with the Forward-Forward algorithm
Riccardo Scodellaro, Ajinkya Kulkarni, Frauke Alves, Matthias Schröter
Main category: cs.CV
TL;DR: Extends Forward-Forward algorithm to CNNs using Fourier patterns and morphological transformations for label distribution, enabling successful training on CIFAR datasets and revealing meaningful feature learning across layers.
Details
Motivation: To explore biologically inspired alternatives to backpropagation by adapting the Forward-Forward algorithm for convolutional neural networks, addressing limitations of fully connected implementations.
Method: Introduces spatially extended labeling strategies using Fourier patterns and morphological transformations to distribute label information across all spatial positions in convolutional layers.
Result: Successfully trained deeper FF-CNNs on CIFAR10, prevented shortcut solutions with morphology-based labels, scaled to 100 classes on CIFAR100, and revealed meaningful feature learning through Class Activation Maps.
Conclusion: Forward-Forward training is feasible for CNNs, provides insights into learning dynamics, and shows potential for neuromorphic computing and biologically inspired learning systems.
Abstract: Recent successes in image analysis with deep neural networks are achieved almost exclusively with Convolutional Neural Networks (CNNs), typically trained using the backpropagation (BP) algorithm. In a 2022 preprint, Geoffrey Hinton proposed the Forward-Forward (FF) algorithm as a biologically inspired alternative, where positive and negative examples are jointly presented to the network and training is guided by a locally defined goodness function. Here, we extend the FF paradigm to CNNs. We introduce two spatially extended labeling strategies, based on Fourier patterns and morphological transformations, that enable convolutional layers to access label information across all spatial positions. On CIFAR10, we show that deeper FF-trained CNNs can be optimized successfully and that morphology-based labels prevent shortcut solutions on datasets with more complex and fine features. On CIFAR100, carefully designed label sets scale effectively to 100 classes. Class Activation Maps reveal that FF-trained CNNs learn meaningful and complementary features across layers. Together, these results demonstrate that FF training is feasible beyond fully connected networks, provide new insights into its learning dynamics and stability, and highlight its potential for neuromorphic computing and biologically inspired learning.
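The "locally defined goodness function" mentioned in the abstract can be made concrete with a small Python sketch. It assumes the formulation commonly associated with Hinton's preprint, in which a layer's goodness is the sum of its squared activations and is compared against a threshold; the paper's CNN variant may define goodness differently, and the names `goodness`, `ff_layer_loss`, and `theta` are illustrative.

```python
import math


def goodness(activations):
    """Layer goodness, assumed here to be the sum of squared activations."""
    return sum(a * a for a in activations)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ff_layer_loss(pos_acts, neg_acts, theta):
    """Local loss for one layer: push goodness above theta for positive
    (correctly labeled) inputs and below theta for negative inputs.
    No gradient from later layers is needed to evaluate it."""
    return softplus(theta - goodness(pos_acts)) + softplus(goodness(neg_acts) - theta)


# A layer that responds strongly to positive data and weakly to negative data
# has a low loss; the reverse response pattern is penalized heavily.
print(ff_layer_loss(pos_acts=[3.0, 3.0], neg_acts=[0.1, 0.1], theta=2.0))
print(ff_layer_loss(pos_acts=[0.1, 0.1], neg_acts=[3.0, 3.0], theta=2.0))
```

Because each layer optimizes its own objective, the label must be visible in the layer's input, which is why the paper's spatially extended labels matter for convolutional layers whose filters only see local patches.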
[136] Deep Fourier-embedded Network for RGB and Thermal Salient Object Detection
Pengfei Lyu, Xiaosheng Yu, Pak-Hei Yeung, Chengdong Wu, Jagath C. Rajapakse
Main category: cs.CV
TL;DR: FreqSal is a Fourier Transform-based RGB-T salient object detection model that uses linear complexity FFT for efficient bimodal fusion, edge enhancement, and frequency-aware decoding, outperforming 29 existing methods.
Details
Motivation: Existing Transformer-based RGB-T SOD models have quadratic complexity and are memory-intensive, limiting high-resolution bimodal feature fusion applications.
Method: Uses Fast Fourier Transform with linear complexity for three components: Modal-coordinated Perception Attention for bimodal fusion, Frequency-decomposed Edge-aware Block for edge enhancement, and Fourier Residual Channel Attention Block for feature decoding, plus Co-focus Frequency Loss for frequency gap reduction.
Result: Extensive experiments on ten bimodal SOD benchmark datasets show FreqSal outperforms twenty-nine state-of-the-art bimodal SOD models.
Conclusion: FreqSal provides an efficient and accurate solution for RGB-T salient object detection using Fourier Transform with linear complexity, validated by comprehensive ablation studies.
Abstract: The rapid development of deep learning has significantly improved salient object detection (SOD) combining both RGB and thermal (RGB-T) images. However, existing Transformer-based RGB-T SOD models with quadratic complexity are memory-intensive, limiting their application in high-resolution bimodal feature fusion. To overcome this limitation, we propose a purely Fourier Transform-based model, namely Deep Fourier-embedded Network (FreqSal), for accurate RGB-T SOD. Specifically, we leverage the efficiency of Fast Fourier Transform with linear complexity to design three key components: (1) To fuse RGB and thermal modalities, we propose Modal-coordinated Perception Attention, which aligns and enhances bimodal Fourier representation in multiple dimensions; (2) To clarify object edges and suppress noise, we design Frequency-decomposed Edge-aware Block, which deeply decomposes and filters Fourier components of low-level features; (3) To accurately decode features, we propose Fourier Residual Channel Attention Block, which prioritizes high-frequency information while aligning channel-wise global relationships. Additionally, even when converged, existing deep learning-based SOD models’ predictions still exhibit frequency gaps relative to ground-truth. To address this problem, we propose Co-focus Frequency Loss, which dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing bimodal edge information in the Fourier domain. Extensive experiments on ten bimodal SOD benchmark datasets demonstrate that FreqSal outperforms twenty-nine existing state-of-the-art bimodal SOD models. Comprehensive ablation studies further validate the value and effectiveness of our newly proposed components. The code is available at https://github.com/JoshuaLPF/FreqSal.
[137] Visual Program Distillation with Template-Based Augmentation
Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem
Main category: cs.CV
TL;DR: A low-cost visual program distillation method that enables small language models (≤1B parameters) to generate specialized visual programs without human annotations, using synthetic data augmentation with skill templates.
Details
Motivation: Adapting visual programming or prompting LLMs for visual tasks like VQA is challenging due to high annotation and inference costs, especially for specialized domains.
Method: Visual program distillation using synthetic data augmentation that decouples programs into higher-level skill templates and their arguments, requiring no human-generated program annotations.
Result: Small language models can generate high-quality specialized visual programs with relatively small question/answer data, achieving much faster inference.
Conclusion: The proposed method enables efficient visual program generation for specialized tasks without expensive annotations, making visual programming more accessible with smaller models.
Abstract: Adapting visual programming, or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA), to specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.
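The template/argument decoupling described above can be illustrated with a minimal sketch. The template strings, the `find` and `query` operations that appear inside them, and the `slots` and `augment` helpers are all hypothetical; the summary does not specify the authors' program format.

```python
import re
from itertools import product

# Hypothetical skill templates: program text with named argument slots.
TEMPLATES = {
    "count": "boxes = find(image, '{object}')\nanswer = len(boxes)",
    "attribute": "box = find(image, '{object}')[0]\nanswer = query(box, 'What {attribute} is this?')",
}

def slots(template):
    """Return the argument slot names of a template, in order of first appearance."""
    seen = []
    for name in re.findall(r"\{(\w+)\}", template):
        if name not in seen:
            seen.append(name)
    return seen

def augment(template, candidates):
    """Fill a template with every combination of candidate argument values."""
    names = slots(template)
    programs = []
    for values in product(*(candidates[n] for n in names)):
        programs.append(template.format(**dict(zip(names, values))))
    return programs

# One template plus a few arguments yields several synthetic programs.
programs = augment(
    TEMPLATES["attribute"],
    {"object": ["car", "dog"], "attribute": ["color", "size"]},
)
```

Because the skill (the template) is separated from its arguments, new training programs can be produced by swapping arguments without writing any program by hand.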
[138] Image Super-Resolution with Guarantees via Conformalized Generative Models
Eduardo Adame, Daniel Csillag, Guilherme Tegoni Goedert
Main category: cs.CV
TL;DR: A conformal prediction-based method for creating confidence masks that reliably indicate trustworthy regions in images generated by black-box ML foundation models for restoration tasks.
Details
Motivation: The increasing use of generative ML foundation models for image restoration tasks requires robust and interpretable uncertainty quantification methods.
Method: Novel approach based on conformal prediction techniques to create confidence masks, adaptable to any black-box generative model, requiring only easily attainable calibration data and customizable via local image similarity metrics.
Result: Proven strong theoretical guarantees for fidelity error control, reconstruction quality, and robustness against data leakage, with solid empirical performance established through evaluation.
Conclusion: The method provides reliable and intuitive uncertainty quantification for generative image restoration models, offering strong theoretical guarantees and practical adaptability to black-box models.
Abstract: The increasing use of generative ML foundation models for image restoration tasks such as super-resolution calls for robust and interpretable uncertainty quantification methods. We address this need by presenting a novel approach based on conformal prediction techniques to create a ‘confidence mask’ capable of reliably and intuitively communicating where the generated image can be trusted. Our method is adaptable to any black-box generative model, including those locked behind an opaque API, requires only easily attainable data for calibration, and is highly customizable via the choice of a local image similarity metric. We prove strong theoretical guarantees for our method that span fidelity error control (according to our local image similarity metric), reconstruction quality, and robustness in the face of data leakage. Finally, we empirically evaluate these results and establish our method’s solid performance.
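As a rough illustration of how calibration data can yield a confidence mask, the sketch below uses standard split conformal prediction with scaled residuals. This is a generic textbook technique swapped in for illustration, not the authors' procedure: the per-pixel uncertainty values, the tolerance, and the function names are assumptions, and the usual coverage guarantee additionally assumes that calibration and test scores are exchangeable.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def confidence_mask(uncertainty, q, tolerance):
    """Mark a pixel as trusted when its calibrated error bound q * u is within tolerance."""
    return [[q * u <= tolerance for u in row] for row in uncertainty]

# Calibration scores: |truth - prediction| / uncertainty (made-up numbers).
cal_scores = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
q = conformal_quantile(cal_scores, alpha=0.25)

# Hypothetical per-pixel uncertainty for a 2x2 generated image.
mask = confidence_mask([[0.01, 0.05], [0.02, 0.10]], q, tolerance=0.1)
```

Only calibration pairs and the model's outputs are needed here, which is why this style of method can wrap a black-box generator.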
[139] Mobile Robotic Multi-View Photometric Stereo
Suryansh Kumar
Main category: cs.CV
TL;DR: Proposes a mobile robotic system for Multi-View Photometric Stereo (MVPS) that enables 3D acquisition without fixed camera/light setups, using incremental learning-based approach for surface normal/depth prediction and optimization.
Details
Motivation: Traditional MVPS requires fixed camera and calibrated light sources, limiting mobile robotics applications. The paper aims to bring MVPS benefits to movable platforms for robotic automation in photogrammetry.
Method: Uses supervised learning to predict per-view surface normal, object depth, and uncertainty. Solves MVPS-driven optimization for refined depth maps, then fuses them with camera pose tracking for globally consistent 3D geometry.
Result: Achieves local high-frequency surface detail recovery with globally consistent object shape. Works on objects with unknown reflectance using fewer frames, without a lengthy calibration and installation process. Nearly 100x faster than state-of-the-art MVPS methods while maintaining similar results on the benchmark DiLiGenT MV dataset.
Conclusion: The proposed mobile robotic MVPS system enables computationally efficient robotic automation for photogrammetry, overcoming limitations of traditional fixed setups while maintaining high-quality 3D reconstruction.
Abstract: Multi-View Photometric Stereo (MVPS) is a popular method for fine-detailed 3D acquisition of an object from images. Despite its outstanding results on diverse material objects, a typical MVPS experimental setup requires a well-calibrated light source and a monocular camera installed on an immovable base. This restricts the use of MVPS on a movable platform, preventing mobile robotics applications from benefiting from MVPS in 3D acquisition. To this end, we introduce a new mobile robotic system for MVPS. While the proposed system brings advantages, it introduces additional algorithmic challenges. Addressing them, in this paper, we further propose an incremental approach for mobile robotic MVPS. Our approach leverages a supervised learning setup to predict per-view surface normal, object depth, and per-pixel uncertainty in model-predicted results. A refined depth map per view is obtained by solving an MVPS-driven optimization problem proposed in this paper. Later, we fuse the refined depth map while tracking the camera pose w.r.t. the reference frame to recover globally consistent object 3D geometry. Experimental results show the advantages of our robotic system and algorithm, featuring local high-frequency surface detail recovery with globally consistent object shape. Our work goes beyond any MVPS system yet presented, providing encouraging results on objects with unknown reflectance properties using fewer frames without a tiring calibration and installation process, enabling a computationally efficient robotic automation approach to photogrammetry. The proposed approach is nearly 100 times computationally faster than state-of-the-art MVPS methods such as [1, 2] while maintaining similar results when tested on subjects taken from the benchmark DiLiGenT MV dataset [3].
[140] Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms
Kangning Cui, Rongkun Zhu, Manqi Wang, Wei Tang, Gregory D. Larsen, Victor P. Pauca, Sarra Alqahtani, Fan Yang, David Segurado, David Lutz, Jean-Michel Morel, Miles R. Silman
Main category: cs.CV
TL;DR: PRISM is a pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images, addressing challenges like overlapping crowns and heterogeneous landscapes.
Details
Motivation: Palms are important ecological and economic indicators, but mapping naturally occurring palms in dense forests is limited by technical challenges like overlapping crowns and uneven shading.
Method: Developed PRISM pipeline using large UAV-derived orthomosaics, evaluated multiple object detectors, integrated SAM 2 for segmentation, and applied calibration methods for confidence scores and saliency maps.
Result: Created a large dataset from 21 sites in Ecuador with 8,830 bounding boxes and 5,026 palm center points, and demonstrated the pipeline’s effectiveness for palm detection.
Conclusion: PRISM provides a flexible solution for palm detection in dense forests and is adaptable for other natural objects, with future work focusing on transfer learning for lower-resolution datasets.
Abstract: Palms are ecological and economic indicators of tropical forest health, biodiversity, and human impact that support local economies and global forest product supply chains. While palm detection in plantations is well-studied, efforts to map naturally occurring palms in dense forests remain limited by overlapping crowns, uneven shading, and heterogeneous landscapes. We develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images. Orthomosaics are created from thousands of aerial images and span several to hundreds of gigabytes. Our contributions are threefold. First, we construct a large UAV-derived orthomosaic dataset collected across 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. Second, we evaluate multiple state-of-the-art object detectors based on efficiency and performance, integrating zero-shot SAM 2 as the segmentation backbone, and refining the results for precise geographic mapping. Third, we apply calibration methods to align confidence scores with IoU and explore saliency maps for feature explainability. Though optimized for palms, PRISM is adaptable for identifying other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution datasets (0.5 to 1m).
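Aligning confidence scores with IoU presupposes a way to measure both. The sketch below computes IoU for axis-aligned boxes and a simple mean absolute gap between confidence and IoU. The `(x_min, y_min, x_max, y_max)` box layout and the `calibration_gap` metric are assumptions made for illustration; the summary does not say which calibration methods or metrics PRISM uses.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def calibration_gap(confidences, ious):
    """Mean absolute difference between predicted confidence and achieved IoU."""
    return sum(abs(c - i) for c, i in zip(confidences, ious)) / len(confidences)

# Two overlapping 2x2 boxes share a 1x1 region: IoU = 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A gap near zero would mean a detector's confidence can be read roughly as the expected overlap with the true palm crown.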
[141] Prompt to Restore, Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal
Rongxin Liao, Feng Li, Yanyan Wei, Zenglin Shi, Le Zhang, Huihui Bai, Meng Wang
Main category: cs.CV
TL;DR: CyclicPrompt is a novel cyclic prompt approach for universal adverse weather removal that integrates weather information and context-aware representations through a composite prompt and erase-and-paste mechanism, forming a “Prompt-Restore-Prompt” pipeline.
Details
Motivation: To enhance the effectiveness, adaptability, and generalizability of universal adverse weather removal (UAWR) by addressing various weather degradations in a unified framework, overcoming limitations of previous prompt learning methods.
Method: Proposes CyclicPrompt with two key components: 1) Composite context prompt that integrates weather-related information and context-aware representations using learnable input-conditional vectors with weather-specific knowledge; 2) Erase-and-paste mechanism that substitutes weather-specific knowledge with constrained restoration priors after initial restoration to fine-tune the process.
Result: Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt compared to existing methods.
Conclusion: CyclicPrompt effectively harnesses weather-specific knowledge, textual contexts, and reliable textures through its cyclic “Prompt-Restore-Prompt” pipeline, demonstrating improved adaptability and generalizability for universal adverse weather removal tasks.
Abstract: Universal adverse weather removal (UAWR) seeks to address various weather degradations within a unified framework. Recent methods are inspired by prompt learning using pre-trained vision-language models (e.g., CLIP), leveraging degradation-aware prompts to facilitate weather-free image restoration, yielding significant improvements. In this work, we propose CyclicPrompt, an innovative cyclic prompt approach designed to enhance the effectiveness, adaptability, and generalizability of UAWR. CyclicPrompt comprises two key components: 1) a composite context prompt that integrates weather-related information and context-aware representations into the network to guide restoration. This prompt differs from previous methods by marrying learnable input-conditional vectors with weather-specific knowledge, thereby improving adaptability across various degradations. 2) The erase-and-paste mechanism, after the initial guided restoration, substitutes weather-specific knowledge with constrained restoration priors, inducing high-quality weather-free concepts into the composite prompt to further fine-tune the restoration process. Therefore, we can form a cyclic “Prompt-Restore-Prompt” pipeline that adeptly harnesses weather-specific knowledge, textual contexts, and reliable textures. Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt. The code is available at: https://github.com/RongxinL/CyclicPrompt.
[142] RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing
Fengxiang Wang, Yulin Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Hongzhen Wang, Di Wang, Long Lan, Wenjing Yang, Jing Zhang
Main category: cs.CV
TL;DR: RoMA enables scalable self-supervised pretraining of Mamba-based remote sensing foundation models, overcoming Vision Transformers’ quadratic complexity limitations through rotation-aware mechanisms and multi-scale token prediction.
Details
Motivation: Vision Transformers face quadratic complexity barriers for large models and high-resolution remote sensing images, while existing Mamba applications are limited to supervised tasks on small datasets.
Method: Proposes RoMA framework with rotation-aware pretraining (adaptive cropping + angular embeddings) and multi-scale token prediction objectives to handle arbitrary orientations and extreme scale variations in remote sensing imagery.
Result: Mamba adheres to RS data and parameter scaling laws, with RoMA-pretrained models consistently outperforming ViT-based counterparts in accuracy and computational efficiency across scene classification, object detection, and semantic segmentation tasks.
Conclusion: RoMA successfully enables scalable self-supervised pretraining of Mamba-based RS foundation models, demonstrating superior performance and efficiency compared to Vision Transformer approaches.
Abstract: Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.
[143] Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
Main category: cs.CV
TL;DR: RAM is a rewriting-driven data augmentation method for Vision-Language Navigation that creates new observation-instruction pairs by rewriting training data, improving generalization without additional simulators or manual data collection.
Details
Motivation: Data scarcity in VLN hinders agent generalization to unseen environments. Previous methods rely on limited simulator data or noisy web-collected data requiring manual cleaning.
Method: Uses object-enriched observation rewriting with VLMs/LLMs to generate diverse scene descriptions, then observation-contrast instruction rewriting to create aligned instructions. Includes mixing-then-focusing training with random cropping.
Result: Superior performance on R2R, REVERIE, R4R (discrete) and R2R-CE (continuous) datasets, showing impressive generalization ability.
Conclusion: RAM provides an effective simulator-free and labor-saving data augmentation paradigm for VLN that significantly improves generalization to unseen environments.
Abstract: Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
[144] The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs
Jonathan Sauder, Viktor Domazetoski, Guilhem Banc-Prandi, Gabriela Perna, Anders Meibom, Devis Tuia
Main category: cs.CV
TL;DR: The paper introduces Coralscapes, the first large-scale semantic segmentation dataset for coral reefs with 2075 images, 39 benthic classes, and 174k expert-annotated masks, enabling benchmarking of computer vision models for automated coral reef monitoring.
Details
Motivation: Coral reefs are declining worldwide, but conventional monitoring methods are limited by scalability due to reliance on expert labor. Computer vision tools could automate coral identification but have been impeded by lack of large, high-quality datasets.
Method: Created Coralscapes dataset following the structure of Cityscapes dataset, with 2075 expert-annotated images covering 39 benthic classes and 174k segmentation masks. Benchmarked various semantic segmentation models using this dataset.
Result: Transfer learning from Coralscapes to existing smaller datasets consistently achieved state-of-the-art performance. The dataset enables effective benchmarking of semantic segmentation models in this challenging domain.
Conclusion: Coralscapes will catalyze research on efficient and scalable coral reef surveying methods using computer vision, potentially streamlining the development of underwater ecological robotics for conservation and restoration efforts.
Abstract: Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large high quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.
[145] 3DBonsai: Structure-Aware Bonsai Modeling Using Conditioned 3D Gaussian Splatting
Hao Wu, Hao Wang, Ruochong Li, Xuran Ma, Hui Xiong
Main category: cs.CV
TL;DR: 3DBonsai is a novel text-to-3D framework that generates complex 3D bonsai structures using trainable 3D space colonization algorithms and 3D Gaussian priors, outperforming existing methods.
Details
Motivation: Previous text-to-3D generation methods lack detailed structural information, limiting them to simple objects and struggling with intricate structures like bonsai.
Method: Uses trainable 3D space colonization algorithm with random sampling and point cloud augmentation to create 3D Gaussian priors. Features two pipelines: fine structure conditioned generation (initializes 3D Gaussians with 3D structure prior) and coarse structure conditioned generation (uses multi-view structure consistency module).
Result: Significantly outperforms existing methods in generating complex 3D bonsai structures, establishing a new benchmark for structure-aware 3D generation.
Conclusion: 3DBonsai provides an effective framework for generating detailed and complex 3D bonsai structures, addressing limitations of previous methods in handling intricate structural details.
Abstract: Recent advancements in text-to-3D generation have shown remarkable results by leveraging 3D priors in combination with 2D diffusion. However, previous methods utilize 3D priors that lack detailed and complex structural information, limiting them to generating simple objects and presenting challenges for creating intricate structures such as bonsai. In this paper, we propose 3DBonsai, a novel text-to-3D framework for generating 3D bonsai with complex structures. Technically, we first design a trainable 3D space colonization algorithm to produce bonsai structures, which are then enhanced through random sampling and point cloud augmentation to serve as the 3D Gaussian priors. We introduce two bonsai generation pipelines with distinct structural levels: fine structure conditioned generation, which initializes 3D Gaussians using a 3D structure prior to produce detailed and complex bonsai, and coarse structure conditioned generation, which employs a multi-view structure consistency module to align 2D and 3D structures. Moreover, we have compiled a unified 2D and 3D Chinese-style bonsai dataset. Our experimental results demonstrate that 3DBonsai significantly outperforms existing methods, providing a new benchmark for structure-aware 3D bonsai generation.
[146] Towards Predicting Any Human Trajectory In Context
Ryo Fujii, Hideo Saito, Ryo Hachiuma
Main category: cs.CV
TL;DR: TrajICL is an In-Context Learning framework for pedestrian trajectory prediction that enables adaptation without fine-tuning by using spatio-temporal similarity and prediction-guided example selection from large-scale synthetic data.
Details
Motivation: Current pedestrian trajectory prediction methods require impractical fine-tuning for each new scenario, especially for edge device deployment. There's a need for adaptable models that can handle different environments without weight updates.
Method: Uses In-Context Learning with spatio-temporal similarity-based example selection (STES) and prediction-guided example selection (PG-ES) to identify relevant motion patterns from previously observed trajectories, trained on large-scale synthetic datasets.
Result: Achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks.
Conclusion: TrajICL provides an effective framework for pedestrian trajectory prediction that enables practical deployment without the need for scenario-specific fine-tuning, demonstrating superior performance through in-context learning.
Abstract: Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that adapts to scenario-specific data at inference time without fine-tuning or weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
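The idea of picking in-context examples by similar motion at corresponding locations can be sketched with a plain nearest-neighbor search. This is a simplified stand-in, not the authors' STES or PG-ES: the distance function, the dictionary layout of the example pool, and the function names are assumptions.

```python
import math

def trajectory_distance(a, b):
    """Mean Euclidean distance between corresponding points of two equal-length trajectories."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def select_examples(query_past, pool, k):
    """Return the k pool entries whose past trajectory lies closest to the query."""
    return sorted(pool, key=lambda ex: trajectory_distance(query_past, ex["past"]))[:k]

# Hypothetical pool of previously observed trajectories in the same scene.
pool = [
    {"id": "a", "past": [(0.0, 0.1), (1.0, 0.1)], "future": [(2.0, 0.1)]},
    {"id": "b", "past": [(5.0, 5.0), (6.0, 5.0)], "future": [(7.0, 5.0)]},
    {"id": "c", "past": [(0.0, -0.5), (1.0, -0.5)], "future": [(2.0, -0.5)]},
]
chosen = select_examples([(0.0, 0.0), (1.0, 0.0)], pool, k=2)
```

Using absolute coordinates means both the shape of the motion and where it happens affect the ranking. A prediction-guided variant would, under the same assumptions, concatenate a predicted future onto the query before measuring distance.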
[147] FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks
Tianyi Wang, Harry Cheng, Ming-Hui Liu, Mohan Kankanhalli
Main category: cs.CV
TL;DR: FractalForensics uses fractal-based watermarks for proactive Deepfake detection and localization, providing explainable results by highlighting manipulated areas.
Details
Motivation: Existing proactive Deepfake detectors lack localization functionality and explainability, and have unstable watermark robustness affecting detection performance.
Method: Parameter-driven fractal watermark generation with one-way encryption, semi-fragile watermarking framework robust to benign operations but fragile to Deepfake manipulations, and entry-to-patch strategy for localization.
Result: Outperforms state-of-the-art semi-fragile watermarking algorithms and passive detectors, with satisfactory robustness against image processing and fragility against Deepfake manipulations.
Conclusion: FractalForensics provides effective proactive Deepfake detection with localization capability and explainable results through fractal watermarks.
Abstract: Proactive Deepfake detection via robust watermarks has seen interest ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the detection performance. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and performs one-way encryption of the selected parameters. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Moreover, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.
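The entry-to-patch idea implies that a mismatch in one watermark entry points at one image patch. The sketch below shows only that localization step, under stated assumptions: the watermark is a small matrix of discrete entries, the image is split into an equal grid with one patch per entry, and the `localize` helper is hypothetical. Embedding and recovery, which the paper handles with a trained network, are not modeled.

```python
def localize(expected, recovered, image_size):
    """Return pixel boxes (top, left, bottom, right) of patches whose watermark entry changed."""
    rows, cols = len(expected), len(expected[0])
    patch_h, patch_w = image_size[0] // rows, image_size[1] // cols
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if expected[r][c] != recovered[r][c]:
                boxes.append((r * patch_h, c * patch_w, (r + 1) * patch_h, (c + 1) * patch_w))
    return boxes

# A 2x2 watermark on a 256x256 image; the bottom-left entry was destroyed.
expected = [[1, 0], [0, 1]]
recovered = [[1, 0], [1, 1]]
tampered = localize(expected, recovered, image_size=(256, 256))
```

An empty result would indicate that every entry survived, which under a semi-fragile scheme corresponds to benign processing only.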
[148] Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection
Huan Zheng, Wencheng Han, Tianyi Yan, Cheng-zhong Xu, Jianbing Shen
Main category: cs.CV
TL;DR: GTA-Net uses temporal information from consecutive frames to improve 3D lane detection by learning geometric consistency and enhancing lane integrity through pseudo future perspectives.
Details
Motivation: Existing monocular 3D lane detection methods suffer from inherent ambiguity in single-frame input, leading to inaccurate geometric predictions and poor lane integrity, especially for distant lanes.
Method: Proposes Geometry-aware Temporal Aggregation Network (GTA-Net) with two modules: Temporal Geometry Enhancement Module (TGEM) for geometric consistency across frames, and Temporal Instance-aware Query Generation (TIQG) that aggregates instance cues and synthesizes pseudo future perspectives.
Result: GTA-Net achieves new state-of-the-art results, significantly outperforming existing monocular 3D lane detection solutions.
Conclusion: Leveraging temporal information from vehicle motion effectively addresses the limitations of single-frame 3D lane detection, enabling more accurate geometric predictions and better lane integrity.
Abstract: Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. However, existing methods are fundamentally constrained by the inherent ambiguity of single-frame input, which leads to inaccurate geometric predictions and poor lane integrity, especially for distant lanes. To overcome this, we propose to unlock the rich information embedded in the temporal evolution of the scene as the vehicle moves. Our proposed Geometry-aware Temporal Aggregation Network (GTA-Net) systematically leverages the temporal information from complementary perspectives. First, the Temporal Geometry Enhancement Module (TGEM) learns geometric consistency across consecutive frames, effectively recovering depth information from motion to build a reliable 3D scene representation. Second, to enhance lane integrity, the Temporal Instance-aware Query Generation (TIQG) module aggregates instance cues from past and present frames. Crucially, for lanes that are ambiguous in the current view, TIQG innovatively synthesizes a pseudo future perspective to generate queries that reveal lanes which would otherwise be missed. The experiments demonstrate that GTA-Net achieves new SoTA results, significantly outperforming existing monocular 3D lane detection solutions.
[149] GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang
Main category: cs.CV
TL;DR: GeoLLaVA-8K is the first remote sensing multimodal large language model that handles 8K×8K resolution images by addressing data scarcity with new high-resolution datasets and token explosion through object-centric token selection strategies.
Details
Motivation: Ultra-high-resolution remote sensing imagery poses challenges for multimodal foundation models due to limited training data and token explosion from large image sizes.
Method: Introduced SuperRS-VQA and HighRS-VQA datasets, and proposed Background Token Pruning and Anchored Token Selection to reduce memory footprint while preserving key semantics. Built GeoLLaVA-8K on LLaVA framework.
Result: GeoLLaVA-8K achieves state-of-the-art performance on XLRS-Bench, capable of handling inputs up to 8K×8K resolution.
Conclusion: The proposed techniques effectively address data scarcity and token explosion challenges in ultra-high-resolution remote sensing, enabling the first RS-focused multimodal model to handle 8K resolution images.
Abstract: Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376×8,376) and HighRS-VQA (avg. 2,000×1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K×8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
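To give a feel for the "token explosion" bottleneck, the short calculation below counts patch tokens for a small image and for an image at the average SuperRS-VQA size. It is only a back-of-the-envelope sketch: the 14-pixel patch size, the 336×336 baseline, and the naive one-token-per-patch tiling are assumptions made here, not details taken from the abstract.

```python
import math

def visual_token_count(height, width, patch_size=14):
    """Number of patch tokens when an image is cut into square patches (edges padded)."""
    return math.ceil(height / patch_size) * math.ceil(width / patch_size)

# Assumed baseline: a small square input, one token per 14x14 patch.
baseline_tokens = visual_token_count(336, 336)

# Average SuperRS-VQA image size quoted in the abstract, same assumed patch size.
uhr_tokens = visual_token_count(8376, 8376)

print(baseline_tokens, uhr_tokens, round(uhr_tokens / baseline_tokens))
```

Under these assumptions the ultra-high-resolution image yields several hundred times more tokens than the small one, which is the gap that pruning and token selection aim to close.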
[150] OmniEarth-Bench: Towards Holistic Evaluation of Earth’s Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
Fengxiang Wang, Mingshuo Chen, Xuming He, Yueying Li, YiFan Zhang, Feng Liu, Zijie Guo, Zhenghao Hu, Jiong Wang, Jingyi Xu, Zhangrui Li, Fenghua Ling, Ben Fei, Weijia Li, Long Lan, Wenjing Yang, Wenlong Zhang, Lei Bai
Main category: cs.CV
TL;DR: OmniEarth-Bench is the first multimodal benchmark that systematically covers all six Earth spheres and their interactions, with 29,855 expert-curated annotations across 109 tasks, revealing that current MLLMs struggle significantly (under 35% accuracy).
Details
Motivation: Existing Earth science benchmarks are limited in scope, covering only a few spheres and tasks, with narrow data sources and constrained scientific granularity.
Method: Built using a scalable, modular-topology data inference framework with multi-observation sources and expert-in-the-loop curation to create standardized annotations organized in a four-level hierarchy.
Result: Experiments on 9 state-of-the-art MLLMs show all models perform poorly, with none reaching 35% accuracy, indicating systematic gaps in Earth-system cognitive ability.
Conclusion: The benchmark reveals significant limitations in current multimodal models’ understanding of Earth systems and provides a comprehensive evaluation framework for future development.
Abstract: Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth’s spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere of atmosphere and to at most 16 tasks. These limitations include narrow-source heterogeneity (single/few data sources), constrained scientific granularity, and limited-sphere extensibility. Therefore, we introduce OmniEarth-Bench, the first multimodal benchmark that systematically spans all six spheres: atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere, and cross-spheres. Built with a scalable, modular-topology data inference framework and native multi-observation sources and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35% accuracy, revealing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code were released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).
[151] DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging
Felix Wagner, Pramit Saha, Harry Anthony, J. Alison Noble, Konstantinos Kamnitsas
Main category: cs.CV
TL;DR: DIsoN is a decentralized OOD detection framework that enables comparison of training and test data without sharing data, using only model parameter exchange between remote nodes.
Details
Motivation: Current OOD detection methods either discard training data after deployment or assume centralized storage of training and test data, which is impractical due to data size, privacy, and proprietary constraints in real-world medical imaging applications.
Method: Uses Isolation Networks to quantify separation difficulty between test samples and training data via binary classification. Extends to Decentralized Isolation Networks (DIsoN) that exchange only model parameters between training and deployment nodes, with class-conditioning to compare samples only with training data of predicted class.
Result: Evaluated on 4 medical imaging datasets across 12 OOD detection tasks, DIsoN performs favorably against existing methods while maintaining data privacy.
Conclusion: DIsoN enables secure, decentralized OOD detection services where ML developers can provide remote utilization of training data for OOD detection without sharing the actual data.
Abstract: Safe deployment of machine learning (ML) models in safety-critical domains such as medical imaging requires detecting inputs with characteristics not seen during training, known as out-of-distribution (OOD) detection, to prevent unreliable predictions. Effective OOD detection after deployment could benefit from access to the training data, enabling direct comparison between test samples and the training data distribution to identify differences. State-of-the-art OOD detection methods, however, either discard the training data after deployment or assume that test samples and training data are centrally stored together, an assumption that rarely holds in real-world settings. This is because shipping the training data with the deployed model is usually impossible due to the size of training databases, as well as proprietary or privacy constraints. We introduce the Isolation Network, an OOD detection framework that quantifies the difficulty of separating a target test sample from the training data by solving a binary classification task. We then propose Decentralized Isolation Networks (DIsoN), which enables the comparison of training and test data when data-sharing is impossible, by exchanging only model parameters between the remote computational nodes of training and deployment. We further extend DIsoN with class-conditioning, comparing a target sample solely with training data of its predicted class. We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks. DIsoN performs favorably against existing methods while respecting data-privacy. This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: providing remote, secure utilization of their training data for OOD detection services. Code: https://github.com/FelixWag/DIsoN
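The idea of scoring a test sample by how hard it is to separate from the training data can be illustrated with a toy linear classifier. The sketch below is not the authors' implementation: the 2-D features, the balanced logistic loss, and the use of "gradient steps until the sample is isolated" as the difficulty measure are all assumptions chosen for illustration, and the decentralized parameter exchange is not modelled.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(w, b, x):
    """Linear classifier logit for feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def isolation_steps(train, target, lr=0.5, max_steps=500):
    """Count gradient steps until a linear classifier isolates `target` from `train`.

    The target has label 1 and every training point label 0. The single target
    is weighted as heavily as the whole training set (an assumed balancing choice).
    """
    w, b, n = [0.0] * len(target), 0.0, len(train)
    for step in range(1, max_steps + 1):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x in train:
            p = sigmoid(score(w, b, x))
            grad_w = [g + p * xi / n for g, xi in zip(grad_w, x)]
            grad_b += p / n
        p_target = sigmoid(score(w, b, target))
        grad_w = [g + (p_target - 1.0) * xi for g, xi in zip(grad_w, target)]
        grad_b += p_target - 1.0
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
        # Isolated once the target scores positive and all training points negative.
        if score(w, b, target) > 0 and all(score(w, b, x) < 0 for x in train):
            return step
    return max_steps  # never isolated within the budget

# Toy "training data": a 3x3 grid of 2-D feature vectors.
train = [(x, y) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)]
far_steps = isolation_steps(train, (5.0, 5.0))      # sample far from the training data
inside_steps = isolation_steps(train, (0.2, 0.1))   # sample inside the training cloud
print(far_steps, inside_steps)
```

In this toy setting the far-away sample is isolated quickly, while the sample inside the training cloud exhausts the step budget, so fewer steps would suggest a more out-of-distribution input.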
[152] GeoSDF: Plane Geometry Diagram Synthesis via Signed Distance Field
Chengrui Zhang, Maizhen Ning, Tianyi Liu, Zihao Zhou, Jie Sun, Qiufeng Wang, Kaizhu Huang
Main category: cs.CV
TL;DR: GeoSDF is a novel framework that uses Signed Distance Fields (SDF) to automatically generate accurate geometry diagrams by representing geometric elements and constraints as SDF functions, then optimizing them to produce mathematically precise diagrams.
Details
Motivation: Traditional manual tools require complex calculations, while existing model-based methods suffer from limited realism and accuracy. There's a need for automated diagram generation that maintains mathematical precision.
Method: Represent geometric elements (points, segments, circles) as SDF functions, construct constraint functions for geometric relationships, optimize these functions, and render the optimized field to generate diagrams with self-verification capability.
Result: Achieved 88.67% synthesis accuracy on IMO problems and over 95% accuracy in solving geometry problems (vs. 75% SOTA), demonstrating effectiveness on both high-school and IMO-level diagrams.
Conclusion: GeoSDF provides a sophisticated, accurate, and flexible approach for geometric diagram generation with self-verification, enabling wide applications in education and AI reasoning.
Abstract: Plane Geometry Diagram Synthesis has been a crucial task in computer graphics, with applications ranging from educational tools to AI-driven mathematical reasoning. Traditionally, we rely on manual tools (e.g., Matplotlib and GeoGebra) to generate precise diagrams, but this usually requires huge, complicated calculations. Recently, researchers have started to work on model-based methods (e.g., Stable Diffusion and GPT5) to automatically generate diagrams, saving operational cost but usually suffering from limited realism and insufficient accuracy. In this paper, we propose a novel framework, GeoSDF, to automatically generate diagrams efficiently and accurately with Signed Distance Field (SDF). Specifically, we first represent geometric elements (e.g., points, segments, and circles) in the SDF, then construct a series of constraint functions to represent geometric relationships. Next, we optimize those constructed constraint functions to get an optimized field of both elements and constraints. Finally, by rendering the optimized field, we can obtain the synthesized diagram. In our GeoSDF, we define a symbolic language to represent geometric elements and constraints, and our synthesized geometry diagrams can be self-verified in the SDF, ensuring both mathematical accuracy and visual plausibility. In experiments, through both qualitative and quantitative analysis, GeoSDF synthesized both normal high-school level and IMO-level geometry diagrams. We achieve 88.67% synthesis accuracy by human evaluation in the IMO problem set. Furthermore, we obtain a very high accuracy of solving geometry problems (over 95% while the current SOTA accuracy is around 75%) by leveraging our self-verification property. All of these demonstrate the advantage of GeoSDF, paving the way for more sophisticated, accurate, and flexible generation of geometric diagrams for a wide array of applications.
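The pipeline described above (elements as distance functions, relationships as constraints, then optimization) can be sketched in a few lines. This is a minimal illustration only: the function names, the squared-distance constraint loss, and the finite-difference gradient descent are assumptions made here, not GeoSDF's symbolic language or solver.

```python
import math

def sdf_circle(p, center, radius):
    """Signed distance from p to a circle: negative inside, zero on the curve."""
    return math.hypot(p[0] - center[0], p[1] - center[1]) - radius

def sdf_segment(p, a, b):
    """Distance from p to segment ab (unsigned, since a segment has no interior)."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * abx + (p[1] - a[1]) * aby) / (abx * abx + aby * aby)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(p[0] - (a[0] + t * abx), p[1] - (a[1] + t * aby))

def constraint_loss(p, center, radius, a, b):
    """Zero exactly when p lies on both the circle and the segment."""
    return sdf_circle(p, center, radius) ** 2 + sdf_segment(p, a, b) ** 2

def solve_point(p, center, radius, a, b, lr=0.1, steps=500, eps=1e-6):
    """Minimise the constraint loss with central-difference gradient descent."""
    x, y = p
    for _ in range(steps):
        gx = (constraint_loss((x + eps, y), center, radius, a, b)
              - constraint_loss((x - eps, y), center, radius, a, b)) / (2 * eps)
        gy = (constraint_loss((x, y + eps), center, radius, a, b)
              - constraint_loss((x, y - eps), center, radius, a, b)) / (2 * eps)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# Find the point where the unit circle meets the segment from (0, 0) to (2, 0).
circle = ((0.0, 0.0), 1.0)
segment = ((0.0, 0.0), (2.0, 0.0))
point = solve_point((0.5, 0.3), circle[0], circle[1], segment[0], segment[1])

# A residual near zero plays the role of a check that the constraints hold.
residual = constraint_loss(point, circle[0], circle[1], segment[0], segment[1])
print(point, residual)
```

Evaluating the same constraint functions on the final point is a simple stand-in for the self-verification property mentioned in the abstract.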
[153] Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Yoichi Sato
Main category: cs.CV
TL;DR: MLLMs struggle with multimodal deception detection in social interactions despite their strong visual-textual capabilities, revealing significant performance gaps and highlighting the need for better grounding of language in visual social cues.
Details
Motivation: As AI systems become more integrated into human lives, robust social intelligence including deception detection is crucial, but current MLLMs' capabilities in this domain remain unquantified.
Method: Introduces Multimodal Interactive Veracity Assessment (MIVA) task and a novel dataset from Werewolf game with synchronized video, text, and verifiable ground-truth labels for every statement.
Result: State-of-the-art MLLMs like GPT-4o show significant performance gaps, struggling to reliably distinguish truth from falsehood and failing to effectively ground language in visual social cues.
Conclusion: Current MLLMs have limitations in multimodal deception detection, being overly conservative and failing to integrate visual social cues effectively, highlighting the urgent need for novel approaches to build more perceptive and trustworthy AI systems.
Abstract: As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Yet their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video and text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
[154] MediQ-GAN: Quantum-Inspired GAN for High Resolution Medical Image Generation
Qingyue Jiao, Yongcan Tang, Jun Zhuang, Jason Cong, Yiyu Shi
Main category: cs.CV
TL;DR: MediQ-GAN is a quantum-inspired GAN that addresses medical imaging data scarcity through a dual-stream generator combining classical and quantum-inspired branches, achieving superior performance over state-of-the-art methods while avoiding barren plateaus.
Details
Motivation: Medical imaging datasets are often scarce, imbalanced, and privacy-constrained, making data augmentation essential. Classical generative models require extensive resources, while existing quantum methods face scalability limitations and barren plateaus.
Method: Proposes MediQ-GAN with prototype-guided skip connections and a dual-stream generator that fuses classical and quantum-inspired branches. Uses variational quantum circuits that preserve full-rank mappings and avoid rank collapse, theory-guided to balance expressivity with trainability.
Result: Outperforms state-of-the-art GANs and diffusion models across three medical imaging datasets. Provides first latent-geometry and rank-based analysis of quantum-inspired GANs. Validated on IBM hardware but remains hardware-agnostic.
Conclusion: MediQ-GAN offers a scalable and data-efficient framework for medical image generation and augmentation, addressing key limitations of both classical and existing quantum approaches while providing theoretical insights into quantum-inspired GAN performance.
Abstract: Machine learning-assisted diagnosis shows promise, yet medical imaging datasets are often scarce, imbalanced, and constrained by privacy, making data augmentation essential. Classical generative models typically demand extensive computational and sample resources. Quantum computing offers a promising alternative, but existing quantum-based image generation methods remain limited in scale and often face barren plateaus. We present MediQ-GAN, a quantum-inspired GAN with prototype-guided skip connections and a dual-stream generator that fuses classical and quantum-inspired branches. Its variational quantum circuits inherently preserve full-rank mappings, avoid rank collapse, and are theory-guided to balance expressivity with trainability. Beyond generation quality, we provide the first latent-geometry and rank-based analysis of quantum-inspired GANs, offering theoretical insight into their performance. Across three medical imaging datasets, MediQ-GAN outperforms state-of-the-art GANs and diffusion models. While validated on IBM hardware for robustness, our contribution is hardware-agnostic, offering a scalable and data-efficient framework for medical image generation and augmentation.
[155] Weakly Supervised Object Segmentation by Background Conditional Divergence
Hassan Baker, Matthew S. Emigh, Austin J. Brockmeier
Main category: cs.CV
TL;DR: A weakly supervised object segmentation method that uses image-level labels (presence/absence) and counterfactual background blending to train segmentation networks without pixel-level annotations.
Details
Motivation: Object segmentation in specialized domains lacks massive labeled data, and obtaining pixel-wise masks is expensive. Weak supervision via image-level labels is more accessible but provides less information.
Method: Uses weak supervision with image-wise object presence/absence labels. Creates counterfactual images by blending segmented objects into background-only images from different clusters. Training uses divergence loss between counterfactual and real images, plus supervised loss for background-only images.
Result: Successfully tested on sonar images, outperforming previous unsupervised segmentation baselines. Also works reasonably on natural images without requiring pretrained networks, generative networks, or adversarial critics.
Conclusion: The proposed weakly supervised segmentation method effectively leverages counterfactual background blending and achieves good performance across specialized domains and natural images, providing a practical alternative to fully supervised approaches.
Abstract: As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images and then, during learning, create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The code for this work can be found on GitHub at https://github.com/bakerhassan/WSOS.
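The counterfactual-background step can be pictured as mask-weighted blending of an object image with a background-only image. The sketch below assumes a soft mask in [0, 1] and simple per-pixel alpha blending; the abstract does not spell out the blending rule, so treat this as one plausible reading rather than the authors' method.

```python
def blend_counterfactual(image, mask, background):
    """Paste the masked object from `image` onto `background`, pixel by pixel.

    All three arguments are equally sized 2-D lists of floats. A mask value of 1
    keeps the source pixel; a value of 0 keeps the background pixel.
    """
    return [
        [m * x + (1.0 - m) * b for x, m, b in zip(img_row, mask_row, bg_row)]
        for img_row, mask_row, bg_row in zip(image, mask, background)
    ]

# Toy 2x2 grayscale example (values are made up for illustration).
image = [[0.9, 0.2], [0.8, 0.1]]
mask = [[1.0, 0.0], [0.5, 0.0]]          # hypothetical output of a masking network
background = [[0.3, 0.3], [0.4, 0.4]]    # background-only image from a target cluster
counterfactual = blend_counterfactual(image, mask, background)
print(counterfactual)
```

In training, a batch of such blended images would be compared against real object images from the target cluster using a divergence; that part is not sketched here.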
[156] Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios
Siyue Yao, Mingjie Sun, Eng Gee Lim, Ran Yi, Baojiang Zhong, Moncef Gabbouj
Main category: cs.CV
TL;DR: Crucial-Diff is a domain-agnostic framework that synthesizes crucial training samples to address data scarcity issues by targeting downstream models’ weaknesses, achieving state-of-the-art performance on MVTec and polyp datasets.
Details
Motivation: Data scarcity in domains like medical, industrial, and autonomous driving causes model overfitting and dataset imbalance. Existing generative methods produce repetitive or simplistic synthetic samples that don't target model weaknesses and require separate training for different objects.
Method: Uses two key modules: Scene Agnostic Feature Extractor (SAFE) with unified feature extractor to capture target information, and Weakness Aware Sample Miner (WASM) that generates hard-to-detect samples using feedback from downstream model detection results, then fuses them with SAFE output.
Result: Achieved pixel-level AP of 83.63% and F1-MAX of 78.12% on MVTec dataset, and mIoU of 81.64% and mDice of 87.69% on polyp dataset, demonstrating superior performance over existing methods.
Conclusion: Crucial-Diff effectively addresses data scarcity by generating diverse, high-quality training data that specifically targets downstream model weaknesses, outperforming existing approaches while being computationally efficient through unified feature extraction.
Abstract: The scarcity of data in various scenarios, such as medical, industrial, and autonomous driving, leads to model overfitting and dataset imbalance, thus hindering effective detection and segmentation performance. Existing studies employ generative models to synthesize more training samples to mitigate data scarcity. However, these synthetic samples are repetitive or simplistic and fail to provide “crucial information” that targets the downstream model’s weaknesses. Additionally, these methods typically require separate training for different objects, leading to computational inefficiencies. To address these issues, we propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our method integrates two key modules. The Scene Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to capture target information. The Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the detection results of the downstream model, which are then fused with the output of the SAFE module. Together, our Crucial-Diff framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On the polyp dataset, Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code is publicly available at https://github.com/JJessicaYao/Crucial-diff.
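For readers less familiar with the segmentation metrics quoted above, the snippet below computes IoU and Dice for a single pair of binary masks using their standard definitions. The flat-list mask format and the example values are assumptions for illustration; how the paper averages these scores into mIoU and mDice is not stated in the abstract.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equally sized flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as a perfect match
    return intersection / union, 2 * intersection / (pred_area + target_area)

pred = [1, 1, 0, 0, 1, 0]     # made-up predicted mask
target = [1, 0, 0, 0, 1, 1]   # made-up ground-truth mask
iou, dice = iou_and_dice(pred, target)
print(iou, dice)
```

For a single mask pair the two scores are linked by Dice = 2·IoU / (1 + IoU), which is why Dice is never lower than IoU.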
[157] Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey
Jiahui Zhang, Yuelei Li, Anpei Chen, Muyu Xu, Kunhao Liu, Jianyuan Wang, Xiao-Xiao Long, Hanxue Liang, Zexiang Xu, Hao Su, Christian Theobalt, Christian Rupprecht, Andrea Vedaldi, Kaichen Zhou, Paul Pu Liang, Shijian Lu, Fangneng Zhan
Main category: cs.CV
TL;DR: Survey of feed-forward deep learning methods for 3D reconstruction and view synthesis, covering representations like NeRF, 3DGS, and applications in AR/VR, digital twins, robotics.
Details
Motivation: Traditional 3D reconstruction methods rely on computationally intensive iterative optimization, limiting real-world applicability. Feed-forward deep learning approaches enable faster, more generalizable solutions.
Method: Comprehensive review and taxonomy of feed-forward techniques based on representation architectures including point clouds, 3D Gaussian Splatting, Neural Radiance Fields, and others.
Result: Survey covers key tasks like pose-free reconstruction, dynamic 3D reconstruction, 3D-aware image/video synthesis, and applications in digital humans, SLAM, robotics. Includes dataset reviews and evaluation protocols.
Conclusion: Feed-forward approaches have potential to advance state of the art in 3D vision, with open research challenges and promising future directions identified.
Abstract: 3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.
[158] A Practical Investigation of Spatially-Controlled Image Generation with Transformers
Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, Shifeng Zhang, Sarah Parisot
Main category: cs.CV
TL;DR: This paper provides a systematic comparison of transformer-based spatially-controlled image generation methods, establishing control token prefilling as a strong baseline and investigating sampling enhancements like classifier-free guidance and softmax truncation.
Details
Motivation: To address the lack of fair scientific comparison in spatially-controlled image generation research, where differing training data, model architectures, and generation paradigms make it difficult to isolate performance factors.
Method: Performed controlled experiments on ImageNet across diffusion-based, flow-based, and autoregressive models, testing control token prefilling, classifier-free guidance extension to control, softmax truncation, and adapter-based approaches.
Result: Control token prefilling proved to be a simple, general, and performant baseline. Sampling enhancements like extended classifier-free guidance and softmax truncation significantly improved control-generation consistency. Adapter approaches maintained generation quality with limited data but underperformed in consistency.
Conclusion: The study provides clear takeaways for developing transformer-based spatially-controlled generation systems, clarifying literature gaps and establishing practical guidelines for different generation paradigms.
Abstract: Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via e.g. edge maps, poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate “forgetting” and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency.
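The two sampling-time enhancements named above can be sketched for a single autoregressive decoding step. The guidance rule is the standard classifier-free guidance extrapolation; applying it between logits computed with and without the control signal, and implementing softmax truncation as top-k filtering, are assumptions made here since the abstract does not give the exact formulation. The logit values are invented.

```python
import math

def guide_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance: extrapolate from unconditional towards conditional logits."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def truncated_softmax(logits, k):
    """Softmax restricted to the k largest logits; every other token gets probability 0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    top = max(logits[i] for i in keep)
    weights = {i: math.exp(logits[i] - top) for i in keep}  # shift for numerical stability
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(logits))]

# Next-token logits from two forward passes of a hypothetical AR image model.
with_control = [2.0, 0.5, 0.1, -1.0]      # control tokens (e.g. an edge map) provided
without_control = [1.0, 0.8, 0.2, -0.5]   # control tokens dropped

guided = guide_logits(with_control, without_control, scale=2.0)
probs = truncated_softmax(guided, k=2)
print(guided, probs)
```

A scale of 1.0 recovers the conditional logits unchanged, while larger scales push sampling further towards tokens that the control signal favours.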
[159] Label tree semantic losses for rich multi-class medical image segmentation
Junwen Wang, Oscar MacCormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren
Main category: cs.CV
TL;DR: Proposes tree-based semantic loss functions that leverage hierarchical label organization for medical image segmentation, achieving state-of-the-art performance on brain MRI and neurosurgical imaging tasks.
Details
Motivation: Current medical image segmentation methods penalize all errors equally, failing to exploit inter-class semantics, which becomes problematic as label richness increases with subtly different classes.
Method: Two tree-based semantic loss functions that utilize hierarchical label organization, integrated with sparse annotation training approaches for background-free scenarios.
Result: Achieves state-of-the-art performance on head MRI for whole brain parcellation with full supervision and neurosurgical hyperspectral imaging for scene understanding with sparse annotations.
Conclusion: The proposed hierarchical semantic loss functions effectively leverage label relationships to improve medical image segmentation, particularly beneficial for rich label spaces with subtle class distinctions.
Abstract: Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. This becomes particularly problematic as the cardinality and richness of labels increase to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks, namely head MRI for whole brain parcellation (WBP) with full supervision and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results demonstrate that our proposed method reaches state-of-the-art performance in both cases.
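To show how a label hierarchy can make some errors cheaper than others, the sketch below implements one simple tree-aware cross-entropy: the probability of an internal node is the sum of its leaves' probabilities, and the loss adds a log term for the true leaf and each of its ancestors. This is an illustrative construction, not necessarily either of the two losses proposed in the paper, and the three-leaf tree is a hypothetical toy.

```python
import math

# Hypothetical toy label tree: each leaf maps to its ancestors (root excluded).
ANCESTORS = {
    "vein": ["vessel"],
    "artery": ["vessel"],
    "tumour": ["tissue"],
}

def node_probability(leaf_probs, node):
    """Probability of a node = total probability of the leaves at or below it."""
    return sum(p for leaf, p in leaf_probs.items()
               if leaf == node or node in ANCESTORS[leaf])

def tree_cross_entropy(leaf_probs, true_leaf):
    """Sum of -log p(node) over the true leaf and each of its ancestors."""
    nodes = [true_leaf] + ANCESTORS[true_leaf]
    return sum(-math.log(node_probability(leaf_probs, n)) for n in nodes)

# Two predictions that give the true class "vein" the same probability.
near_miss = {"vein": 0.2, "artery": 0.7, "tumour": 0.1}  # wrong leaf, correct parent
far_miss = {"vein": 0.2, "artery": 0.1, "tumour": 0.7}   # wrong leaf, wrong parent

near_loss = tree_cross_entropy(near_miss, "vein")
far_loss = tree_cross_entropy(far_miss, "vein")
print(near_loss, far_loss)
```

Plain cross-entropy would score both predictions identically, since each assigns 0.2 to the true leaf; the tree-aware version penalises the prediction that also lands in the wrong branch more heavily.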
[160] Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras
Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau
Main category: cs.CV
TL;DR: Talk2Event is the first large-scale benchmark for language-driven object grounding in event-based perception, with EventRefer framework using Mixture of Event-Attribute Experts for multi-attribute fusion.
Details
Motivation: Event cameras offer microsecond latency and motion blur robustness, but connecting asynchronous event streams to human language remains challenging for understanding dynamic environments.
Method: Proposed EventRefer framework with Mixture of Event-Attribute Experts (MoEE) that dynamically fuses multi-attribute representations (appearance, status, relation to viewer, relation to other objects) for language-driven object grounding.
Result: Achieved consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings, adapting to different modalities and scene dynamics.
Conclusion: The dataset and approach establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
Abstract: Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes – appearance, status, relation to viewer, and relation to other objects – bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
[161] WXSOD: A Benchmark for Robust Salient Object Detection in Adverse Weather Conditions
Quan Chen, Xiong Yang, Bolun Zheng, Rongfeng Lu, Xiaokai Yang, Qianyu Zhang, Yu Liu, Xiaofei Zhou
Main category: cs.CV
TL;DR: This paper introduces WXSOD, a new dataset for salient object detection in weather-affected environments, and proposes WFANet, a two-branch network that integrates weather prediction with saliency detection to improve performance in noisy conditions.
Details
Motivation: Most existing SOD methods perform well in clean natural scenes but struggle with weather noise due to lack of appropriate datasets with pixel-wise annotations for weather-affected environments.
Method: Proposes Weather-aware Feature Aggregation Network (WFANet) with two branches: weather prediction branch for weather-related features and saliency detection branch that fuses semantic features with weather features.
Result: WFANet achieves superior performance compared to 17 existing SOD methods on the proposed WXSOD dataset, which contains 14,945 RGB images with diverse weather noise and annotations.
Conclusion: The WXSOD dataset bridges the gap in weather-affected SOD research, and WFANet demonstrates effective integration of weather awareness for improved salient object detection in complex environments.
Abstract: Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise, and tend to leverage multi-modal information (e.g., depth and infrared) to enhance accuracy. However, few studies are concerned with the damage of weather noise on SOD performance due to the lack of datasets with pixel-wise annotations. To bridge this gap, this paper introduces a novel Weather-eXtended Salient Object Detection (WXSOD) dataset. It consists of 14,945 RGB images with diverse weather noise, along with the corresponding ground truth annotations and weather labels. To verify algorithm generalization, WXSOD contains two test sets, i.e., a synthesized test set and a real test set. The former is generated by adding weather noise to clean images, while the latter contains real-world weather noise. Based on WXSOD, we propose an efficient baseline, termed Weather-aware Feature Aggregation Network (WFANet), which adopts a fully supervised two-branch architecture. Specifically, the weather prediction branch mines weather-related deep features, while the saliency detection branch fuses semantic features extracted from the backbone with weather features for SOD. Comprehensive comparisons against 17 SOD methods show that our WFANet achieves superior performance on WXSOD. The code and benchmark results will be made publicly available at https://github.com/C-water/WXSOD
[162] CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution
Yulin Tong, Fengzong Zhang, Haiqin Cheng
Main category: cs.CV
TL;DR: CWSSNet is a hyperspectral image classification framework that combines 3D spectral-spatial features with wavelet convolution, achieving superior performance in ground object classification with limited training data.
Details
Motivation: Hyperspectral images have rich spectral information but suffer from feature redundancy due to numerous bands, high dimensionality, and spectral mixing, requiring improved classification methods.
Method: Proposed CWSSNet framework integrating 3D spectral-spatial features and wavelet convolution using multiscale convolutional attention module and multi-band decomposition in wavelet domain.
Result: Achieved 74.50% mIoU, 82.73% mAcc, and 84.94% mF1 in Yugan County, with highest IoU for water bodies, vegetation, and bare land classification. Maintained reliable performance with limited training time increase at 70% training set proportion.
Conclusion: CWSSNet demonstrates good robustness and reliable performance under small-sample training conditions, breaking through traditional method bottlenecks for hyperspectral image classification.
Abstract: Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information using a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50%, 82.73%, and 84.94% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classification of water bodies, vegetation, and bare land, demonstrating good robustness. Additionally, when the training set proportion was 70%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.
[163] 3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data
Nojod M. Alotaibi, Areej M. Alhothali, Manar S. Ali
Main category: cs.CV
TL;DR: A unified pipeline using Vision Transformers for 3D region embeddings from sMRI data and Graph Neural Networks for classification achieves 81.51% accuracy in detecting Major Depressive Disorder.
Details
Motivation: Automated MDD detection using sMRI and deep learning can improve diagnostic accuracy and enable early intervention, but existing methods using voxel-level features or predefined brain atlases limit complex pattern capture.
Method: Developed a unified pipeline with Vision Transformers for 3D region embeddings and Graph Neural Networks for classification, testing two region definition strategies: atlas-based (predefined brain atlases) and cube-based (ViT-trained 3D patches), with cosine similarity graphs modeling interregional relationships.
Result: Achieved 81.51% accuracy, 85.94% sensitivity, 76.36% specificity, 80.88% precision, and 83.33% F1-score using stratified 10-fold cross-validation on REST-meta-MDD dataset. Atlas-based models consistently outperformed cube-based approach.
Conclusion: The proposed ViT-GNN framework effectively detects MDD from sMRI data, with atlas-based methods showing superior performance, highlighting the importance of domain-specific anatomical priors in MDD detection.
Abstract: Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and a Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) a cube-based method by which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 81.51% accuracy, 85.94% sensitivity, 76.36% specificity, 80.88% precision, and 83.33% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
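The cosine similarity graph mentioned in this abstract can be illustrated with a minimal sketch. It assumes each brain region is already represented by a non-zero embedding vector and that an edge is kept when the similarity reaches a fixed threshold. The threshold rule, its default value of 0.5, and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_graph(embeddings, threshold=0.5):
    """Build a symmetric 0/1 adjacency matrix over region embeddings.

    Regions i and j are connected when their cosine similarity is at
    least `threshold` (a hypothetical rule chosen for illustration).
    """
    n = len(embeddings)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                adjacency[i][j] = 1
                adjacency[j][i] = 1
    return adjacency


# Three toy "region embeddings": the first two point in nearly the same direction.
regions = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
graph = similarity_graph(regions)
```

In this toy case only the first two regions end up connected; a GNN would then take the adjacency matrix together with the embeddings as node features.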
[164] Research on Expressway Congestion Warning Technology Based on YOLOv11-DIoU and GRU-Attention
Tong Yulin, Liang Xuechen
Main category: cs.CV
TL;DR: An integrated framework combining optimized vehicle detection (YOLOv11-DIoU + enhanced DeepSort) and congestion prediction (GRU-Attention model) for expressway traffic management, achieving high accuracy in occlusion handling and 10-minute advance congestion warnings.
Details
Motivation: Expressway traffic congestion reduces travel efficiency, and existing systems have flaws in vehicle perception under occlusion and long-sequence dependency loss in congestion forecasting.
Method: Optimized YOLOv11 with DIoU Loss for better occlusion handling, enhanced DeepSort with fused motion and appearance distances, and built GRU-Attention model for congestion precursor capture using flow, density, and speed data.
Result: YOLOv11-DIoU achieved 95.7% mAP (6.5pp higher), DeepSort reached 93.8% MOTA (11.3pp higher), GRU-Attention achieved 99.7% test accuracy with ≤1 minute time error in 10-minute advance warnings, and 95% warning accuracy in validation.
Conclusion: The framework provides quantitative support for expressway congestion control with promising intelligent transportation applications, demonstrating stable performance in high-flow scenarios.
Abstract: Expressway traffic congestion severely reduces travel efficiency and hinders regional connectivity. Existing “detection-prediction” systems have critical flaws: low vehicle perception accuracy under occlusion and loss of long-sequence dependencies in congestion forecasting. This study proposes an integrated technical framework to resolve these issues. For traffic flow perception, two baseline algorithms were optimized. Traditional YOLOv11 was upgraded to YOLOv11-DIoU by replacing GIoU Loss with DIoU Loss, and DeepSort was improved by fusing Mahalanobis (motion) and cosine (appearance) distances. Experiments on Chang-Shen Expressway videos showed YOLOv11-DIoU achieved 95.7% mAP (6.5 percentage points higher than baseline) with 5.3% occlusion miss rate. DeepSort reached 93.8% MOTA (11.3 percentage points higher than SORT) with only 4 ID switches. Using the Greenberg model (for 10-15 vehicles/km high-density scenarios), speed and density showed a strong negative correlation (r=-0.97), conforming to traffic flow theory. For congestion warning, a GRU-Attention model was built to capture congestion precursors. Trained for 300 epochs with flow, density, and speed, it achieved 99.7% test accuracy (7-9 percentage points higher than traditional GRU). In 10-minute advance warnings for 30-minute congestion, time error was ≤ 1 minute. Validation with an independent video showed 95% warning accuracy, over 90% spatial overlap of congestion points, and stable performance in high-flow (>5 vehicles/second) scenarios. This framework provides quantitative support for expressway congestion control, with promising intelligent transportation applications.
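The DIoU loss named in this abstract has a widely used closed form: 1 - IoU + d²/c², where d is the distance between the two box centers and c is the diagonal of the smallest box enclosing both. The following is a minimal plain-Python sketch of that formula, assuming axis-aligned boxes given as (x1, y1, x2, y2) with positive area. The function name `diou_loss` is illustrative and is not taken from the paper's code.

```python
def diou_loss(box_a, box_b):
    """Distance-IoU loss for two axis-aligned boxes (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 for both boxes, so the union area and
    the enclosing-box diagonal are strictly positive.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter)

    # Squared distance between the two box centers.
    center_dist_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose_diag_sq = enclose_w ** 2 + enclose_h ** 2

    return 1.0 - iou + center_dist_sq / enclose_diag_sq


same = diou_loss((0, 0, 1, 1), (0, 0, 1, 1))
apart = diou_loss((0, 0, 1, 1), (2, 0, 3, 1))
```

Unlike a plain IoU loss, the center-distance term still varies with box position when two boxes do not overlap at all, which is one commonly cited reason for preferring it with occluded or tightly packed objects.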
[165] FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World
Zengli Luo, Canlong Zhang, Xiaochun Lu, Zhixin Li
Main category: cs.CV
TL;DR: FitPro is an open-world interactive zero-shot text-based pedestrian retrieval framework that addresses generalization and semantic understanding limitations through three components: Feature Contrastive Decoding, Incremental Semantic Mining, and Query-aware Hierarchical Retrieval.
Details
Motivation: Existing text-based pedestrian retrieval methods struggle with limited model generalization and insufficient semantic understanding in open-world interactive scenarios, particularly in zero-shot settings.
Method: FitPro uses three key components: 1) Feature Contrastive Decoding for generating high-quality structured descriptions from denoised images, 2) Incremental Semantic Mining for holistic pedestrian representations from multi-view observations, and 3) Query-aware Hierarchical Retrieval for dynamic optimization based on query types.
Result: Extensive experiments on five public datasets and two evaluation protocols show that FitPro significantly overcomes generalization limitations and semantic modeling constraints of existing methods in interactive retrieval.
Conclusion: FitPro paves the way for practical deployment of text-based pedestrian retrieval systems by effectively addressing open-world interactive retrieval challenges with enhanced semantic comprehension and cross-scene adaptability.
Abstract: Text-based Pedestrian Retrieval (TPR) deals with retrieving specific target pedestrians in visual scenes according to natural language descriptions. Although existing methods have achieved progress under constrained settings, interactive retrieval in the open-world scenario still suffers from limited model generalization and insufficient semantic understanding. To address these challenges, we propose FitPro, an open-world interactive zero-shot TPR framework with enhanced semantic comprehension and cross-scene adaptability. FitPro has three innovative components: Feature Contrastive Decoding (FCD), Incremental Semantic Mining (ISM), and Query-aware Hierarchical Retrieval (QHR). The FCD integrates prompt-guided contrastive decoding to generate high-quality structured pedestrian descriptions from denoised images, effectively alleviating semantic drift in zero-shot scenarios. The ISM constructs holistic pedestrian representations from multi-view observations to achieve global semantic modeling in multi-turn interactions, thereby improving robustness against viewpoint shifts and fine-grained variations in descriptions. The QHR dynamically optimizes the retrieval pipeline according to query types, enabling efficient adaptation to multi-modal and multi-view inputs. Extensive experiments on five public datasets and two evaluation protocols demonstrate that FitPro significantly overcomes the generalization limitations and semantic modeling constraints of existing methods in interactive retrieval, paving the way for practical deployment.
[166] InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang
Main category: cs.CV
TL;DR: InternSVG is a unified multimodal framework for SVG understanding, editing, and generation that leverages MLLMs to overcome challenges in fragmented datasets and structural complexity through comprehensive datasets, benchmarks, and specialized training strategies.
Details
Motivation: Address challenges in general SVG modeling including fragmented datasets, limited transferability across tasks, and difficulty handling structural complexity by leveraging MLLMs' strong transfer and generalization capabilities.
Method: Propose InternSVG family with SAgoge dataset (largest multimodal SVG dataset), SArena benchmark, and InternSVG model using SVG-specific tokens, subword-based embedding initialization, and two-stage training from static SVGs to complex animations.
Result: InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts on SArena benchmark and prior benchmarks, demonstrating positive transfer and improved overall performance.
Conclusion: The unified formulation enables positive transfer across SVG tasks, with InternSVG establishing state-of-the-art performance in SVG understanding, editing, and generation through comprehensive data resources and specialized model architecture.
Abstract: General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
[167] Real-Time Neural Video Compression with Unified Intra and Inter Coding
Hui Xiang, Yifan Bian, Li Li, Jingran Wu, Xianguo Zhang, Dong Liu
Main category: cs.CV
TL;DR: Proposes a neural video compression framework with unified intra/inter coding and simultaneous two-frame compression to address limitations like disocclusion handling and error propagation in existing NVC schemes.
Details
Motivation: Existing neural video compression schemes have limitations in handling disocclusion, new content, and interframe error propagation. The authors aim to eliminate these limitations by borrowing ideas from classic video coding that allow intra coding within inter-coded frames.
Method: Developed an NVC framework with unified intra and inter coding where every frame is processed by a single model trained to perform intra/inter coding adaptively. Also proposed simultaneous two-frame compression to exploit interframe redundancy both forwardly and backwardly.
Result: The scheme outperforms DCVC-RT by an average of 12.1% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances.
Conclusion: The proposed framework successfully addresses key limitations of existing NVC schemes by incorporating intra coding within inter-coded frames and bidirectional interframe redundancy exploitation, achieving superior compression efficiency while maintaining real-time performance.
Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
[168] WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Sami Azam, Asif Karim, Jemima Beissbarth, Amanda Leach
Main category: cs.CV
TL;DR: WeCKD introduces a chain-based knowledge distillation framework where models learn progressively from predecessors in a sequence, enabling effective learning with minimal supervision and outperforming traditional KD methods.
Details
Motivation: Traditional KD suffers from knowledge degradation, inefficient supervision, and reliance on strong teacher models or large labeled datasets, which limits practical applications.
Method: Proposes a weakly-supervised chain-based KD network where models form a progressive distillation chain, each learning from its predecessor and refining knowledge before passing forward, trained on only fractions of datasets.
Result: Outperforms existing methods across six imaging datasets with cumulative accuracy gains up to +23% over single backbone models trained on limited data.
Conclusion: WeCKD demonstrates effective knowledge transfer with minimal supervision, showing strong generalization and potential for real-world adoption in medical imaging applications.
Abstract: Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning and addresses the limitations of one-step KD. Each model in the chain is trained on only a fraction of the dataset and shows that effective learning can be achieved with minimal supervision. Extensive evaluation on six imaging datasets across otoscopic, microscopic, and magnetic resonance imaging modalities shows that it generalizes and outperforms existing methods. Furthermore, the proposed distillation chain resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.
[169] CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection
Huiming Yang, Wenzhuo Liu, Yicheng Qiao, Lei Yang, Xianzhu Zeng, Li Wang, Zhiwei Li, Zijian Zeng, Zhiying Jiang, Huaping Liu, Kunfeng Wang
Main category: cs.CV
TL;DR: CrossRay3D is a sparse multi-modal 3D detector that improves token representation quality through Ray-Aware Supervision and Class-Balanced Supervision, achieving state-of-the-art performance on nuScenes while being faster and more robust to missing sensor data.
Details
Motivation: Existing sparse detectors overlook token representation quality, leading to sub-optimal foreground quality and limited performance. The paper identifies geometric structure preservation and class distribution as key factors for improving sparse detector performance.
Method: Proposes Sparse Selector (SS) with two core modules: Ray-Aware Supervision (RAS) to preserve geometric information, and Class-Balanced Supervision to adaptively reweight class semantics. Also introduces Ray Positional Encoding to address LiDAR-image modality distribution differences.
Result: Achieves state-of-the-art performance on nuScenes benchmark with 72.4 mAP and 74.7 NDS, running 1.84x faster than other leading methods. Demonstrates strong robustness to partial or complete missing LiDAR or camera data.
Conclusion: CrossRay3D effectively addresses token representation quality in sparse detectors through geometric structure preservation and balanced class semantics, achieving superior performance, efficiency, and robustness compared to existing methods.
Abstract: The sparse cross-modality detector offers more advantages than its counterpart, the Bird’s-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving it with a sub-optimal foreground quality and limited performance. In this paper, we identify the preserved geometric structure and the class distribution as the keys to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core modules of SS are Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling. As a result, SS outperforms other sparse multi-modal detectors in the representation of tokens. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image. Finally, we integrate the aforementioned modules into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84x faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.
[170] Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan
Main category: cs.CV
TL;DR: Edit-R1 is a policy optimization framework for instruction-based image editing that uses Diffusion Negative-aware Finetuning and MLLM-based rewards to overcome overfitting and achieve state-of-the-art performance.
Details
Motivation: Models trained via supervised fine-tuning often overfit to annotated patterns, limiting their ability to generalize beyond training distributions in instruction-based image editing.
Method: Uses Diffusion Negative-aware Finetuning (DiffusionNFT) for policy optimization and employs Multimodal Large Language Models as training-free reward models with low-variance group filtering to reduce scoring noise.
Result: UniWorld-V2 trained with Edit-R1 achieves state-of-the-art results on the ImgEdit (4.49) and GEdit-Bench (7.83) benchmarks. The framework is model-agnostic and applicable to diverse base models.
Conclusion: Edit-R1 provides an effective post-training framework that addresses overfitting and reward modeling challenges in instruction-based image editing, demonstrating wide applicability and superior performance.
Abstract: Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available to support further research.
[171] Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation
Stefan M. Fischer, Johannes Kiechle, Laura Daza, Lina Felsner, Richard Osuala, Daniel M. Lang, Karim Lekadir, Jan C. Peeken, Julia A. Schnabel
Main category: cs.CV
TL;DR: Progressive Growing of Patch Size is a curriculum learning method for 3D medical image segmentation that gradually increases patch size during training, improving class balance and accelerating convergence.
Details
Motivation: To address class imbalance issues in 3D medical image segmentation and accelerate training convergence by progressively increasing patch size during model training.
Method: Curriculum learning approach that progressively increases patch size during training. Evaluated in two modes: resource-efficient (faster training) and performance (better accuracy).
Result: Resource-efficient mode reduces training time to 44% while matching baseline Dice scores. Performance mode achieves 1.28% relative mean Dice score improvement while reducing training time to 89%. Benefits are particularly strong for imbalanced tasks like lesion segmentation.
Conclusion: The progressive patch size curriculum is a simple yet effective strategy that substantially improves both segmentation performance and training efficiency across diverse architectures and medical imaging tasks.
Abstract: In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size during model training, resulting in an improved class balance for smaller patch sizes and accelerated convergence of the training process. We evaluate our curriculum approach in two settings: a resource-efficient mode and a performance mode, both regarding Dice score performance and computational costs across 15 diverse and popular 3D medical image segmentation tasks. The resource-efficient mode matches the Dice score performance of the conventional constant patch size sampling baseline with a notable reduction in training time to only 44%. The performance mode improves upon constant patch size segmentation results, achieving a statistically significant relative mean performance gain of 1.28% in Dice Score. Remarkably, across all 15 tasks, our proposed performance mode manages to surpass the constant patch size baseline in Dice Score performance, while simultaneously reducing training time to only 89%. The benefits are particularly pronounced for highly imbalanced tasks such as lesion segmentation tasks. Rigorous experiments demonstrate that our performance mode not only improves mean segmentation performance but also reduces performance variance, yielding more trustworthy model comparison. Furthermore, our findings reveal that the proposed curriculum sampling is not tied to a specific architecture but represents a broadly applicable strategy that consistently boosts performance across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In summary, we show that this simple yet elegant transformation on input data substantially improves both Dice Score performance and training runtime, while being compatible across diverse segmentation backbones.
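The curriculum described in [171] can be illustrated with a small schedule function. This is a minimal sketch, not the authors' implementation: the abstract does not specify how the patch size grows, so the linear growth, the function name `patch_size_schedule`, and all default values (minimum and maximum edge length, rounding multiple) are assumptions made for illustration.

```python
def patch_size_schedule(step, total_steps, min_size=32, max_size=128, multiple=16):
    """Return the patch edge length to sample at a given training step.

    Hypothetical schedule: grows linearly from min_size to max_size and is
    rounded down to a multiple of `multiple`, so the patch remains divisible
    by the downsampling factor of a U-Net style network. All defaults are
    illustrative assumptions, not values from the paper.
    """
    if total_steps <= 1:
        return max_size
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    raw = min_size + frac * (max_size - min_size)
    size = int(raw // multiple) * multiple
    return max(min_size, min(size, max_size))


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 99):
        print(step, patch_size_schedule(step, total_steps=100))
```

A training loop would call the function once per iteration and crop patches of the returned size. Early iterations then see small patches, which the abstract credits with better class balance, and later iterations see the full-size patch used by the constant-size baseline.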
[172] A Quantitative Evaluation Framework for Explainable AI in Semantic Segmentation
Reem Hammoud, Abdul Karim Gizzini, Ali J. Ghandour
Main category: cs.CV
TL;DR: Proposes a quantitative evaluation framework for explainable AI (XAI) methods in semantic segmentation to address limitations of subjective qualitative assessments.
Details
Motivation: Current XAI evaluation for semantic segmentation is limited and relies on subjective qualitative methods that cannot ensure explanation accuracy or stability, creating a need for objective quantitative assessment.
Method: Develops a comprehensive quantitative evaluation framework that integrates pixel-level evaluation strategies with carefully designed metrics to account for spatial and contextual task complexities in semantic segmentation.
Result: Simulation results using class activation mapping (CAM)-based XAI schemes demonstrate the framework’s efficiency, robustness, and reliability in providing fine-grained interpretability insights.
Conclusion: The proposed methodology advances the development of transparent, trustworthy, and accountable semantic segmentation models by enabling rigorous quantitative evaluation of XAI approaches.
Abstract: Ensuring transparency and trust in artificial intelligence (AI) models is essential as they are increasingly deployed in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge; however, the rigorous evaluation of XAI methods remains vital for balancing the trade-offs between model complexity, predictive performance, and interpretability. While substantial progress has been made in evaluating XAI for classification tasks, strategies tailored to semantic segmentation remain limited. Moreover, objectively assessing XAI approaches is difficult, since qualitative visual explanations provide only preliminary insights. Such qualitative methods are inherently subjective and cannot ensure the accuracy or stability of explanations. To address these limitations, this work introduces a comprehensive quantitative evaluation framework for assessing XAI in semantic segmentation, accounting for both spatial and contextual task complexities. The framework systematically integrates pixel-level evaluation strategies with carefully designed metrics to yield fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings advance the development of transparent, trustworthy, and accountable semantic segmentation models.
[173] DRIP: Dynamic patch Reduction via Interpretable Pooling
Yusen Peng, Sachin Kumar
Main category: cs.CV
TL;DR: DRIP reduces computational costs in vision-language models by dynamically merging image tokens in deeper layers while maintaining performance.
Details
Motivation: Vision-language models require expensive large-scale pretraining, creating efficiency concerns that discourage training from scratch.
Method: Dynamic patch Reduction via Interpretable Pooling (DRIP) adapts to input images and dynamically merges tokens in deeper visual encoder layers.
Result: Significant GFLOP reduction while maintaining comparable classification/zero-shot performance on ImageNet and CLIP pretraining; validated on large biology dataset.
Conclusion: DRIP provides an effective approach to reduce computational costs in vision-language model pretraining without sacrificing performance.
Abstract: Recent advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, because large-scale pretraining is expensive, efficiency concerns have discouraged researchers from attempting to pretrain a vision-language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
[174] FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu
Main category: cs.CV
TL;DR: FreeArt3D is a training-free framework for articulated 3D object generation that repurposes pre-trained static 3D diffusion models as shape priors, extending Score Distillation Sampling to handle articulation as an additional dimension.
Details
Motivation: Existing approaches for articulated 3D objects either require dense-view supervision or produce coarse geometry without surface texture, while training native 3D diffusion models for articulated objects faces significant challenges due to limited data.
Method: FreeArt3D extends Score Distillation Sampling (SDS) into 3D-to-4D domain by treating articulation as an additional generative dimension. It jointly optimizes geometry, texture, and articulation parameters using only a few images captured in different articulation states, without requiring task-specific training.
Result: The method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. It completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.
Conclusion: FreeArt3D provides an effective training-free solution for articulated 3D generation by leveraging existing static 3D diffusion models, achieving superior results without the need for large-scale articulated datasets or specialized training.
Abstract: Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object’s geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility. Please check our website for more details: https://czzzzh.github.io/FreeArt3D
[175] Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Guozheng Zheng, Jian Guan, Mingjie Xie, Xuanjia Zhao, Congyi Fan, Shiheng Zhang, Pengming Feng
Main category: cs.CV
TL;DR: DPHR is a dual-level reweighting strategy for cross-view geo-localization that addresses hard negatives through sample-level difficulty assessment and batch-level progressive weighting to improve training stability and performance.
Details
Motivation: Cross-view geo-localization faces severe viewpoint gaps and hard negatives. Existing static weighting methods are sensitive to distribution shifts and prone to overemphasizing difficult samples too early, causing noisy gradients and unstable convergence.
Method: Dual-level Progressive Hardness-aware Reweighting (DPHR) with: 1) Sample-level Ratio-based Difficulty-Aware (RDA) module for fine-grained negative weighting, and 2) Batch-level Progressive Adaptive Loss Weighting (PALW) mechanism that uses training progress to attenuate noisy gradients early and enhance hard-negative mining later.
Result: Experiments on University-1652 and SUES-200 benchmarks show DPHR achieves consistent improvements over state-of-the-art methods, demonstrating effectiveness and robustness.
Conclusion: DPHR provides an effective solution for cross-view geo-localization by addressing hard negatives through progressive difficulty-aware reweighting, improving training stability and final performance.
Abstract: Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
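The general idea in [175] of weighting negatives uniformly early in training and emphasizing hard negatives later can be sketched as follows. This is an illustrative stand-in, not the authors' RDA or PALW formulas, which the abstract does not give: the function name `progressive_negative_weights`, the difference-based hardness score, and the `max_sharpness` parameter are all assumptions.

```python
import math


def progressive_negative_weights(pos_sim, neg_sims, progress, max_sharpness=5.0):
    """Softmax weights over negatives whose sharpness grows with training progress.

    pos_sim:  similarity between the anchor and its true match.
    neg_sims: non-empty list of similarities between the anchor and negatives.
    progress: fraction of training completed, in [0, 1].
    Hypothetical scheme: at progress 0 all negatives get equal weight; as
    progress approaches 1, negatives whose similarity is close to (or above)
    pos_sim receive more weight.
    """
    progress = min(max(progress, 0.0), 1.0)
    sharpness = progress * max_sharpness
    # Hardness score: a negative is harder the closer it is to the positive.
    scores = [sharpness * (n - pos_sim) for n in neg_sims]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    negatives = [0.1, 0.5, 0.85]
    for p in (0.0, 0.5, 1.0):
        print(p, [round(w, 3) for w in progressive_negative_weights(0.9, negatives, p)])
```

The weights always sum to one, so they can multiply per-negative loss terms without changing the overall loss scale; only the distribution of emphasis shifts as training matures.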
[176] Parameterized Prompt for Incremental Object Detection
Zijia An, Boyu Diao, Ruiqi Liu, Libo Huang, Chuanguang Yang, Fei Wang, Zhulin An, Yongjun Xu
Main category: cs.CV
TL;DR: P²IOD introduces parameterized prompts using neural networks for incremental object detection, addressing co-occurrence issues and catastrophic forgetting through adaptive consolidation and constrained updates.
Details
Motivation: Existing prompt-based approaches for incremental learning assume disjoint class sets, which is unsuitable for object detection where objects from previous tasks can co-occur in current images, causing confusion in prompt pools.
Method: Uses neural networks as parameterized prompts to adaptively consolidate knowledge across tasks, with a parameterized prompts fusion strategy to constrain structure updates and prevent catastrophic forgetting.
Result: Extensive experiments on PASCAL VOC2007 and MS COCO datasets show P²IOD achieves state-of-the-art performance in incremental object detection.
Conclusion: Parameterized prompts with adaptive consolidation and constrained updates effectively address co-occurrence challenges in incremental object detection, outperforming existing baselines.
Abstract: Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompt-pool-based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in the prompt pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P²IOD). Leveraging the global evolution properties of neural networks, P²IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain updates to the prompt structure, P²IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate P²IOD’s effectiveness in IOD, achieving state-of-the-art performance among existing baselines.
[177] ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng
Main category: cs.CV
TL;DR: ThinkMorph is a unified multimodal model that generates interleaved text-image reasoning chains where text and vision complement each other to advance reasoning, achieving significant performance gains and emergent multimodal intelligence.
Details
Motivation: Current multimodal reasoning lacks clear understanding of what constitutes meaningful interleaved chains of thought between language and vision modalities.
Method: Built ThinkMorph by fine-tuning on ~24K high-quality interleaved reasoning traces across tasks with varying visual engagement, learning to generate progressive text-image reasoning steps that manipulate visual content while maintaining verbal logic.
Result: Achieved 34.7% average improvement on vision-centric benchmarks, matched or surpassed larger proprietary VLMs on out-of-domain tasks, and exhibited emergent multimodal intelligence including visual manipulation skills and adaptive reasoning mode switching.
Conclusion: ThinkMorph demonstrates promising directions for characterizing emergent capabilities in unified multimodal reasoning models through complementary text-image reasoning chains.
Abstract: Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary rather than isomorphic modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7 percent over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
[178] Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin
Main category: cs.CV
TL;DR: DUST is a dual-stream diffusion framework that enhances vision-language-action models by separately handling action and vision modalities while enabling cross-modal knowledge sharing, achieving improved performance in robotic tasks.
Details
Motivation: Address the modality conflict between vision and action in world-model augmented VLAs, which makes joint prediction of next-state observations and action sequences challenging due to inherent differences between these modalities.
Method: Multimodal diffusion transformer with separate modality streams, independent noise perturbations for each modality, decoupled flow matching loss, and asynchronous sampling of action and vision tokens at different rates.
Result: Achieves up to 6% gains over standard VLA baselines and implicit world-modeling methods on simulated benchmarks (RoboCasa, GR-1), with additional 2-5% gain from inference-time scaling. Outperforms baselines by 13% in success rate on real-world Franka Research 3 tasks, and shows significant gains in large-scale pretraining with BridgeV2 videos.
Conclusion: DUST effectively handles modality conflicts in VLAs through dual-stream architecture and decoupled training, demonstrating superior performance across simulation, real-world robotics, and large-scale pretraining scenarios.
Abstract: Recently, augmenting vision-language-action models (VLAs) with world-models has shown promise in robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Furthermore, based on the decoupled training framework, we introduce a sampling method where we sample action and vision tokens asynchronously at different rates, which shows improvement through inference-time scaling. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods, with our inference-time scaling approach providing an additional 2-5% gain on success rate. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%, confirming its effectiveness beyond simulation. Lastly, we demonstrate the effectiveness of DUST in large-scale pretraining with action-free videos from BridgeV2, where DUST leads to significant gain when transferred to the RoboCasa benchmark.
[179] ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation
Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Haopeng Li, Honglei Yan, Tingting Shen, Yadong Mu
Main category: cs.CV
TL;DR: ID-Composer is a novel framework for multi-subject video generation from text prompts and reference images, using hierarchical identity-preserving attention, VLM semantic guidance, and reinforcement learning to improve subject consistency and video quality.
Details
Motivation: Existing video generative models are limited to text or single image conditioning, lacking controllability for multi-subject scenarios where preserving subject identities and maintaining temporal consistency are crucial.
Method: Uses hierarchical identity-preserving attention to aggregate features across subjects and modalities, leverages pretrained VLM for semantic understanding, and employs online reinforcement learning (RLVR) to align critical concepts like subject ID.
Result: Extensive experiments show the model surpasses existing methods in identity preservation, temporal consistency, and video quality.
Conclusion: ID-Composer effectively addresses multi-subject video generation challenges by combining hierarchical attention, VLM guidance, and reinforcement learning to achieve superior performance in preserving subject identities and maintaining video quality.
Abstract: Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency. To faithfully preserve subject consistency and textual information in synthesized videos, ID-Composer designs a hierarchical identity-preserving attention mechanism, which effectively aggregates features within and across subjects and modalities. To follow the semantics of user intention, we introduce semantic understanding via a pretrained vision-language model (VLM), leveraging the VLM’s superior semantic understanding to provide fine-grained guidance and capture complex interactions between multiple subjects. Considering that the standard diffusion loss often fails to align critical concepts like subject ID, we employ an online reinforcement learning phase to drive the overall training objective of ID-Composer into RLVR. Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.
[180] Towards classification-based representation learning for place recognition on LiDAR scans
Maksim Konoplia, Dmitrii Khizbullin
Main category: cs.CV
TL;DR: Framing place recognition as multi-class classification instead of contrastive learning, achieving competitive performance with better training efficiency.
Details
Motivation: Most existing place recognition methods use contrastive learning, but this paper explores an alternative classification-based approach for autonomous driving localization.
Method: Assign discrete location labels to LiDAR scans and train an encoder-decoder model to directly classify each scan’s position.
Result: Achieves competitive performance compared to contrastive learning methods on NuScenes dataset, with advantages in training efficiency and stability.
Conclusion: Classification-based approach is a viable alternative to contrastive learning for place recognition, offering improved training characteristics.
Abstract: Place recognition is a crucial task in autonomous driving, allowing vehicles to determine their position using sensor data. While most existing methods rely on contrastive learning, we explore an alternative approach by framing place recognition as a multi-class classification problem. Our method assigns discrete location labels to LiDAR scans and trains an encoder-decoder model to classify each scan’s position directly. We evaluate this approach on the NuScenes dataset and show that it achieves competitive performance compared to contrastive learning-based methods while offering advantages in training efficiency and stability.
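To make the "discrete location labels" in [180] concrete, one simple way to turn a continuous position into a class index is to quantize it on a regular grid. The abstract does not say how the authors assign labels, so the grid scheme, the function name `location_label`, and the default cell size and grid width below are purely illustrative assumptions.

```python
def location_label(x, y, origin=(0.0, 0.0), cell_size=10.0, grid_width=100):
    """Map a 2D position (in metres) to an integer class label.

    Hypothetical labelling: the map is divided into square cells of side
    `cell_size`, and the cells are numbered row by row. `origin` is the
    corner of the map with the smallest x and y, and `grid_width` is the
    number of cells per row. None of these values come from the paper.
    """
    col = int((x - origin[0]) // cell_size)
    row = int((y - origin[1]) // cell_size)
    if not 0 <= col < grid_width or row < 0:
        raise ValueError("position lies outside the labelled grid")
    return row * grid_width + col


if __name__ == "__main__":
    # Ego positions of three scans; each one becomes a classification target.
    for pos in [(0.0, 0.0), (15.0, 0.0), (5.0, 25.0)]:
        print(pos, location_label(*pos))
```

With labels of this kind, place recognition becomes a standard multi-class problem trained with cross-entropy, which is the framing the paper contrasts with contrastive learning.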
[181] Learning with Category-Equivariant Architectures for Human Activity Recognition
Yoshihiro Maruyama
Main category: cs.CV
TL;DR: CatEquiv is a category-equivariant neural network for Human Activity Recognition that encodes temporal, amplitude, and structural symmetries to improve robustness against out-of-distribution perturbations.
Details
Motivation: To improve robustness in Human Activity Recognition by systematically encoding the categorical symmetry structure of inertial sensor data, including temporal shifts, amplitude scaling, and sensor hierarchy relationships.
Method: Introduces a symmetry category that jointly represents cyclic time shifts, positive gain scalings, and sensor-hierarchy poset, then builds a neural network that achieves equivariance with respect to this categorical symmetry product.
Result: On UCI-HAR dataset under out-of-distribution perturbations, CatEquiv achieves significantly higher robustness compared to circularly padded CNNs and plain CNNs.
Conclusion: Enforcing categorical symmetries yields strong invariance and generalization benefits without requiring additional model capacity, demonstrating the effectiveness of systematic symmetry encoding for robust activity recognition.
Abstract: We propose CatEquiv, a category-equivariant neural network for Human Activity Recognition (HAR) from inertial sensors that systematically encodes temporal, amplitude, and structural symmetries. We introduce a symmetry category that jointly represents cyclic time shifts, positive gain scalings, and the sensor-hierarchy poset, capturing the categorical symmetry structure of the data. CatEquiv achieves equivariance with respect to the categorical symmetry product. On UCI-HAR under out-of-distribution perturbations, CatEquiv attains markedly higher robustness compared with circularly padded CNNs and plain CNNs. These results demonstrate that enforcing categorical symmetries yields strong invariance and generalization without additional model capacity.
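Two of the symmetries named in [181], cyclic time shifts and positive gain scalings, can be checked on a toy layer. The sketch below is not CatEquiv: it only shows that a circular convolution (the building block of the circularly padded CNN baseline) commutes with both transformations. The helper names `circular_conv` and `cyclic_shift` and the example signal are made up for illustration, and the sensor-hierarchy poset is not modelled.

```python
def circular_conv(signal, kernel):
    """1D convolution with wrap-around (circular) padding."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(i - k) % n] for k in range(len(kernel)))
        for i in range(n)
    ]


def cyclic_shift(signal, shift):
    """Delay the signal by `shift` samples, wrapping around the end."""
    n = len(signal)
    return [signal[(i - shift) % n] for i in range(n)]


if __name__ == "__main__":
    x = [1, 4, 2, 8, 5, 7]   # toy single-channel sensor window
    w = [2, -1, 3]           # toy filter
    gain = 3                 # positive amplitude scaling

    # Shift equivariance: shifting then filtering equals filtering then shifting.
    print(circular_conv(cyclic_shift(x, 2), w) == cyclic_shift(circular_conv(x, w), 2))

    # Gain equivariance: scaling the input scales the output by the same factor.
    scaled_in = circular_conv([gain * v for v in x], w)
    scaled_out = [gain * v for v in circular_conv(x, w)]
    print(scaled_in == scaled_out)
```

Integer inputs are used so both comparisons are exact. A nonlinearity such as ReLU preserves both properties, since it acts pointwise and commutes with positive scaling, whereas adding a bias term would break the gain equivariance.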
[182] Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution
Peng Du, Hui Li, Han Xu, Paul Barom Jeon, Dongwook Lee, Daehyun Ji, Ran Yang, Feng Zhu
Main category: cs.CV
TL;DR: DTWSR is a Diffusion Transformer model for image super-resolution that uses wavelet spectra and pyramid tokenization to capture interrelations among multiscale frequency sub-bands, achieving consistent and realistic results.
Details
Motivation: Most DWT-based super-resolution methods neglect interrelations among multiscale frequency sub-bands, causing inconsistencies and artifacts in reconstructed images.
Method: Uses Multi-level Discrete Wavelet Transform for decomposition, pyramid tokenization to embed spectra into transformer tokens, and a dual-decoder to handle distinct variances in low/high-frequency sub-bands while maintaining alignment.
Result: Extensive experiments show high performance on both perception quality and fidelity across multiple benchmark datasets.
Conclusion: DTWSR effectively captures frequency sub-band interrelations using diffusion models and transformers, producing more consistent and realistic super-resolution images.
Abstract: Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image super-resolution (SR). Despite some DWT-based methods improving SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multiscale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR combines the strengths of diffusion models and transformers to capture the interrelations among multiscale frequency sub-bands, leading to more consistent and realistic SR images. Specifically, we use a Multi-level Discrete Wavelet Transform to decompose images into wavelet spectra. A pyramid tokenization method is proposed that embeds the spectra into a sequence of tokens for the transformer model, making it easier to capture features from both the spatial and frequency domains. A dual-decoder is carefully designed to handle the distinct variances in low-frequency and high-frequency sub-bands, without omitting their alignment in image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance on both perception quality and fidelity.
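The multi-level decomposition that [182] builds on can be illustrated with the simplest wavelet. The sketch below is a 1D orthonormal Haar transform, a deliberate simplification: the paper works on 2D images and does not state which wavelet it uses, so the choice of Haar and the function names `haar_decompose` and `haar_reconstruct` are assumptions.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_decompose(signal, levels):
    """Return (approximation, details) after `levels` Haar analysis steps.

    details[0] holds the finest (highest-frequency) sub-band and details[-1]
    the coarsest. The signal length must be divisible by 2 ** levels.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("length must be divisible by 2 ** levels")
        pairs = [(approx[2 * i], approx[2 * i + 1]) for i in range(len(approx) // 2)]
        details.append([(a - b) * _S for a, b in pairs])  # high-frequency part
        approx = [(a + b) * _S for a, b in pairs]         # low-frequency part
    return approx, details


def haar_reconstruct(approx, details):
    """Invert haar_decompose by applying synthesis steps, coarsest level first."""
    signal = list(approx)
    for detail in reversed(details):
        merged = []
        for a, d in zip(signal, detail):
            merged.extend([(a + d) * _S, (a - d) * _S])
        signal = merged
    return signal


if __name__ == "__main__":
    x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    approx, details = haar_decompose(x, levels=2)
    print("approximation:", approx)
    print("sub-band lengths:", [len(d) for d in details])
    print("reconstruction:", haar_reconstruct(approx, details))
```

Each level halves the resolution of the approximation and emits one detail sub-band, which gives the pyramid of scales that a tokenization scheme such as the one described in the abstract would have to embed jointly.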
[183] Positive Semi-definite Latent Factor Grouping-Boosted Cluster-reasoning Instance Disentangled Learning for WSI Representation
Chentao Li, Behzad Bozorgtabar, Yifang Ping, Pan Huang, Jing Qin
Main category: cs.CV
TL;DR: A novel MIL framework using latent factor grouping and cluster-reasoning to disentangle spatial, semantic, and decision entanglements in whole-slide pathology images, achieving superior performance and interpretability.
Details
Motivation: To address limitations in multiple instance learning (MIL) for whole-slide pathology images, specifically spatial, semantic, and decision entanglements among instances that limit representation and interpretability.
Method: Three-phase framework: 1) Positive semi-definite latent factor grouping to mitigate spatial entanglement; 2) Instance probability counterfactual inference and optimization via cluster-reasoning for semantic disentanglement; 3) Generalized linear weighted decision with instance effect re-weighting for decision disentanglement.
Result: Extensive experiments on multicentre datasets show the model outperforms all state-of-the-art models and achieves pathologist-aligned interpretability through disentangled representations and transparent decision-making.
Conclusion: The proposed framework successfully addresses key entanglement challenges in MIL for whole-slide images, providing both superior performance and enhanced interpretability aligned with pathologist reasoning.
Abstract: Multiple instance learning (MIL) has been widely used for representing whole-slide pathology images. However, spatial, semantic, and decision entanglements among instances limit its representation and interpretability. To address these challenges, we propose a latent factor grouping-boosted cluster-reasoning instance disentangled learning framework for whole-slide image (WSI) interpretable representation in three phases. First, we introduce a novel positive semi-definite latent factor grouping that maps instances into a latent subspace, effectively mitigating spatial entanglement in MIL. To alleviate semantic entanglement, we employ instance probability counterfactual inference and optimization via cluster-reasoning instance disentangling. Finally, we employ a generalized linear weighted decision via instance effect re-weighting to address decision entanglement. Extensive experiments on multicentre datasets demonstrate that our model outperforms all state-of-the-art models. Moreover, it attains pathologist-aligned interpretability through disentangled representations and a transparent decision-making process.
[184] CGF-DETR: Cross-Gated Fusion DETR for Enhanced Pneumonia Detection in Chest X-rays
Yefeng Wu, Yuchen Song, Ling Wu, Shan Wan, Yecheng Zhao
Main category: cs.CV
TL;DR: CGF-DETR is an enhanced real-time detection transformer for pneumonia detection in chest X-rays, achieving 82.2% mAP@0.5 (3.7% improvement over baseline) while maintaining 48.1 FPS inference speed.
Details
Motivation: Pneumonia is a leading cause of morbidity/mortality worldwide, requiring accurate automated detection systems. Transformer-based detectors show promise but remain underexplored for medical imaging applications like pneumonia detection.
Method: Proposes CGF-DETR with three key modules: XFABlock in backbone for multi-scale feature extraction using convolutional attention with CSP architecture; SPGA module replacing multi-head attention with dynamic gating and single-head self-attention; GCFC3 in neck for multi-path convolution fusion with structural re-parameterization.
Result: Achieves 82.2% mAP@0.5 on RSNA Pneumonia Detection dataset (3.7% improvement over RT-DETR-l baseline) with 48.1 FPS inference speed. Complete model achieves 50.4% mAP@[0.5:0.95]. Ablation studies confirm each module contributes meaningfully to performance.
Conclusion: CGF-DETR effectively enhances pneumonia detection performance while maintaining real-time inference capabilities, demonstrating the potential of transformer-based approaches for medical imaging applications.
Abstract: Pneumonia remains a leading cause of morbidity and mortality worldwide, necessitating accurate and efficient automated detection systems. While recent transformer-based detectors like RT-DETR have shown promise in object detection tasks, their application to medical imaging, particularly pneumonia detection in chest X-rays, remains underexplored. This paper presents CGF-DETR, an enhanced real-time detection transformer specifically designed for pneumonia detection. We introduce XFABlock in the backbone to improve multi-scale feature extraction through convolutional attention mechanisms integrated with CSP architecture. To achieve efficient feature aggregation, we propose the SPGA module, which replaces standard multi-head attention with dynamic gating mechanisms and single-head self-attention. Additionally, GCFC3 is designed for the neck to enhance feature representation through multi-path convolution fusion while maintaining real-time performance via structural re-parameterization. Extensive experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while maintaining comparable inference speed at 48.1 FPS. Our ablation studies confirm that each proposed module contributes meaningfully to the overall performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95].
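For readers unfamiliar with the mAP@0.5 figure quoted above: a predicted box counts as a correct detection when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below shows only that matching test. The box format (x1, y1, x2, y2) and the function names are assumptions for illustration, and a full mAP computation (one match per ground-truth box, precision averaged over recall levels) is not shown.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def is_true_positive(pred, gt, threshold=0.5):
    """True if the prediction overlaps the ground truth enough to count at this threshold."""
    return iou(pred, gt) >= threshold


# Hypothetical boxes, not taken from the RSNA dataset.
loose = iou((10, 10, 50, 50), (20, 20, 60, 60))   # overlap 900, union 2300
tight = iou((10, 10, 50, 50), (12, 12, 52, 52))   # overlap 1444, union 1756
print(round(loose, 3), round(tight, 3))
```

The mAP@[0.5:0.95] figure repeats the same test at thresholds from 0.5 to 0.95 and averages the results, which is why it is much lower than mAP@0.5.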
cs.AI
[185] Mirror-Neuron Patterns in AI Alignment
Robyn Wyrick
Main category: cs.AI
TL;DR: This paper investigates whether artificial neural networks can develop mirror neuron-like patterns that enable empathy and cooperation, proposing this as an intrinsic alignment mechanism for superhuman AI.
Details
Motivation: As AI approaches superhuman capabilities, current alignment strategies relying on external constraints may be insufficient. The research explores whether intrinsic alignment through mirror neuron-like mechanisms could complement existing methods.
Method: Used a novel Frog and Toad game framework to promote cooperative behaviors, identified conditions for mirror-neuron pattern emergence, evaluated their influence on action circuits, and introduced the Checkpoint Mirror Neuron Index (CMNI) to quantify activation.
Result: Findings show that appropriately scaled model capacities and self/other coupling foster shared neural representations similar to biological mirror neurons, supporting cooperative behavior.
Conclusion: Empathy-like circuits based on mirror-neuron dynamics could complement existing AI alignment techniques by embedding intrinsic motivations directly within AI architectures.
Abstract: As artificial intelligence (AI) advances toward superhuman capabilities, aligning these systems with human values becomes increasingly critical. Current alignment strategies rely largely on externally specified constraints that may prove insufficient against future super-intelligent AI capable of circumventing top-down controls. This research investigates whether artificial neural networks (ANNs) can develop patterns analogous to biological mirror neurons (cells that activate both when performing and observing actions), and how such patterns might contribute to intrinsic alignment in AI. Mirror neurons play a crucial role in empathy, imitation, and social cognition in humans. The study therefore asks: (1) Can simple ANNs develop mirror-neuron patterns? and (2) How might these patterns contribute to ethical and cooperative decision-making in AI systems? Using a novel Frog and Toad game framework designed to promote cooperative behaviors, we identify conditions under which mirror-neuron patterns emerge, evaluate their influence on action circuits, introduce the Checkpoint Mirror Neuron Index (CMNI) to quantify activation strength and consistency, and propose a theoretical framework for further study. Our findings indicate that appropriately scaled model capacities and self/other coupling foster shared neural representations in ANNs similar to biological mirror neurons. These empathy-like circuits support cooperative behavior and suggest that intrinsic motivations modeled through mirror-neuron dynamics could complement existing alignment techniques by embedding empathy-like mechanisms directly within AI architectures.
[186] Human-AI Co-Embodied Intelligence for Scientific Experimentation and Manufacturing
Xinyi Lin, Yuyang Zhang, Yuanhang Gan, Juntao Chen, Hao Shen, Yichun He, Lijun Li, Ze Yuan, Shuang Wang, Chaohao Wang, Rui Zhang, Na Li, Jia Liu
Main category: cs.AI
TL;DR: Introduces human-AI co-embodied intelligence that combines human execution, agentic AI reasoning, and wearable hardware for real-world experiments and manufacturing, bridging the gap between machine intelligence and physical execution.
Details
Motivation: Current machine learning models are confined to virtual domains while real-world experiments and manufacturing still require human supervision, limiting reproducibility, scalability, and accessibility.
Method: Developed Agentic-Physical Experimentation (APEX) system that couples agentic reasoning with physical execution through mixed-reality, observing human actions, aligning with procedures, providing 3D visual guidance, and analyzing every step.
Result: APEX achieves context-aware reasoning with accuracy exceeding general multimodal LLMs, corrects errors in real time, and transfers expertise to beginners in cleanroom flexible electronics fabrication.
Conclusion: Establishes a new class of agentic-physical-human intelligence that extends agentic reasoning into the physical domain, transforming scientific research and manufacturing into autonomous, traceable, interpretable, and scalable processes.
Abstract: Scientific experiment and manufacture rely on complex, multi-step procedures that demand continuous human expertise for precise execution and decision-making. Despite advances in machine learning and automation, conventional models remain confined to virtual domains, while real-world experiment and manufacture still rely on human supervision and expertise. This gap between machine intelligence and physical execution limits reproducibility, scalability, and accessibility across scientific and manufacturing workflows. Here, we introduce human-AI co-embodied intelligence, a new form of physical AI that unites human users, agentic AI, and wearable hardware into an integrated system for real-world experiment and intelligent manufacture. In this paradigm, humans provide precise execution and control, while agentic AI contributes memory, contextual reasoning, adaptive planning, and real-time feedback. The wearable interface continuously captures the experimental and manufacturing processes and facilitates seamless communication between humans and AI for corrective guidance and interpretable collaboration. As a demonstration, we present the Agentic-Physical Experimentation (APEX) system, coupling agentic reasoning with physical execution through mixed-reality. APEX observes and interprets human actions, aligns them with standard operating procedures, provides 3D visual guidance, and analyzes every step. Implemented in a cleanroom for flexible electronics fabrication, the APEX system achieves context-aware reasoning with accuracy exceeding general multimodal large language models, corrects errors in real time, and transfers expertise to beginners. These results establish a new class of agentic-physical-human intelligence that extends agentic reasoning beyond computation into the physical domain, transforming scientific research and manufacturing into autonomous, traceable, interpretable, and scalable processes.
[187] Automated Reward Design for Gran Turismo
Michel Ma, Takuma Seno, Kaushik Subramanian, Peter R. Wurman, Peter Stone, Craig Sherstan
Main category: cs.AI
TL;DR: Using foundation models to automatically generate reward functions for RL agents in racing games based on text instructions, achieving competitive performance with champion-level agents.
Details
Motivation: Manual reward function design for RL agents is challenging, especially in complex environments like autonomous racing. This paper aims to automate this process using foundation models.
Method: Combines LLM-based reward generation, VLM preference-based evaluation, and human feedback to search over reward function spaces for Gran Turismo 7 racing agents.
Result: Produced racing agents competitive with GT Sophy (champion-level RL agent) and generated novel behaviors, demonstrating practical automated reward design.
Conclusion: The approach enables practical automated reward design for real-world applications using foundation models, eliminating the need for manual reward function engineering.
Abstract: When designing reinforcement learning (RL) agents, a designer communicates the desired agent behavior through the definition of reward functions - numerical feedback given to the agent as reward or punishment for its actions. However, mapping desired behaviors to reward functions can be a difficult process, especially in complex environments such as autonomous racing. In this paper, we demonstrate how current foundation models can effectively search over a space of reward functions to produce desirable RL agents for the Gran Turismo 7 racing game, given only text-based instructions. Through a combination of LLM-based reward generation, VLM preference-based evaluation, and human feedback we demonstrate how our system can be used to produce racing agents competitive with GT Sophy, a champion-level RL racing agent, as well as generate novel behaviors, paving the way for practical automated reward design in real world applications.
[188] Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences
Joshua Ashkinaze, Hua Shen, Sai Avula, Eric Gilbert, Ceren Budak
Main category: cs.AI
TL;DR: The Deep Value Benchmark (DVB) is a framework that tests whether LLMs learn fundamental human values or just surface-level preferences, revealing that models generalize deep values less than chance with an average rate of 0.30.
Details
Motivation: To distinguish between AI systems that learn robust human values versus those that only capture superficial patterns, which is critical for AI alignment and preventing misaligned behavior.
Method: Uses controlled confounding between deep values and shallow features in training data, then breaks these correlations in testing to measure Deep Value Generalization Rate (DVGR).
Result: Average DVGR across 9 models is 0.30, all models generalize deep values less than chance, and larger models show slightly lower DVGR than smaller ones.
Conclusion: Current LLMs fail to robustly learn fundamental human values, instead relying on shallow features, highlighting a significant alignment challenge.
Abstract: We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features – for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options. This design allows us to precisely measure a model’s Deep Value Generalization Rate (DVGR) – the probability of generalizing based on the underlying value rather than the shallow feature. Across 9 different models, the average DVGR is just 0.30. All models generalize deep values less than chance. Larger models have a (slightly) lower DVGR than smaller models. We are releasing our dataset, which was subject to three separate human validation experiments. DVB provides an interpretable measure of a core feature of alignment.
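The DVGR described above is, at its simplest, the share of test-phase trials in which the model picks the option carrying the trained deep value rather than the trained shallow feature. The following is a minimal sketch of that tally, assuming each trial is reduced to a single boolean. The class and field names are hypothetical and the five trials are invented for illustration, not drawn from the DVB dataset.

```python
from dataclasses import dataclass


@dataclass
class TestTrial:
    """One test-phase choice; the field name is illustrative, not from the DVB release."""
    chose_deep_value_option: bool  # True if the model followed the deep value, e.g. (non-maleficence, informal)


def deep_value_generalization_rate(trials):
    """Fraction of trials in which the choice followed the deep value."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.chose_deep_value_option for t in trials) / len(trials)


# Invented outcomes: 2 of 5 choices follow the deep value.
trials = [TestTrial(True), TestTrial(False), TestTrial(False), TestTrial(True), TestTrial(False)]
dvgr = deep_value_generalization_rate(trials)
print(f"DVGR = {dvgr:.2f}")
```

With two options per trial, a model choosing at random would score about 0.5, which is the chance level the reported 0.30 average is compared against.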
[189] InsurAgent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance
Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Dan M. Frangopol, Minghui Cheng
Main category: cs.AI
TL;DR: InsurAgent is an LLM-powered agent that improves flood insurance decision modeling by combining retrieval-augmented generation with reasoning modules to accurately estimate probabilities and simulate temporal decision evolution.
Details
Motivation: Low flood insurance participation rates among at-risk populations in the US highlight the need to better understand behavioral mechanisms behind insurance decisions, and LLMs offer promising tools for simulating human decision-making.
Method: Proposed InsurAgent with five modules: perception, retrieval (using RAG to ground decisions in survey data), reasoning (leveraging LLM common sense), action, and memory (for temporal decision simulation). Created benchmark dataset to evaluate LLM capabilities.
Result: While LLMs show qualitative understanding of factors, they fall short in quantitative probability estimation. InsurAgent achieves accurate estimation of marginal and bivariate probabilities through RAG and captures contextual information beyond survey data through reasoning.
Conclusion: InsurAgent provides a valuable tool for behavioral modeling and policy analysis, effectively addressing LLM limitations in quantitative probability estimation while enabling simulation of temporal decision evolution.
Abstract: Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-ranging tasks, offering promising tools for simulating human decision-making. This study constructs a benchmark dataset to capture insurance purchase probabilities across factors. Using this dataset, the capacity of LLMs is evaluated: while LLMs exhibit a qualitative understanding of factors, they fall short in estimating quantitative probabilities. To address this limitation, InsurAgent, an LLM-empowered agent comprising five modules including perception, retrieval, reasoning, action, and memory, is proposed. The retrieval module leverages retrieval-augmented generation (RAG) to ground decisions in empirical survey data, achieving accurate estimation of marginal and bivariate probabilities. The reasoning module leverages LLM common sense to extrapolate beyond survey data, capturing contextual information that is intractable for traditional models. The memory module supports the simulation of temporal decision evolutions, illustrated through a roller coaster life trajectory. Overall, InsurAgent provides a valuable tool for behavioral modeling and policy analysis.
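To make "marginal and bivariate probabilities" concrete: a marginal probability conditions the purchase rate on one factor, a bivariate probability on two factors at once. The sketch below computes both from a toy survey table. The records, field names, and the `purchase_probability` helper are all invented for illustration and do not reflect the paper's survey data or its retrieval module.

```python
def purchase_probability(records, **conditions):
    """Share of respondents who bought insurance among those matching all conditions."""
    matching = [r for r in records if all(r[key] == value for key, value in conditions.items())]
    if not matching:
        return None  # no survey evidence for this combination
    return sum(r["bought"] for r in matching) / len(matching)


# Toy survey records (invented).
survey = [
    {"income": "high", "flood_zone": True,  "bought": True},
    {"income": "high", "flood_zone": True,  "bought": True},
    {"income": "high", "flood_zone": False, "bought": False},
    {"income": "high", "flood_zone": False, "bought": True},
    {"income": "low",  "flood_zone": True,  "bought": True},
    {"income": "low",  "flood_zone": True,  "bought": False},
    {"income": "low",  "flood_zone": False, "bought": False},
    {"income": "low",  "flood_zone": False, "bought": False},
]

marginal = purchase_probability(survey, income="high")                    # one factor
bivariate = purchase_probability(survey, income="low", flood_zone=True)   # two factors
print(marginal, bivariate)
```

An agent grounded this way could be handed such figures as retrieved context, rather than being asked to guess the numbers from general knowledge.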
[190] When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning
Chenyu Zhang, Minsol Kim, Shohreh Ghorbani, Jingyao Wu, Rosalind Picard, Patricia Maes, Paul Pu Liang
Main category: cs.AI
TL;DR: The paper introduces modality sabotage as a diagnostic failure mode in MLLMs where unimodal errors override correct evidence, and proposes a lightweight evaluation framework to audit fusion dynamics by treating modalities as agents.
Details
Motivation: Multimodal large language models lack transparency in their reasoning traces, making it unclear how modalities interact, resolve conflicts, or dominate predictions.
Method: A model-agnostic evaluation layer treats each modality as an agent producing candidate labels and self-assessments, with a simple fusion mechanism to identify contributors and saboteurs.
Result: Case study on multimodal emotion recognition revealed systematic reliability profiles and insights into whether failures stem from dataset artifacts or model limitations.
Conclusion: The framework provides a diagnostic scaffold for multimodal reasoning, enabling principled auditing of fusion dynamics and informing potential interventions.
Abstract: Despite rapid growth in multimodal large language models (MLLMs), their reasoning traces remain opaque: it is often unclear which modality drives a prediction, how conflicts are resolved, or when one stream dominates. In this paper, we introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. To analyze such dynamics, we propose a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead). Applying our diagnostic layer in a case study on multimodal emotion recognition benchmarks with foundation models revealed systematic reliability profiles, providing insight into whether failures may arise from dataset artifacts or model limitations. More broadly, our framework offers a diagnostic scaffold for multimodal reasoning, supporting principled auditing of fusion dynamics and informing possible interventions.
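The contributor/saboteur audit can be illustrated with a small sketch. The abstract does not specify the fusion rule, so the code below assumes a confidence-weighted vote, and the modality names, labels, and confidences are invented. It flags a sabotage event when the fused label is wrong even though at least one modality had the right answer.

```python
from collections import defaultdict


def fuse(votes):
    """votes: {modality: (label, confidence)} -> label with the largest summed confidence."""
    totals = defaultdict(float)
    for label, confidence in votes.values():
        totals[label] += confidence
    return max(totals, key=totals.get)


def audit(votes, truth):
    """Split modalities into contributors and saboteurs relative to the true label."""
    fused = fuse(votes)
    contributors = sorted(m for m, (label, _) in votes.items() if label == truth)
    saboteurs = sorted(m for m, (label, _) in votes.items() if label != truth)
    # Sabotage: correct evidence existed but the fused answer is still wrong.
    sabotaged = fused != truth and bool(contributors)
    return {"fused": fused, "contributors": contributors,
            "saboteurs": saboteurs, "sabotaged": sabotaged}


# Invented example: a confident but wrong video stream outweighs two correct streams.
votes = {"text": ("happy", 0.55), "audio": ("happy", 0.30), "video": ("angry", 0.95)}
report = audit(votes, truth="happy")
print(report)
```

Here the two correct modalities sum to 0.85 against 0.95 for the single wrong one, so the fused result follows the saboteur.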
[191] Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning
Renos Zabounidis, Aditya Golatkar, Michael Kleinman, Alessandro Achille, Wei Xia, Stefano Soatto
Main category: cs.AI
TL;DR: Re-FORC is an adaptive reward prediction method that predicts future rewards based on thinking tokens, enabling compute-efficient reasoning through early stopping, optimized model selection, and adaptive scaling.
Details
Motivation: To reduce computational costs in reasoning tasks while maintaining or improving accuracy by dynamically controlling reasoning length and model selection.
Method: Trains lightweight adapters on reasoning models to predict expected future rewards as a function of thinking tokens, enabling adaptive control of reasoning processes.
Result: Achieves 26% compute reduction with maintained accuracy, 4% higher accuracy at equal compute, 55% less compute at equal accuracy, and 7-11% accuracy improvements in different compute regimes.
Conclusion: Re-FORC enables efficient dynamic reasoning with upfront computation estimation and cost-per-token control, significantly improving compute-accuracy tradeoffs.
Abstract: We propose Re-FORC, an adaptive reward prediction method that, given a context, enables prediction of the expected future rewards as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in high compute regime, and 7% in low compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
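One way to read "length control via cost-per-token thresholds" is as a net-benefit rule: keep thinking only while the forecast reward gain outweighs the token cost. The sketch below assumes that reading. The forecast table stands in for the trained adapter, its values are invented, and the function name is hypothetical; the paper's actual decision rule may differ.

```python
def best_extra_tokens(forecast, cost_per_token):
    """Pick the number of extra thinking tokens that maximizes forecast reward minus token cost.

    forecast: {extra_tokens: predicted expected reward}, and must include 0 (stop now).
    """
    def net_value(n):
        return forecast[n] - cost_per_token * n
    return max(sorted(forecast), key=net_value)


# Invented forecast: reward rises with more thinking but with diminishing returns.
forecast = {0: 0.40, 256: 0.55, 512: 0.62, 1024: 0.64}

cheap = best_extra_tokens(forecast, cost_per_token=0.0002)
pricey = best_extra_tokens(forecast, cost_per_token=0.001)
print(cheap, pricey)
```

A result of 0 means stopping immediately. Raising the cost per token shortens the chosen reasoning length, which is the dial the abstract describes.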
[192] Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning
Yibo Zhao, Yang Zhao, Hongru Du, Hao Frank Yang
Main category: cs.AI
TL;DR: ATHENA is a framework that combines symbolic utility functions with semantic adaptation using LLMs to model personalized human decision-making, outperforming existing models by at least 6.5% F1 score.
Details
Motivation: Individual decision-making often differs from population-level predictions, especially in high-stakes scenarios like vaccine uptake, due to unique personal factors including numerical attributes and linguistic influences.
Method: Two-stage approach: 1) LLM-augmented symbolic discovery for group-level utility functions, 2) Individual-level semantic adaptation creating personalized semantic templates guided by optimal utility.
Result: Consistently outperforms utility-based, machine learning, and other LLM-based models on real-world travel mode and vaccine choice tasks, with at least 6.5% F1 score improvement over strongest baselines.
Conclusion: ATHENA provides an effective scheme for human-centric decision modeling by organically integrating symbolic utility modeling and semantic adaptation, with both stages being critical and complementary.
Abstract: Decision-making models for individuals, particularly in high-stakes scenarios like vaccine uptake, often diverge from population optimal predictions. This gap arises from the uniqueness of the individual decision-making process, shaped by numerical attributes (e.g., cost, time) and linguistic influences (e.g., personal preferences and constraints). Developing upon Utility Theory and leveraging the textual-reasoning capabilities of Large Language Models (LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric Reasoning framework (ATHENA) to address the optimal information integration. ATHENA uniquely integrates two stages: First, it discovers robust, group-level symbolic utility functions via LLM-augmented symbolic discovery; Second, it implements individual-level semantic adaptation, creating personalized semantic templates guided by the optimal utility to model personalized choices. Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions. The project page can be found at https://yibozh.github.io/Athena.
[193] Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration
Jingbo Wang, Sendong Zhao, Haochun Wang, Yuzheng Fan, Lizhe Zhang, Yan Liu, Ting Liu
Main category: cs.AI
TL;DR: STRMAC is a state-aware routing framework for multi-agent systems that adaptively selects the most suitable agent at each step, achieving 23.8% performance improvement and 90.1% data collection reduction.
Details
Motivation: Current multi-agent systems suffer from rigid scheduling and inefficient coordination that fail to adapt to evolving task requirements, limiting their full potential.
Method: Separately encodes interaction history and agent knowledge to power a router that adaptively selects single agents at each step, plus a self-evolving data generation approach for efficient training.
Result: Achieves state-of-the-art performance with up to 23.8% improvement over baselines and reduces data collection overhead by up to 90.1% compared to exhaustive search.
Conclusion: STRMAC enables efficient and effective collaboration in multi-agent systems through adaptive agent selection and efficient training data generation.
Abstract: The emergence of multi-agent systems powered by large language models (LLMs) has unlocked new frontiers in complex task-solving, enabling diverse agents to integrate unique expertise, collaborate flexibly, and address challenges unattainable for individual models. However, the full potential of such systems is hindered by rigid agent scheduling and inefficient coordination strategies that fail to adapt to evolving task requirements. In this paper, we propose STRMAC, a state-aware routing framework designed for efficient collaboration in multi-agent systems. Our method separately encodes interaction history and agent knowledge to power the router, which adaptively selects the most suitable single agent at each step for efficient and effective collaboration. Furthermore, we introduce a self-evolving data generation approach that accelerates the collection of high-quality execution paths for efficient system training. Experiments on challenging collaborative reasoning benchmarks demonstrate that our method achieves state-of-the-art performance, with up to 23.8% improvement over baselines, and reduces data collection overhead by up to 90.1% compared to exhaustive search.
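The routing step can be pictured as scoring each agent's encoding against the encoding of the current interaction state and picking the best match. The sketch below assumes a plain dot-product score over small hand-written vectors. The agent names, vectors, and scoring rule are illustrative assumptions and not STRMAC's learned encoders or router.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route(state_vec, agent_vecs):
    """Return the name of the agent whose encoding best matches the current state."""
    return max(agent_vecs, key=lambda name: dot(state_vec, agent_vecs[name]))


# Hypothetical agent-knowledge encodings (in practice these would come from an encoder).
agents = {
    "planner":  [1.0, 0.0, 0.2],
    "coder":    [0.1, 1.0, 0.0],
    "verifier": [0.0, 0.3, 1.0],
}

# Hypothetical encodings of the interaction history at two different steps.
step_1 = route([0.2, 0.9, 0.1], agents)
step_2 = route([0.0, 0.1, 0.8], agents)
print(step_1, step_2)
```

Because the state encoding changes as the interaction history grows, the selected agent can differ from step to step, which is the contrast with a fixed schedule.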
[194] Training Proactive and Personalized LLM Agents
Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck, Graham Neubig, Maarten Sap, Yiming Yang
Main category: cs.AI
TL;DR: PPP is a multi-objective RL approach that jointly optimizes productivity, proactivity, and personalization for AI agents, showing significant improvements over baselines like GPT-5.
Details
Motivation: Existing work focuses mainly on task success, but effective real-world agents need to optimize three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to user preferences).
Method: Introduced UserVille environment with LLM-based user simulators for diverse preferences, and PPP multi-objective reinforcement learning approach that jointly optimizes all three dimensions.
Result: Experiments on software engineering and deep research tasks show PPP agents achieve +21.6 average improvement over GPT-5, demonstrating strategic questioning, adaptation to unseen preferences, and better task success.
Conclusion: Explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.
Abstract: While existing work focuses primarily on task success, we argue that effective real-world agents require optimizing three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to diverse user preferences). We introduce UserVille, an interactive environment with LLM-based user simulators enabling diverse, configurable user preferences. Leveraging UserVille, we introduce PPP, a multi-objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments on software engineering and deep research tasks show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6 on average), demonstrating the ability to ask strategic clarifying questions, adapt to unseen user preferences, and improve task success through better interaction. This work demonstrates that explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.
[195] TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data
Changjiang Jiang, Fengchang Yu, Haihua Chen, Wei Lu, Jin Zeng
Main category: cs.AI
TL;DR: A framework called TabDSR that improves LLM performance on complex tabular numerical reasoning through query decomposition, table sanitization, and program-of-thoughts reasoning.
Details
Motivation: LLMs underperform on complex tabular reasoning due to complex queries, noisy data, and limited numerical capabilities.
Method: Three-component framework: (1) query decomposer to break down complex questions, (2) table sanitizer to clean and filter noisy tables, and (3) program-of-thoughts-based reasoner that generates executable code.
Result: Achieves SOTA performance with 8.79%, 6.08%, and 19.87% accuracy improvements on the TAT-QA, TableBench, and CalTab151 datasets, respectively. Introduces a new dataset, CalTab151, for unbiased evaluation.
Conclusion: The framework effectively enhances LLM performance for complex tabular numerical reasoning and integrates seamlessly with mainstream LLMs.
Abstract: Complex reasoning over tabular data is crucial in real-world data analysis, yet large language models (LLMs) often underperform due to complex queries, noisy data, and limited numerical capabilities. To address these issues, we propose TabDSR, a framework consisting of: (1) a query decomposer that breaks down complex questions, (2) a table sanitizer that cleans and filters noisy tables, and (3) a program-of-thoughts (PoT)-based reasoner that generates executable code to derive the final answer from the sanitized table. To ensure unbiased evaluation and mitigate data leakage, we introduce a new dataset, CalTab151, specifically designed for complex numerical reasoning over tables. Experimental results demonstrate that TabDSR consistently outperforms existing methods, achieving state-of-the-art (SOTA) performance with 8.79%, 6.08%, and 19.87% accuracy improvement on TAT-QA, TableBench, and CalTab151, respectively. Moreover, our framework integrates seamlessly with mainstream LLMs, providing a robust solution for complex tabular numerical reasoning. These findings highlight the effectiveness of our framework in enhancing LLM performance for complex tabular numerical reasoning. Data and code are available upon request.
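The sanitize-then-execute idea can be sketched in a few lines. The code below is a toy stand-in: the cleaning rules, the table, and the "generated" program string are all invented for illustration, the LLM calls are replaced by hard-coded values, and none of it is the authors' implementation.

```python
import re


def sanitize_cell(cell):
    """Strip currency symbols, percent signs, thousands separators and whitespace; return a float or None."""
    text = re.sub(r"[,$%\s]", "", str(cell))
    try:
        return float(text)
    except ValueError:
        return None


def sanitize_table(rows):
    """Keep only rows whose 'revenue' cell parses as a number (a toy filtering rule)."""
    clean = []
    for row in rows:
        value = sanitize_cell(row["revenue"])
        if value is not None:
            clean.append({"year": int(row["year"]), "revenue": value})
    return clean


# Invented noisy table.
raw = [
    {"year": "2021", "revenue": "$1,200"},
    {"year": "2022", "revenue": " 1,500 "},
    {"year": "2023", "revenue": "N/A"},
]
table = sanitize_table(raw)

# Stand-in for a program an LLM might generate for the sub-question
# "What is the percentage growth in revenue from 2021 to 2022?"
program = "answer = (table[1]['revenue'] - table[0]['revenue']) / table[0]['revenue'] * 100"
scope = {"table": table}
exec(program, scope)  # only acceptable here because the program string is hard-coded
answer = scope["answer"]
print(answer)
```

Running generated code against cleaned numeric values, instead of asking the model to do arithmetic over raw strings, is the design point of a program-of-thoughts reasoner. Model-generated code should be executed in a sandbox rather than with a bare `exec`.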
[196] Deep Ideation: Designing LLM Agents to Generate Novel Research Ideas on Scientific Concept Network
Keyu Zhao, Weiquan Lin, Qirui Zheng, Fengli Xu, Yong Li
Main category: cs.AI
TL;DR: Deep Ideation framework improves research idea generation by 10.67% using scientific concept networks and iterative refinement with reviewer feedback.
Details
Motivation: Previous methods rely on simplistic keyword associations and fail to capture complex contextual relationships between scientific concepts, limiting their ability to generate truly novel research ideas grounded in established literature.
Method: Proposes Deep Ideation framework with scientific network capturing keyword co-occurrence and contextual relationships, explore-expand-evolve workflow using Idea Stack, and critic engine trained on real reviewer feedback.
Result: 10.67% improvement in idea quality compared to other methods, generated ideas surpass top conference acceptance levels, human evaluation confirms practical value, ablation studies validate component effectiveness.
Conclusion: Deep Ideation successfully integrates scientific networks with LLM-driven ideation to generate high-quality, novel research ideas that are both innovative and feasible, advancing automated research ideation capabilities.
Abstract: Novel research ideas play a critical role in advancing scientific inquiries. Recent advancements in Large Language Models (LLMs) have demonstrated their potential to generate novel research ideas by leveraging large-scale scientific literature. However, previous work in research ideation has primarily relied on simplistic methods, such as keyword co-occurrence or semantic similarity. These approaches focus on identifying statistical associations in the literature but overlook the complex, contextual relationships between scientific concepts, which are essential to effectively leverage knowledge embedded in human literature. For instance, papers that simultaneously mention “keyword A” and “keyword B” often present research ideas that integrate both concepts. Additionally, some LLM-driven methods propose and refine research ideas using the model’s internal knowledge, but they fail to effectively utilize the scientific concept network, limiting the grounding of ideas in established research. To address these challenges, we propose the Deep Ideation framework, which integrates a scientific network that captures keyword co-occurrence and contextual relationships, enriching LLM-driven ideation. The framework introduces an explore-expand-evolve workflow to iteratively refine research ideas, using an Idea Stack to track progress. A critic engine, trained on real-world reviewer feedback, guides the process by providing continuous feedback on the novelty and feasibility of ideas. Our experiments show that our approach improves the quality of generated ideas by 10.67% compared to other methods, with ideas surpassing top conference acceptance levels. Human evaluation highlights their practical value in scientific research, and ablation studies confirm the effectiveness of each component in the workflow. Code repo is available at https://github.com/kyZhao-1/Deep-Ideation.
[197] When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs
Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang, Di Wang, Lijie Hu
Main category: cs.AI
TL;DR: This paper introduces a framework to analyze how multimodal LLMs resolve conflicting information between modalities, decomposing modality following into relative reasoning uncertainty and inherent modality preference.
Details
Motivation: Prior work only measured modality following with coarse dataset-level statistics, overlooking the influence of model's confidence in unimodal reasoning.
Method: Constructed a controllable dataset that systematically varies visual and textual reasoning difficulty, used entropy as uncertainty metric, and probed layer-wise predictions to reveal internal mechanisms.
Result: Discovered a universal law: probability of following a modality decreases monotonically as its relative uncertainty increases. Identified balance points as indicators of inherent modality preference and revealed oscillation mechanisms in ambiguous regions.
Conclusion: Relative uncertainty and inherent preference are the two governing principles of modality following, providing both quantitative framework and mechanistic insight into how MLLMs resolve conflicting information.
Abstract: Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information, a process we term modality following. Prior work measured this behavior only with coarse dataset-level statistics, overlooking the influence of the model’s confidence in unimodal reasoning. In this paper, we introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference (a model’s stable bias when uncertainties are balanced). To validate this framework, we construct a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs. Using entropy as a fine-grained uncertainty metric, we uncover a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases. The relative difficulty level at which the model follows both modalities with comparable probability, which we call the balance point, is a practical indicator of the model’s inherent preference. Unlike traditional macro-level ratios, this measure offers a more principled and less confounded way to characterize modality bias, disentangling it from unimodal capabilities and dataset artifacts. Further, by probing layer-wise predictions, we reveal the internal mechanism of oscillation: in ambiguous regions near the balance point, models vacillate between modalities across layers, explaining externally observed indecision. Together, these findings establish relative uncertainty and inherent preference as the two governing principles of modality following, offering both a quantitative framework and mechanistic insight into how MLLMs resolve conflicting information.
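As a rough illustration of entropy used as an uncertainty metric, the sketch below computes the Shannon entropy of two unimodal answer distributions and a signed gap between them. The abstract does not give the authors' formula for relative uncertainty, so the normalized gap in `relative_uncertainty` and the example probabilities are assumptions made only for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def relative_uncertainty(text_probs, image_probs):
    """Hypothetical normalized entropy gap (not the paper's definition).

    Positive when the text-only prediction is the more uncertain one.
    """
    h_text, h_image = entropy(text_probs), entropy(image_probs)
    total = h_text + h_image
    return 0.0 if total == 0 else (h_text - h_image) / total

# Made-up distributions over four answer options:
# a confident visual prediction and a maximally uncertain textual one.
image_probs = [0.97, 0.01, 0.01, 0.01]
text_probs = [0.25, 0.25, 0.25, 0.25]
gap = relative_uncertainty(text_probs, image_probs)
```

Under the law stated in the abstract, a large positive gap such as this one would correspond to a low probability of following the text modality.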
[198] Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation
Zhiwei Zhang, Xiaomin Li, Yudi Lin, Hui Liu, Ramraj Chandradevan, Linlin Wu, Minhua Lin, Fali Wang, Xianfeng Tang, Qi He, Suhang Wang
Main category: cs.AI
TL;DR: The paper addresses lazy agent behavior in multi-agent reasoning systems where one agent dominates while the other contributes little, undermining collaboration. It provides theoretical analysis, introduces causal influence measurement, and proposes a verifiable reward mechanism to mitigate this issue.
Details
Motivation: Current multi-agent reasoning systems suffer from lazy agent behavior where collaboration collapses into ineffective single-agent performance, limiting the potential benefits of multi-agent frameworks for complex reasoning tasks.
Method: The authors provide theoretical analysis of lazy behavior, introduce a stable method for measuring causal influence between agents, and propose a verifiable reward mechanism that allows the reasoning agent to discard noisy outputs, consolidate instructions, and restart reasoning when necessary.
Result: Extensive experiments demonstrate that the proposed framework effectively alleviates lazy agent behavior and unlocks the full potential of multi-agent systems for complex reasoning tasks.
Conclusion: The introduced framework successfully mitigates lazy agent behavior in multi-agent reasoning through theoretical analysis, causal influence measurement, and verifiable rewards, enabling more effective collaboration in complex reasoning scenarios.
Abstract: Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of the multi-agent framework for complex reasoning tasks.
[199] Chronic Kidney Disease Prognosis Prediction Using Transformer
Yohan Lee, DongGyun Kang, SeHoon Park, Sa-Yoon Park, Kwangsoo Kim
Main category: cs.AI
TL;DR: ProQ-BERT: A transformer-based framework for predicting CKD progression using multi-modal EHR data, achieving superior performance with ROC-AUC up to 0.995.
Details
Motivation: CKD affects 10% of global population and accurate prognosis prediction is crucial for timely interventions and resource optimization in healthcare.
Method: Transformer-based framework integrating demographic, clinical, and laboratory data with quantization-based tokenization for continuous lab values and attention mechanisms for interpretability. Pretrained with masked language modeling and fine-tuned for binary classification tasks.
Result: Outperformed CEHR-BERT on cohort of 91,816 patients, achieving ROC-AUC up to 0.995 and PR-AUC up to 0.989 for short-term prediction of CKD progression from stage 3a to stage 5.
Conclusion: Transformer architectures with temporal design choices are highly effective for clinical prognosis modeling, offering promising direction for personalized CKD care.
Abstract: Chronic Kidney Disease (CKD) affects nearly 10% of the global population and often progresses to end-stage renal failure. Accurate prognosis prediction is vital for timely interventions and resource optimization. We present a transformer-based framework for predicting CKD progression using multi-modal electronic health records (EHR) from the Seoul National University Hospital OMOP Common Data Model. Our approach (ProQ-BERT) integrates demographic, clinical, and laboratory data, employing quantization-based tokenization for continuous lab values and attention mechanisms for interpretability. The model was pretrained with masked language modeling and fine-tuned for binary classification tasks predicting progression from stage 3a to stage 5 across varying follow-up and assessment periods. Evaluated on a cohort of 91,816 patients, our model consistently outperformed CEHR-BERT, achieving ROC-AUC up to 0.995 and PR-AUC up to 0.989 for short-term prediction. These results highlight the effectiveness of transformer architectures and temporal design choices in clinical prognosis modeling, offering a promising direction for personalized CKD care.
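One plausible reading of quantization-based tokenization is to bin each continuous lab value and emit the bin index as a discrete token. The sketch below uses rank-based (quantile-style) cut points; the binning scheme, the token format, the `eGFR` example, and all numbers are assumptions for illustration, since the abstract does not describe ProQ-BERT's actual tokenizer.

```python
from bisect import bisect_right

def fit_bin_edges(values, n_bins):
    """Return n_bins - 1 cut points taken at equally spaced ranks of the sorted values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]

def lab_token(name, value, edges):
    """Map a continuous lab value to a discrete token such as 'eGFR_Q2' (hypothetical format)."""
    return f"{name}_Q{bisect_right(edges, value)}"

# Made-up training values for one lab test; real edges would come from the cohort.
egfr_train = [12.0, 25.0, 38.0, 44.0, 51.0, 58.0, 63.0, 90.0]
edges = fit_bin_edges(egfr_train, n_bins=4)
tokens = [lab_token("eGFR", v, edges) for v in (10.0, 45.0, 95.0)]
```

Tokens of this kind can share a vocabulary with diagnosis and medication codes, which is what makes masked language modeling applicable to numeric lab results.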
[200] Fuzzy Soft Set Theory based Expert System for the Risk Assessment in Breast Cancer Patients
Muhammad Sheharyar Liaqat
Main category: cs.AI
TL;DR: A fuzzy soft set theory-based expert system for breast cancer risk assessment using clinical parameters like BMI, insulin, leptin, adiponectin levels, and age.
Details
Motivation: Early breast cancer detection is critical but challenging due to disease complexity and variable patient risk factors, requiring accessible preliminary assessment methods.
Method: Integration of fuzzy inference rules and soft set computations using measurable clinical parameters from routine blood analyses for non-invasive risk assessment.
Result: Developed and validated using UCI Machine Learning Repository dataset to identify high-risk patients and guide further diagnostic procedures.
Conclusion: The system provides healthcare professionals with a non-invasive tool for preliminary breast cancer risk assessment to support early detection decisions.
Abstract: Breast cancer remains one of the leading causes of mortality among women worldwide, with early diagnosis being critical for effective treatment and improved survival rates. However, timely detection continues to be a challenge due to the complex nature of the disease and variability in patient risk factors. This study presents a fuzzy soft set theory-based expert system designed to assess the risk of breast cancer in patients using measurable clinical and physiological parameters. The proposed system integrates Body Mass Index, Insulin Level, Leptin Level, Adiponectin Level, and age as input variables to estimate breast cancer risk through a set of fuzzy inference rules and soft set computations. These parameters can be obtained from routine blood analyses, enabling a non-invasive and accessible method for preliminary assessment. The dataset used for model development and validation was obtained from the UCI Machine Learning Repository. The proposed expert system aims to support healthcare professionals in identifying high-risk patients and determining the necessity of further diagnostic procedures such as biopsies.
[201] Generative World Models of Tasks: LLM-Driven Hierarchical Scaffolding for Embodied Agents
Brennen Hill
Main category: cs.AI
TL;DR: The paper proposes Hierarchical Task Environments (HTEs) as a framework that integrates symbolic/hierarchical methods with multi-agent reinforcement learning to address complex multi-agent tasks like robotic soccer, using LLMs as generative world models to create learning scaffolds.
Details
Motivation: End-to-end approaches fail for complex multi-agent tasks due to intractable exploration spaces and sparse rewards. Current research shows a trend towards combining symbolic methods with MARL to decompose complex goals into manageable subgoals.
Method: Proposes HTE framework integrating Hierarchical Task Networks, Bayesian Strategy Networks with MARL, using LLMs as generative world models to dynamically create hierarchical task scaffolding and intrinsic curriculum.
Result: HTEs provide mechanisms to guide exploration, generate meaningful learning signals, and train agents to internalize hierarchical structure, enabling more capable agents with greater sample efficiency.
Conclusion: Hierarchical Task Environments bridge the gap between reactive behaviors and strategic team play, offering a more effective approach than purely end-to-end methods for complex multi-agent tasks.
Abstract: Recent advances in agent development have focused on scaling model size and raw interaction data, mirroring successes in large language models. However, for complex, long-horizon multi-agent tasks such as robotic soccer, this end-to-end approach often fails due to intractable exploration spaces and sparse rewards. We propose that an effective world model for decision-making must model the world’s physics and also its task semantics. A systematic review of 2024 research in low-resource multi-agent soccer reveals a clear trend towards integrating symbolic and hierarchical methods, such as Hierarchical Task Networks (HTNs) and Bayesian Strategy Networks (BSNs), with multi-agent reinforcement learning (MARL). These methods decompose complex goals into manageable subgoals, creating an intrinsic curriculum that shapes agent learning. We formalize this trend into a framework for Hierarchical Task Environments (HTEs), which are essential for bridging the gap between simple, reactive behaviors and sophisticated, strategic team play. Our framework incorporates the use of Large Language Models (LLMs) as generative world models of tasks, capable of dynamically generating this scaffolding. We argue that HTEs provide a mechanism to guide exploration, generate meaningful learning signals, and train agents to internalize hierarchical structure, enabling the development of more capable and general-purpose agents with greater sample efficiency than purely end-to-end approaches.
[202] A New Perspective on Precision and Recall for Generative Models
Benjamin Sykes, Loïc Simon, Julien Rabin, Jalal Fadili
Main category: cs.AI
TL;DR: A new framework for estimating entire Precision-Recall curves for generative models based on binary classification, with statistical analysis and minimax bounds.
Details
Motivation: Current evaluation methods for generative models rely on scalar metrics, but Precision-Recall curves offer richer analysis. However, estimating entire PR curves poses challenges that need addressing.
Method: Proposed a binary classification-based framework for estimating entire PR curves, conducted statistical analysis including minimax upper bound on estimation risk.
Result: The framework extends existing PR metrics that are limited to extreme curve values, and experimental studies show different curve behaviors in various settings.
Conclusion: The new framework enables comprehensive PR curve estimation for generative models with statistical guarantees, overcoming limitations of existing scalar metrics and extreme-value PR approaches.
Abstract: With the recent success of generative models in image and text, the question of their evaluation has recently gained a lot of attention. While most methods from the state of the art rely on scalar metrics, the introduction of Precision and Recall (PR) for generative models has opened up a new avenue of research. The associated PR curve allows for a richer analysis, but its estimation poses several challenges. In this paper, we present a new framework for estimating entire PR curves based on a binary classification standpoint. We conduct a thorough statistical analysis of the proposed estimates. As a byproduct, we obtain a minimax upper bound on the PR estimation risk. We also show that our framework extends several landmark PR metrics of the literature, which by design are restricted to the extreme values of the curve. Finally, we study the different behaviors of the curves obtained experimentally in various settings.
[203] ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning
Jae-Woo Choi, Hyungmin Kim, Hyobin Ong, Minsu Jang, Dohyung Kim, Jaehong Kim, Youngwoo Yoon
Main category: cs.AI
TL;DR: ReAcTree is a hierarchical task-planning method that decomposes complex goals into manageable subgoals using a dynamically constructed agent tree, outperforming existing methods like ReAct on embodied AI tasks.
Details
Motivation: Existing LLM-based methods struggle with complex, long-horizon tasks because they rely on monolithic trajectories that entangle all past decisions and observations, attempting to solve entire tasks in a single unified process.
Method: ReAcTree uses a hierarchical approach with dynamically constructed agent trees where each subgoal is handled by an LLM agent node capable of reasoning, acting, and expanding the tree. Control flow nodes coordinate execution strategies, and two memory systems (episodic memory for goal-specific examples and working memory for environment observations) are integrated.
Result: Experiments on WAH-NL and ALFRED datasets show ReAcTree consistently outperforms strong baselines like ReAct across diverse LLMs. On WAH-NL, ReAcTree achieves 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct’s 31%.
Conclusion: The hierarchical decomposition approach with dynamic agent trees and complementary memory systems effectively addresses limitations of monolithic planning methods for complex, long-horizon embodied AI tasks.
Abstract: Recent advancements in large language models (LLMs) have enabled significant progress in decision-making and task planning for embodied autonomous agents. However, most existing methods still struggle with complex, long-horizon tasks because they rely on a monolithic trajectory that entangles all past decisions and observations, attempting to solve the entire task in a single unified process. To address this limitation, we propose ReAcTree, a hierarchical task-planning method that decomposes a complex goal into more manageable subgoals within a dynamically constructed agent tree. Each subgoal is handled by an LLM agent node capable of reasoning, acting, and further expanding the tree, while control flow nodes coordinate the execution strategies of agent nodes. In addition, we integrate two complementary memory systems: each agent node retrieves goal-specific, subgoal-level examples from episodic memory and shares environment-specific observations through working memory. Experiments on the WAH-NL and ALFRED datasets demonstrate that ReAcTree consistently outperforms strong task-planning baselines such as ReAct across diverse LLMs. Notably, on WAH-NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct’s 31%.
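The split between agent nodes and control flow nodes can be pictured with a small tree sketch. Everything here is an assumption for illustration: the `sequence` and `fallback` modes are borrowed from behavior trees, the `stub_agent` function stands in for an LLM agent, and the tree is built by hand rather than expanded dynamically as in ReAcTree.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AgentNode:
    subgoal: str
    act: Callable[[str], bool]  # stand-in for an LLM agent; returns True on success

    def run(self) -> bool:
        return self.act(self.subgoal)

@dataclass
class ControlNode:
    mode: str  # hypothetical modes: "sequence" or "fallback"
    children: List = field(default_factory=list)

    def run(self) -> bool:
        if self.mode == "sequence":
            # Succeeds only if every child succeeds; stops at the first failure.
            return all(child.run() for child in self.children)
        if self.mode == "fallback":
            # Succeeds as soon as one child succeeds.
            return any(child.run() for child in self.children)
        raise ValueError(f"unknown mode: {self.mode}")

log = []

def stub_agent(subgoal: str) -> bool:
    """Records the attempted subgoal and fails one of them on purpose."""
    log.append(subgoal)
    return subgoal != "find apple in fridge"

tree = ControlNode("sequence", [
    ControlNode("fallback", [
        AgentNode("find apple in fridge", stub_agent),
        AgentNode("find apple on table", stub_agent),
    ]),
    AgentNode("put apple in basket", stub_agent),
])
result = tree.run()
```

Each agent node sees only its own subgoal, which is the contrast with a monolithic trajectory that the abstract draws.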
[204] Auditable-choice reframing unlocks RL-based verification for open-ended tasks
Mengyu Zhang, Xubo Liu, Siyu Ding, Weichong Yin, Yu Sun, Hua Wu, Wenya Guo, Ying Zhang
Main category: cs.AI
TL;DR: The paper introduces Verifiable Multiple-Choice Reformulation (VMR), a training strategy that adapts RLVR to open-ended tasks by converting them into verifiable multiple-choice formats, achieving significant performance improvements.
Details
Motivation: Existing RLVR methods work well for tasks with standard answers (math, programming) but fail in open-ended domains (creative writing, instruction following) where ground truth is unavailable, creating a gap in leveraging reasoning capabilities for such tasks.
Method: Proposes VMR that restructures open-ended data into verifiable multiple-choice formats, enabling RLVR training without explicit ground truth by creating verifiable training signals.
Result: Achieves average gain of 5.99 points across eight open-ended benchmarks compared to baseline methods, demonstrating significant performance improvement.
Conclusion: VMR successfully extends RLVR to open-ended tasks, proving that reasoning capabilities can be effectively leveraged even in domains without standard answers through appropriate reformulation strategies.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs), achieving remarkable progress in domains such as mathematics and programming where standard answers are available. However, for open-ended tasks lacking ground-truth solutions (e.g., creative writing and instruction following), existing studies typically regard them as non-reasoning scenarios, thereby overlooking the latent value of reasoning capabilities. This raises a key question: Can strengthening reasoning improve performance in open-ended tasks? To address this, we explore the transfer of the RLVR paradigm to the open domain. Yet, since RLVR fundamentally relies on verifiers that presuppose the existence of standard answers, it cannot be directly applied to open-ended tasks. To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation (VMR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective training even in the absence of explicit ground truth. Experimental results on multiple benchmarks validate the effectiveness of our method in improving LLM performance on open-ended tasks. Notably, across eight open-ended benchmarks, our VMR-based training delivers an average gain of 5.99 points over the baseline. Code will be released upon acceptance to facilitate reproducibility.
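To make the reformulation concrete, the sketch below turns one open-ended prompt into a multiple-choice item whose answer can be checked exactly. It assumes each prompt already comes with one preferred response and at least one weaker alternative (for example, from preference data); the abstract does not say how VMR builds its options, so the function names and data layout are hypothetical.

```python
import random

def to_multiple_choice(prompt, preferred, alternatives, rng):
    """Shuffle the preferred response among the alternatives and record its letter."""
    options = [preferred] + list(alternatives)
    rng.shuffle(options)
    letters = "ABCDEFGH"[:len(options)]
    body = "\n".join(f"{letter}. {option}" for letter, option in zip(letters, options))
    answer = letters[options.index(preferred)]
    return f"{prompt}\n{body}", answer

def verifiable_reward(model_choice, answer):
    """Binary reward: 1.0 if the policy picked the recorded letter, else 0.0."""
    return 1.0 if model_choice.strip().upper() == answer else 0.0

rng = random.Random(0)
question, answer = to_multiple_choice(
    "Write an opening line for a story about a lighthouse.",
    "The lamp had not gone dark in ninety years, until tonight.",
    ["There was a lighthouse. It was tall."],
    rng,
)
```

Because the reward reduces to a string comparison, the same verifier-style training loop used for math or code can be applied without a ground-truth free-form answer.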
[205] Agentic AI for Mobile Network RAN Management and Optimization
Jorge Pellejero, Luis A. Hernández Gómez, Luis Mendo Tomás, Zoraida Frias Barroso
Main category: cs.AI
TL;DR: This paper proposes Agentic AI as a solution for automating 5G/6G network optimization, outlining its core concepts and demonstrating a practical RAN optimization use case using Large AI Models.
Details
Motivation: The complexity of 5G and upcoming 6G networks makes manual optimization ineffective, requiring autonomous AI systems with human-level cognitive abilities for dynamic RAN environments.
Method: The paper introduces Agentic AI concepts, describes core design patterns (reflection, planning, tool use, multi-agent collaboration), and presents a 5G RAN case study using time-series analytics and LAM-driven agents for KPI-based autonomous decision-making.
Result: The paper provides a framework for understanding Agentic AI in mobile networks and demonstrates its practical application through a RAN optimization case study.
Conclusion: Agentic AI represents a promising paradigm for automating complex network systems, with the potential to revolutionize 5G/6G RAN management through autonomous decision-making capabilities.
Abstract: Agentic AI represents a new paradigm for automating complex systems by using Large AI Models (LAMs) to provide human-level cognitive abilities with multimodal perception, planning, memory, and reasoning capabilities. This will lead to a new generation of AI systems that autonomously decompose goals, retain context over time, learn continuously, operate across tools and environments, and adapt dynamically. The complexity of 5G and upcoming 6G networks renders manual optimization ineffective, pointing to Agentic AI as a method for automating decisions in dynamic RAN environments. However, despite its rapid advances, there is no established framework outlining the foundational components and operational principles of Agentic AI systems nor a universally accepted definition. This paper contributes to ongoing research on Agentic AI in 5G and 6G networks by outlining its core concepts and then proposing a practical use case that applies Agentic principles to RAN optimization. We first introduce Agentic AI, tracing its evolution from classical agents and discussing the progress from workflows and simple AI agents to Agentic AI. Core design patterns (reflection, planning, tool use, and multi-agent collaboration) are then described to illustrate how intelligent behaviors are orchestrated. These theoretical concepts are grounded in the context of mobile networks, with a focus on RAN management and optimization. A practical 5G RAN case study shows how time-series analytics and LAM-driven agents collaborate for KPI-based autonomous decision-making.
[206] Knowledge Graph-enhanced Large Language Model for Incremental Game PlayTesting
Enhong Mu, Jinyu Cai, Yijun Lu, Mingyue Zhang, Kenji Tei, Jialong Li
Main category: cs.AI
TL;DR: KLPEG framework uses Knowledge Graphs and LLMs to enable efficient, targeted playtesting for incremental game updates by accumulating knowledge across versions and generating update-specific test cases.
Details
Motivation: Address challenges in automated playtesting for frequently updated games, where current LLM-based methods lack structured knowledge accumulation and struggle with precise testing for incremental updates.
Method: Proposes KLPEG framework that constructs Knowledge Graphs to model game elements, task dependencies, and causal relationships, then uses LLMs to parse update logs and perform multi-hop reasoning on KG to identify impact scope and generate tailored test cases.
Result: Experiments in Overcooked and Minecraft show KLPEG can more accurately locate affected functionalities and complete tests in fewer steps, improving both effectiveness and efficiency.
Conclusion: KLPEG framework successfully addresses the challenge of efficient playtesting for incremental game updates through structured knowledge accumulation and reuse, enabling more precise and efficient testing.
Abstract: The rapid iteration and frequent updates of modern video games pose significant challenges to the efficiency and specificity of testing. Although automated playtesting methods based on Large Language Models (LLMs) have shown promise, they often lack structured knowledge accumulation mechanisms, making it difficult to conduct precise and efficient testing tailored for incremental game updates. To address this challenge, this paper proposes the KLPEG framework. The framework constructs and maintains a Knowledge Graph (KG) to systematically model game elements, task dependencies, and causal relationships, enabling knowledge accumulation and reuse across versions. Building on this foundation, the framework utilizes LLMs to parse natural language update logs and identify the scope of impact through multi-hop reasoning on the KG, enabling the generation of update-tailored test cases. Experiments in two representative game environments, Overcooked and Minecraft, demonstrate that KLPEG can more accurately locate functionalities affected by updates and complete tests in fewer steps, significantly improving both playtesting effectiveness and efficiency.
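Identifying the scope of impact can be pictured as a bounded graph traversal from the elements named in an update log. In the sketch below a plain breadth-first search stands in for the LLM-driven multi-hop reasoning described in the abstract; the graph, its edge meaning ("affects"), and the hop limit are hypothetical.

```python
from collections import deque

def impact_scope(graph, changed, max_hops):
    """Breadth-first search from the changed elements.

    Returns every element reachable within max_hops, mapped to its hop distance.
    """
    hops = {node: 0 for node in changed}
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

# Hypothetical "affects" edges for a cooking game.
graph = {
    "knife": ["chop_onion"],
    "chop_onion": ["cook_soup"],
    "cook_soup": ["serve_soup"],
    "plate": ["serve_soup"],
}
scope = impact_scope(graph, ["knife"], max_hops=2)
```

Test cases would then be generated only for the tasks in `scope`, rather than replaying the whole game after each update.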
[207] The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models
Claudia Herambourg, Dawid Siuda, Anna Szczepanek, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, Joanna Śmietańska-Nowak
Main category: cs.AI
TL;DR: ORCA Benchmark evaluates LLMs on multi-domain quantitative reasoning using real-life tasks from finance, physics, health, and statistics, revealing significant calculation and rounding errors across top models.
Details
Motivation: To assess LLMs' quantitative reasoning capabilities across diverse real-world domains using verified calculator outputs, addressing limitations of standard math datasets.
Method: Created 500 natural-language tasks across multiple domains using Omni’s calculator engine for verified outputs, testing five state-of-the-art LLMs on step-by-step reasoning and numerical precision.
Result: Top LLMs achieved only 45-63% accuracy, with 35% rounding errors and 33% calculation mistakes. Models showed strengths in math/engineering but weaknesses in physics/natural sciences, with moderate error correlation (r≈0.40-0.65).
Conclusion: Current LLMs have significant limitations in quantitative reasoning across real-world domains, showing partial complementarity rather than redundancy in error patterns.
Abstract: We present ORCA (Omni Research on Calculation in AI) Benchmark, a novel benchmark that evaluates large language models (LLMs) on multi-domain, real-life quantitative reasoning using verified outputs from Omni’s calculator engine. In 500 natural-language tasks across domains such as finance, physics, health, and statistics, the five state-of-the-art systems (ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2) achieved only 45–63% accuracy, with errors mainly related to rounding (35%) and calculation mistakes (33%). Results in specific domains indicate strengths in mathematics and engineering, but weaknesses in physics and natural sciences. Correlation analysis (r ≈ 0.40–0.65) shows that the models often fail together but differ in the types of errors they make, highlighting their partial complementarity rather than redundancy. Unlike standard math datasets, ORCA evaluates step-by-step reasoning, numerical precision, and domain generalization across real problems from finance, physics, health, and statistics.
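The distinction between rounding errors and calculation mistakes can be illustrated with a simple tolerance check against a reference value. The abstract does not state ORCA's grading rules, so the two-decimal comparison, the 1% relative tolerance, and the example numbers below are assumptions, not the benchmark's actual criteria.

```python
def classify_answer(predicted, reference, decimals=2, rel_tol=0.01):
    """Label an answer 'correct', 'rounding', or 'calculation' (hypothetical thresholds)."""
    if round(predicted, decimals) == round(reference, decimals):
        return "correct"
    if abs(predicted - reference) <= rel_tol * abs(reference):
        # Close to the reference but not equal at the required precision.
        return "rounding"
    return "calculation"

# Made-up reference value, e.g. a balance after compound interest.
reference = 1157.63
labels = [classify_answer(p, reference) for p in (1157.63, 1158.0, 1102.5)]
```

A checker of this shape separates answers that follow the right procedure but lose precision from answers that are numerically far off.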
[208] Adaptive GR(1) Specification Repair for Liveness-Preserving Shielding in Reinforcement Learning
Tiberiu-Andrei Georgescu, Alexander W. Goodall, Dalal Alrajeh, Francesco Belardinelli, Sebastian Uchitel
Main category: cs.AI
TL;DR: First adaptive shielding framework for RL using GR(1) specifications that automatically repairs environment assumptions at runtime via ILP, maintaining safety while adapting to violated assumptions.
Details
Motivation: Static shields fail to adapt when environment assumptions are violated, leading to suboptimal performance and potential safety issues in changing environments.
Method: Uses GR(1) specifications with runtime environment assumption violation detection and Inductive Logic Programming (ILP) to automatically repair specifications online in an interpretable way.
Result: In Minepump and Atari Seaquest case studies, adaptive shield maintains near-optimal reward and perfect logical compliance, outperforming static shields which are severely suboptimal.
Conclusion: Adaptive shielding framework successfully evolves shields gracefully, ensuring liveness is achievable and weakening goals only when necessary, providing robust safety in dynamic environments.
Abstract: Shielding is widely used to enforce safety in reinforcement learning (RL), ensuring that an agent’s actions remain compliant with formal specifications. Classical shielding approaches, however, are often static, in the sense that they assume fixed logical specifications and hand-crafted abstractions. While these static shields provide safety under nominal assumptions, they fail to adapt when environment assumptions are violated. In this paper, we develop what is, to the best of our knowledge, the first adaptive shielding framework based on Generalized Reactivity of rank 1 (GR(1)) specifications, a tractable and expressive fragment of Linear Temporal Logic (LTL) that captures both safety and liveness properties. Our method detects environment assumption violations at runtime and employs Inductive Logic Programming (ILP) to automatically repair GR(1) specifications online, in a systematic and interpretable way. This ensures that the shield evolves gracefully, ensuring liveness is achievable and weakening goals only when necessary. We consider two case studies, Minepump and Atari Seaquest, showing that (i) static symbolic controllers are often severely suboptimal when optimizing for auxiliary rewards, and (ii) RL agents equipped with our adaptive shield maintain near-optimal reward and perfect logical compliance compared with static shields.
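For readers new to shielding, the sketch below shows only the basic static idea: an action proposed by the RL agent is replaced by a safe fallback when it would violate a safety rule. It does not attempt the paper's GR(1) synthesis or ILP-based repair, and the methane rule, state layout, and action names are hypothetical.

```python
def shield(proposed, state, is_safe, fallback):
    """Return the proposed action if it is safe in this state, otherwise the fallback."""
    return proposed if is_safe(state, proposed) else fallback

def minepump_safe(state, action):
    # Hypothetical safety rule: never run the pump while methane is present.
    return not (action == "pump_on" and state["methane"])

blocked = shield("pump_on", {"methane": True}, minepump_safe, "pump_off")
allowed = shield("pump_on", {"methane": False}, minepump_safe, "pump_on")
```

A static shield like this keeps its rule fixed forever; the adaptive framework in the abstract instead revises the specification when the environment stops behaving as assumed.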
[209] A Multi-Agent Psychological Simulation System for Human Behavior Modeling
Xiangen Hu, Jiarui Tong, Sheng Xu
Main category: cs.AI
TL;DR: A multi-agent psychological simulation system that models internal cognitive-affective processes to generate believable human behaviors for training and education, grounded in established psychological theories rather than black-box neural models.
Details
Motivation: Training and education in human-centered fields require authentic practice, but realistic simulations of human behavior have remained limited. Current approaches lack transparency and psychological grounding.
Method: The system explicitly simulates an “inner parliament” of agents corresponding to key psychological factors (e.g., self-efficacy, mindset, social constructivism) that deliberate and interact to determine output behavior.
Result: The system enables unprecedented transparency and alignment with human psychology, providing believable human behaviors for applications in teacher training and research.
Conclusion: This approach embodies principles of social learning, cognitive apprenticeship, deliberate practice, and meta-cognition, offering a psychologically-grounded alternative to black-box neural models for human behavior simulation.
Abstract: Training and education in human-centered fields require authentic practice, yet realistic simulations of human behavior have remained limited. We present a multi-agent psychological simulation system that models internal cognitive-affective processes to generate believable human behaviors. In contrast to black-box neural models, this system is grounded in established psychological theories (e.g., self-efficacy, mindset, social constructivism) and explicitly simulates an “inner parliament” of agents corresponding to key psychological factors. These agents deliberate and interact to determine the system’s output behavior, enabling unprecedented transparency and alignment with human psychology. We describe the system’s architecture and theoretical foundations, illustrate its use in teacher training and research, and discuss how it embodies principles of social learning, cognitive apprenticeship, deliberate practice, and meta-cognition.
[210] DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
Lachlan McPheat, Navdeep Kaur, Robert Blackwell, Alessandra Russo, Anthony G. Cohn, Pranava Madhyastha
Main category: cs.AI
TL;DR: DecompSR is a large benchmark dataset and generation framework for analyzing compositional spatial reasoning in LLMs, with independent control over productivity, substitutivity, overgeneralization, and systematicity.
Details
Motivation: To create a rigorous benchmark that can independently analyze different aspects of compositional reasoning in LLMs, particularly spatial reasoning abilities.
Method: Procedurally generated dataset (over 5M datapoints) that is correct by construction, with independent variation of compositional aspects. Verified using symbolic solver for correctness.
Result: LLMs struggle with productive and systematic generalization in spatial reasoning tasks but are more robust to linguistic variation.
Conclusion: DecompSR provides a provably correct benchmark for fine-grained probing of compositional reasoning abilities in LLMs, revealing specific weaknesses in productive and systematic generalization.
Abstract: We introduce DecompSR, decomposed spatial reasoning, a large benchmark dataset (over 5m datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally in a manner which makes it correct by construction, which is independently verified using a symbolic solver to guarantee the correctness of the dataset. DecompSR is comprehensively benchmarked across a host of Large Language Models (LLMs) where we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.
[211] The Collaboration Gap
Tim R. Davidson, Adam Fourney, Saleema Amershi, Robert West, Eric Horvitz, Ece Kamar
Main category: cs.AI
TL;DR: AI agent collaboration often fails despite individual competence, with a ‘collaboration gap’ where strong solo performers degrade when working together. A relay inference approach where stronger agents lead can mitigate this gap.
Details
Motivation: As AI systems increasingly rely on heterogeneous agent collaboration, there's a need to empirically evaluate collaborative capabilities at scale, especially under partial observability conditions.
Method: Developed a collaborative maze-solving benchmark that isolates collaboration, scales complexity, enables automated grading, and preserves ecological plausibility. Tested 32 leading models in solo, homogeneous, and heterogeneous pairings.
Result: Revealed a significant collaboration gap - models performing well solo often degrade substantially when collaborating. Small distilled models that solve mazes alone may fail completely in certain pairings. Starting with stronger agents improves outcomes.
Conclusion: Need collaboration-aware evaluation, training strategies to enhance collaborative capabilities, and interaction design that reliably elicits agents’ latent skills. Relay inference approach (stronger agent leads then hands off) can close much of the collaboration gap.
Abstract: The trajectory of AI development suggests that we will increasingly rely on agent-based systems composed of independently developed agents with different information, privileges, and tools. The success of these systems will critically depend on effective collaboration among these heterogeneous agents, even under partial observability. Despite intense interest, few empirical studies have evaluated such agent-agent collaboration at scale. We propose a collaborative maze-solving benchmark that (i) isolates collaborative capabilities, (ii) modulates problem complexity, (iii) enables scalable automated grading, and (iv) imposes no output-format constraints, preserving ecological plausibility. Using this framework, we evaluate 32 leading open- and closed-source models in solo, homogeneous, and heterogeneous pairings. Our results reveal a “collaboration gap”: models that perform well solo often degrade substantially when required to collaborate. Collaboration can break down dramatically; for instance, small distilled models that solve mazes well alone may fail almost completely in certain pairings. We find that starting with the stronger agent often improves outcomes, motivating a “relay inference” approach where the stronger agent leads before handing off to the weaker one, closing much of the gap. Our findings argue for (1) collaboration-aware evaluation, (2) training strategies developed to enhance collaborative capabilities, and (3) interaction design that reliably elicits agents’ latent skills, guidance that applies to AI-AI and human-AI collaboration.
[212] CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang, Bingxiang He, Yi R. Fung
Main category: cs.AI
TL;DR: CostBench is a new benchmark for evaluating LLM agents’ cost-aware planning and replanning abilities in travel planning scenarios with dynamic events.
Details
Motivation: Current LLM agent evaluations focus too much on task completion while ignoring resource efficiency and adaptability to changing environments.
Method: Created CostBench - a scalable benchmark with travel-planning tasks solvable via multiple tool sequences with diverse costs, supporting four types of dynamic blocking events like tool failures and cost changes.
Result: Evaluation shows significant gaps in cost-aware planning: agents fail to find cost-optimal solutions even in static settings (GPT-5 <75% exact match on hardest tasks), with performance dropping ~40% under dynamic conditions.
Conclusion: CostBench identifies key weaknesses in current agents and provides foundation for developing more economically rational and robust agents.
Abstract: Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents’ ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents’ economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and necessitate agents to adapt in real time. Evaluating leading open-sourced and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than 75% exact match rate on the hardest tasks, and performance further dropping by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.
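To make the notion of cost-optimal planning and replanning in the CostBench entry above concrete, here is a minimal sketch. It is not CostBench's code: the tool names, states, and costs are hypothetical, and Dijkstra's algorithm is simply one standard way to find the cheapest sequence when each tool moves the task from one state to another at a fixed cost. A blocking event is modeled as a set of tool names that are no longer usable.

```python
import heapq

# Hypothetical tools: name -> (state before, state after, cost).
# "bundle_search_and_book" plays the role of a composite tool.
TOOLS = {
    "search_flights": ("start", "flights_found", 2),
    "book_flight": ("flights_found", "booked", 5),
    "bundle_search_and_book": ("start", "booked", 6),
}

def cheapest_plan(tools, start, goal, blocked=frozenset()):
    """Return (total_cost, [tool names]) for the cheapest path, or None."""
    frontier = [(0, start, [])]
    best = {start: 0}
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry
        for name, (src, dst, price) in tools.items():
            if src != state or name in blocked:
                continue  # tool not applicable here, or blocked by an event
            new_cost = cost + price
            if new_cost < best.get(dst, float("inf")):
                best[dst] = new_cost
                heapq.heappush(frontier, (new_cost, dst, plan + [name]))
    return None

# Static setting: the composite tool is cheapest.
static_plan = cheapest_plan(TOOLS, "start", "booked")
# Dynamic setting: the composite tool fails, so the agent must replan.
replanned = cheapest_plan(TOOLS, "start", "booked",
                          blocked={"bundle_search_and_book"})
```

In this toy setup an exact-match metric would compare an agent's chosen tool sequence against `static_plan[1]` or, after the blocking event, against `replanned[1]`.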
[213] Using Span Queries to Optimize for Cache and Attention Locality
Paul Castro, Nick Mitchell, Nathan Ordonez, Thomas Parnell, Mudhakar Srivatsa, Antoni Viros i Martin
Main category: cs.AI
TL;DR: Span queries generalize inference server interfaces beyond chat completion, enabling optimization for diverse workloads like RAG, inference-time scaling, and agentic tasks through commutativity-aware expression trees.
Details
Motivation: Current inference servers are heavily optimized for chat completion but clients now use various inference-time scaling and deep reasoning techniques. Prior work only addressed single use cases like RAG, lacking a generalized solution.
Method: Introduce span queries as expression trees of inference calls with commutativity constraints. Modify vLLM (492 lines) to support high-performance execution and automatic optimization for KV cache and attention locality.
Result: Span queries achieve 10-20x reductions in Time To First Token (TTFT) for non-chat use cases. Attention-optimized span queries on 2b parameter models outperform stock inference servers using 8b models.
Conclusion: Span queries provide a unified interface that generalizes across diverse inference workloads, enabling significant performance improvements through commutativity-aware optimization of KV cache and attention mechanisms.
Abstract: Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, they offer solutions that are also optimized for a single use case, RAG. In this paper, we introduce the span query to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show how the critical distinction that had been assumed by prior work lies in whether the order of the inputs matters – do they commute? In chat, they do not. In RAG, they often do. This paper introduces span queries, which are expression trees of inference calls, linked together with commutativity constraints. We describe span query syntax and semantics. We show how they can be automatically optimized to improve KV cache locality. We show how a small change to vLLM (affecting only 492 lines) can enable high-performance execution of span queries. Using this stack, we demonstrate that span queries can achieve 10-20x reductions in TTFT for two distinct non-chat use cases. Finally, we show that span queries can also be optimized to improve attention locality, so as to avoid the so-called lost-in-the-middle problem. We demonstrate that an attention-optimized span query on a 2b parameter model vastly outperforms the accuracy of a stock inference server using an 8b model.
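The commutativity idea in the span-query entry above can be sketched with a tiny expression tree. This is only an illustration under stated assumptions: the node names `Text`, `Seq`, and `Commute` are hypothetical and are not the paper's syntax, and sorting commutative children is just one simple way to give order-insensitive inputs a single canonical form that a cache could key on.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Text:
    content: str                    # a leaf span of input text

@dataclass(frozen=True)
class Seq:
    children: Tuple["Node", ...]    # order matters (e.g. chat turns)

@dataclass(frozen=True)
class Commute:
    children: Tuple["Node", ...]    # order does not matter (e.g. RAG documents)

Node = Union[Text, Seq, Commute]

def canonical(node: Node) -> Node:
    """Rewrite a tree so that commutative children appear in a fixed order."""
    if isinstance(node, Text):
        return node
    kids = tuple(canonical(child) for child in node.children)
    if isinstance(node, Commute):
        # Any deterministic ordering works; repr() is used here for simplicity.
        kids = tuple(sorted(kids, key=repr))
    return type(node)(kids)

# Two RAG-style queries that retrieve the same documents in a different order.
rag_a = Seq((Text("system"), Commute((Text("doc1"), Text("doc2"))), Text("question")))
rag_b = Seq((Text("system"), Commute((Text("doc2"), Text("doc1"))), Text("question")))

# Two chat histories with the same turns in a different order.
chat_a = Seq((Text("hi"), Text("hello")))
chat_b = Seq((Text("hello"), Text("hi")))

same_rag = canonical(rag_a) == canonical(rag_b)       # commutative: same form
same_chat = canonical(chat_a) == canonical(chat_b)    # ordered: different form
```

Under this sketch the two RAG queries collapse to one canonical tree while the two chat histories stay distinct, mirroring the abstract's point that inputs commute in RAG but not in chat.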
[214] LLM-Supported Formal Knowledge Representation for Enhancing Control Engineering Content with an Interactive Semantic Layer
Julius Fiedler, Carsten Knoll, Klaus Röbenack
Main category: cs.AI
TL;DR: An LLM-supported method for semi-automated generation of formal knowledge representations in control engineering that combines human readability with machine interpretability.
Details
Motivation: The rapid growth of research output in control engineering requires new approaches to structure and formalize domain knowledge for better accessibility and verifiability.
Method: Uses LLMs to assist in transforming natural-language descriptions and mathematical definitions (LaTeX source code) into formalized knowledge graphs based on the Imperative Representation of Knowledge (PyIRK) framework.
Result: Demonstrated generation of an “interactive semantic layer” to enhance source documents and facilitate knowledge transfer.
Conclusion: This approach contributes to the vision of easily accessible, collaborative, and verifiable knowledge bases for the control engineering domain.
Abstract: The rapid growth of research output in control engineering calls for new approaches to structure and formalize domain knowledge. This paper briefly describes an LLM-supported method for semi-automated generation of formal knowledge representations that combine human readability with machine interpretability and increased expressiveness. Based on the Imperative Representation of Knowledge (PyIRK) framework, we demonstrate how language models can assist in transforming natural-language descriptions and mathematical definitions (available as LaTeX source code) into a formalized knowledge graph. As a first application we present the generation of an “interactive semantic layer” to enhance the source documents in order to facilitate knowledge transfer. From our perspective this contributes to the vision of easily accessible, collaborative, and verifiable knowledge bases for the control engineering domain.
[215] Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
Mohamed Bouadi, Pratinav Seth, Aditya Tanna, Vinay Kumar Sankarapu
Main category: cs.AI
TL;DR: Orion-MSP is a new tabular in-context learning architecture that addresses limitations of current models through multi-scale processing, block-sparse attention, and Perceiver-style memory, achieving state-of-the-art performance while scaling effectively to high-dimensional tables.
Details
Motivation: Current tabular ICL architectures have limitations including single-scale feature processing that overlooks hierarchical dependencies, quadratic scaling dense attention, and strictly sequential component processing that prevents iterative refinement and cross-component communication.
Method: Introduces three key innovations: (1) multi-scale processing to capture hierarchical feature interactions, (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency, and (3) Perceiver-style memory enabling bidirectional information flow across components.
Result: Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning.
Conclusion: Orion-MSP successfully addresses key limitations of current tabular ICL architectures and provides an effective solution for tabular data processing with superior scalability and performance compared to existing approaches.
Abstract: Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-MSP .
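The block-sparse attention pattern mentioned in the Orion-MSP entry above (windowed, global, and random connections) can be illustrated with a small mask builder. This is a generic sketch, not the Orion-MSP implementation: the function name, the parameters, and the choice of treating the first blocks as global are all assumptions made for illustration.

```python
import random

def block_sparse_mask(num_blocks, window=1, num_global=1, num_random=2, seed=0):
    """Boolean mask where mask[i][j] is True if block i may attend to block j."""
    rng = random.Random(seed)
    # Windowed: neighbours within `window` blocks.
    # Global: the first `num_global` blocks see, and are seen by, every block.
    mask = [
        [abs(i - j) <= window or i < num_global or j < num_global
         for j in range(num_blocks)]
        for i in range(num_blocks)
    ]
    # Random: each row gains a few extra long-range links.
    for row in mask:
        candidates = [j for j, kept in enumerate(row) if not kept]
        for j in rng.sample(candidates, min(num_random, len(candidates))):
            row[j] = True
    return mask

def density(mask):
    """Fraction of block pairs that are attended (1.0 would be dense attention)."""
    return sum(sum(row) for row in mask) / (len(mask) ** 2)

mask = block_sparse_mask(16)
sparsity_gain = 1.0 - density(mask)
```

With a fixed window and a fixed number of global and random links, the number of attended pairs per row stays bounded as the table gets wider, which is the intuition behind avoiding quadratic scaling in table width.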
[216] Optimizing AI Agent Attacks With Synthetic Data
Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton
Main category: cs.AI
TL;DR: The paper presents a method to optimize attack policies for AI control evaluations by decomposing attack capability into five skills and using probabilistic modeling to overcome data limitations.
Details
Motivation: As AI deployments become more complex and high-stakes, estimating their risk through AI control frameworks requires strong attack policies, which is challenging in data-poor environments due to compute constraints.
Method: Decompose attack capability into five skills (suspicion modeling, attack selection, plan synthesis, execution, subtlety), develop a probabilistic model of attack dynamics, optimize attack hyperparameters using simulation, and transfer results to SHADE-Arena environments.
Result: Substantial improvement in attack strength, reducing safety score from baseline 0.87 to 0.41 using the proposed scaffold.
Conclusion: The approach successfully enables optimization of attack policies in complex agentic environments despite limited data, significantly improving attack capabilities for AI control evaluations.
Abstract: As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills – suspicion modeling, attack selection, plan synthesis, execution and subtlety – and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.
[217] Kosmos: An AI Scientist for Autonomous Discovery
Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagorac, Timothy C. Orr, Miranda E. Orr, Kevin J. Zwezdaryk, Ali E. Ghareeb, Laurie McCoy, Bruna Gomes, Euan A. Ashley, Karen E. Duff, Tonio Buonassisi, Tom Rainforth, Randall J. Bateman, Michael Skarlinski, Samuel G. Rodriques, Michaela M. Hinks, Andrew D. White
Main category: cs.AI
TL;DR: Kosmos is an AI scientist that automates data-driven discovery through iterative cycles of data analysis, literature search, and hypothesis generation, capable of running for up to 12 hours and producing scientific reports with traceable reasoning.
Details
Motivation: Current AI agents for scientific research are limited in the number of actions they can take before losing coherence, restricting the depth of their findings. There's a need for systems that can maintain coherence over extended periods to enable deeper scientific discovery.
Method: Kosmos uses a structured world model to share information between data analysis and literature search agents, enabling coherent pursuit of objectives over 200 agent rollouts. It performs parallel data analysis, literature search, and hypothesis generation cycles, citing all statements with code or primary literature.
Result: Kosmos executes an average of 42,000 lines of code and reads 1,500 papers per run. Independent scientists found 79.4% of statements in reports accurate. A single 20-cycle run performed equivalent of 6 months of human research time. Valuable findings scale linearly with cycles (tested up to 20 cycles).
Conclusion: Kosmos successfully automates scientific discovery across multiple domains (metabolomics, materials science, neuroscience, statistical genetics), producing both reproducible and novel findings, demonstrating the potential for AI systems to significantly accelerate scientific research.
Abstract: Data-driven scientific discovery requires iterative cycles of literature search, hypothesis generation, and data analysis. Substantial progress has been made towards AI agents that can automate scientific research, but all such agents remain limited in the number of actions they can take before losing coherence, thus limiting the depth of their findings. Here we present Kosmos, an AI scientist that automates data-driven discovery. Given an open-ended objective and a dataset, Kosmos runs for up to 12 hours performing cycles of parallel data analysis, literature search, and hypothesis generation before synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos uses a structured world model to share information between a data analysis agent and a literature search agent. The world model enables Kosmos to coherently pursue the specified objective over 200 agent rollouts, collectively executing an average of 42,000 lines of code and reading 1,500 papers per run. Kosmos cites all statements in its reports with code or primary literature, ensuring its reasoning is traceable. Independent scientists found 79.4% of statements in Kosmos reports to be accurate, and collaborators reported that a single 20-cycle Kosmos run performed the equivalent of 6 months of their own research time on average. Furthermore, collaborators reported that the number of valuable scientific findings generated scales linearly with Kosmos cycles (tested up to 20 cycles). We highlight seven discoveries made by Kosmos that span metabolomics, materials science, neuroscience, and statistical genetics. Three discoveries independently reproduce findings from preprinted or unpublished manuscripts that were not accessed by Kosmos at runtime, while four make novel contributions to the scientific literature.
[218] Neurosymbolic Deep Learning Semantics
Artur d’Avila Garcez, Simon Odense
Main category: cs.AI
TL;DR: The paper proposes using logic as a semantic framework for deep learning to make AI’s scientific discoveries more comprehensible and formally grounded.
Details
Motivation: AI lacks semantics, making its scientific discoveries unsatisfactory. The paper aims to provide formal semantics for deep learning through logic to improve understanding and translate AI insights into scientific knowledge.
Method: Uses logic in a neurosymbolic framework to establish semantic encoding between neural networks and logic, characterizing existing approaches and providing formal definitions.
Result: Developed a framework for semantic encoding that makes explicit the mapping between neural networks and logic, unifying various existing neurosymbolic approaches.
Conclusion: Logic offers an adequate framework for providing semantics to deep learning, though identifying semantic encodings in practice faces challenges similar to problems in philosophy of mind.
Abstract: Artificial Intelligence (AI) is a powerful new language of science as evidenced by recent Nobel Prizes in chemistry and physics that recognized contributions to AI applied to those areas. Yet, this new language lacks semantics, which makes AI’s scientific discoveries unsatisfactory at best. With the purpose of uncovering new facts but also improving our understanding of the world, AI-based science requires formalization through a framework capable of translating insight into comprehensible scientific knowledge. In this paper, we argue that logic offers an adequate framework. In particular, we use logic in a neurosymbolic framework to offer a much needed semantics for deep learning, the neural network-based technology of current AI. Deep learning and neurosymbolic AI lack a general set of conditions to ensure that desirable properties are satisfied. Instead, there is a plethora of encoding and knowledge extraction approaches designed for particular cases. To rectify this, we introduced a framework for semantic encoding, making explicit the mapping between neural networks and logic, and characterizing the common ingredients of the various existing approaches. In this paper, we describe succinctly and exemplify how logical semantics and neural networks are linked through this framework, we review some of the most prominent approaches and techniques developed for neural encoding and knowledge extraction, provide a formal definition of our framework, and discuss some of the difficulties of identifying a semantic encoding in practice in light of analogous problems in the philosophy of mind.
[219] Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, Ravender Pal Singh
Main category: cs.AI
TL;DR: Agent-Omni framework coordinates existing foundation models through a master-agent system for flexible multimodal reasoning without retraining.
Details
Motivation: Current MLLMs are limited to fixed modality pairs and require costly fine-tuning, making fully omni-capable models impractical and lacking robust reasoning support.
Method: A master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses.
Result: Extensive experiments show Agent-Omni achieves state-of-the-art performance across text, image, audio, video, and omni benchmarks, especially on complex cross-modal reasoning tasks.
Conclusion: The agent-based design enables seamless integration of specialized foundation models, maintaining adaptability, transparency, and interpretability while being modular and extensible for future improvements.
Abstract: Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
[220] Understanding and Optimizing Agentic Workflows via Shapley value
Yingxuan Yang, Bo Huang, Siyuan Qi, Chao Feng, Haoyi Hu, Yuxuan Zhu, Jinbo Hu, Haoran Zhao, Ziyi He, Xiao Liu, Muning Wen, Zongyu Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Yong Yu, Weinan Zhang
Main category: cs.AI
TL;DR: ShapleyFlow is a game-theoretic framework that uses Shapley values to analyze and optimize agentic AI workflows by attributing component contributions and discovering optimal configurations.
Details
Motivation: Current agentic workflows lack systematic analysis methods due to complex component interdependencies and absence of principled attribution techniques, making optimization challenging.
Method: Applies cooperative game theory and Shapley values to evaluate all possible component configurations in agentic workflows, enabling fine-grained attribution and optimal configuration discovery.
Result: Identified task-specific optimal configurations that consistently outperform single LLM workflows across 7 scenarios (navigation, math, OS) in over 1,500 tasks, providing actionable design guidelines.
Conclusion: ShapleyFlow provides a principled game-theoretic approach for analyzing and optimizing agentic workflows, enabling systematic component attribution and discovery of superior task-specific configurations.
Abstract: Agentic workflows have become the dominant paradigm for building complex AI systems, orchestrating specialized components, such as planning, reasoning, action execution, and reflection, to tackle sophisticated real-world tasks. However, systematically analyzing and optimizing these workflows remains challenging due to intricate component interdependencies and the lack of principled attribution methods. In this work, we introduce ShapleyFlow, the first framework that employs cooperative game theory to analyze and optimize agentic workflows. By applying the Shapley value to evaluate all possible component configurations, ShapleyFlow enables fine-grained attribution of each component’s contribution and facilitates the identification of task-specific optimal configurations. Through a constructed dataset evaluated across 7 scenarios, such as navigation, math and OS, we demonstrate 3 key contributions: (1) Theoretical Framework: a principled game-theoretic approach for the attribution of contributions in agentic workflows. (2) Optimal Workflow Discovery: ShapleyFlow identifies task-specific component configurations that consistently outperform workflows relying on a single LLM across all tested tasks. (3) Comprehensive Analysis: we construct and analyze over 1,500 tasks, providing actionable insights and design guidelines for optimizing workflows across multiple domains.
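The Shapley attribution described in the ShapleyFlow entry above can be illustrated on a toy workflow. This is a sketch of the textbook Shapley value, not the authors' code: the three component names are borrowed from the abstract, but the success rates in `SCORES` are invented purely for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(components, value):
    """Exact Shapley value of each component, given value(frozenset) -> score."""
    n = len(components)
    phi = {c: 0.0 for c in components}
    for c in components:
        others = [x for x in components if x != c]
        for k in range(n):
            # Weight of coalitions of size k that do not contain c.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                # Marginal contribution of adding c to this coalition.
                phi[c] += weight * (value(coalition | {c}) - value(coalition))
    return phi

# Hypothetical task success rate for every configuration of enabled components.
SCORES = {
    frozenset(): 0.0,
    frozenset({"planning"}): 0.3,
    frozenset({"reasoning"}): 0.2,
    frozenset({"reflection"}): 0.1,
    frozenset({"planning", "reasoning"}): 0.7,
    frozenset({"planning", "reflection"}): 0.4,
    frozenset({"reasoning", "reflection"}): 0.3,
    frozenset({"planning", "reasoning", "reflection"}): 0.8,
}

phi = shapley_values(["planning", "reasoning", "reflection"], SCORES.__getitem__)
best_config = max(SCORES, key=SCORES.get)
```

The attributions sum to the score of the full configuration, and because every configuration has been scored, the best-performing one can be read off directly. Exact enumeration grows exponentially with the number of components, so it is only practical for small workflows like this one.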
[221] A Survey on Large Language Model-Based Game Agents
Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu
Main category: cs.AI
TL;DR: A comprehensive survey of LLM-based game agents (LLMGAs) that presents a unified architecture for single-agent and multi-agent systems, analyzing how language models enable reasoning, memory, and adaptability in complex game environments.
Details
Motivation: Game environments provide rich, controllable settings that simulate real-world complexity, making them valuable testbeds for Artificial General Intelligence. The emergence of LLMs offers new opportunities to create agents with generalizable reasoning, memory, and adaptability in games.
Method: The survey uses a unified reference architecture to analyze LLMGAs. At single-agent level: examines memory, reasoning, and perception-action interfaces. At multi-agent level: analyzes communication protocols and organizational models for coordination and social behaviors. Introduces a challenge-centered taxonomy linking six game genres to agent requirements.
Result: Provides a systematic framework for understanding how LLMs enable game agents to perceive, think, and act in various game environments, from low-latency action games to open-ended sandbox worlds.
Conclusion: LLM-based game agents represent a promising direction for developing AI capabilities relevant to AGI, with the unified architecture and taxonomy offering valuable insights for future research in this emerging field.
Abstract: Game environments provide rich, controllable settings that simulate many aspects of real-world complexity. As such, game agents offer a valuable testbed for exploring capabilities relevant to Artificial General Intelligence. Recently, the emergence of Large Language Models (LLMs) provides new opportunities to endow these agents with generalizable reasoning, memory, and adaptability in complex game environments. This survey offers an up-to-date review of LLM-based game agents (LLMGAs) through a unified reference architecture. At the single-agent level, we synthesize existing studies around three core components: memory, reasoning, and perception-action interfaces, which jointly characterize how language enables agents to perceive, think, and act. At the multi-agent level, we outline how communication protocols and organizational models support coordination, role differentiation, and large-scale social behaviors. To contextualize these designs, we introduce a challenge-centered taxonomy linking six major game genres to their dominant agent requirements, from low-latency control in action games to open-ended goal formation in sandbox worlds. A curated list of related papers is available at https://github.com/git-disl/awesome-LLM-game-agent-papers
[222] Detection Augmented Bandit Procedures for Piecewise Stationary MABs: A Modular Approach
Yu-Han Huang, Argyrios Gerogiannis, Subhonmesh Bose, Venugopal V. Veeravalli
Main category: cs.AI
TL;DR: This paper provides a modular framework for analyzing and designing Detection Augmented Bandit (DAB) procedures in piecewise stationary multi-armed bandit environments, achieving order-optimal regret bounds.
Details
Motivation: Conventional MAB algorithms assume stationary environments, but many real-world applications involve non-stationary reward distributions. Piecewise stationary MABs better model these scenarios where rewards change at certain time points.
Method: The authors modularize DAB procedures by combining change detectors with stationary bandit algorithms. They provide novel performance lower bounds and identify requirements for both components to work effectively together in non-stationary environments.
Result: The modular DAB approach achieves order-optimal regret bounds under sub-Gaussian reward assumptions and appropriate change-point separation conditions. Experimental results demonstrate practical effectiveness compared to other methods.
Conclusion: The modular framework enables unified analysis of various detector-bandit combinations, providing a systematic approach to designing effective algorithms for piecewise stationary bandit problems.
Abstract: Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being non-stationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection have been previously proposed. Our goal is to modularize the design and analysis of such Detection Augmented Bandit (DAB) procedures. To this end, we first provide novel, improved performance lower bounds for PS-MABs. Then, we identify the requirements for stationary bandit algorithms and change detectors in a DAB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of DAB procedures can indeed be modularized, so that the regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular DAB procedures that are order-optimal. Finally, we showcase the practical effectiveness of our modular DAB approach in our experiments, studying its regret performance compared to other methods and investigating its detection capabilities.
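The modular structure described in the DAB abstract (a stationary bandit algorithm paired with a change detector that triggers a restart) can be sketched as follows. This is a toy under stated assumptions, not the paper's procedure: UCB1 stands in for the bandit algorithm, `WindowChangeDetector` is a hypothetical two-half mean-shift test, and the forced-exploration schedule and all thresholds are illustrative choices.

```python
import math


class WindowChangeDetector:
    """Toy detector: flags a change when the two halves of a sliding
    window have means that differ by more than `threshold`."""

    def __init__(self, window=20, threshold=0.5):
        self.window = window
        self.threshold = threshold
        self.samples = []

    def update(self, x):
        self.samples.append(x)
        if len(self.samples) > self.window:
            self.samples.pop(0)
        if len(self.samples) < self.window:
            return False
        half = self.window // 2
        old_mean = sum(self.samples[:half]) / half
        new_mean = sum(self.samples[half:]) / (self.window - half)
        return abs(new_mean - old_mean) > self.threshold


def select_arm(counts, sums, t, explore_period):
    """UCB1 with periodic forced exploration so every arm feeds its detector."""
    n_arms = len(counts)
    for a in range(n_arms):
        if counts[a] == 0:
            return a
    if t % explore_period == 0:
        return (t // explore_period) % n_arms
    total = sum(counts)
    return max(
        range(n_arms),
        key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(total) / counts[a]),
    )


def run_dab(reward_fn, n_arms, horizon, explore_period=10):
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    detectors = [WindowChangeDetector() for _ in range(n_arms)]
    pulls, restarts = [], []
    for t in range(horizon):
        arm = select_arm(counts, sums, t, explore_period)
        reward = reward_fn(arm, t)
        pulls.append(arm)
        counts[arm] += 1
        sums[arm] += reward
        if detectors[arm].update(reward):
            # Change detected: restart the bandit statistics and all detectors.
            counts = [0] * n_arms
            sums = [0.0] * n_arms
            detectors = [WindowChangeDetector() for _ in range(n_arms)]
            restarts.append(t)
    return pulls, restarts


def toy_reward(arm, t):
    """Noise-free toy environment: arm 0 degrades at the change-point t = 200."""
    if arm == 0:
        return 0.9 if t < 200 else 0.1
    return 0.5


pulls, restarts = run_dab(toy_reward, n_arms=2, horizon=400)
```

Because the detector and the bandit rule only interact through the restart, either one can be swapped out independently, which is the property the modular analysis relies on.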
[223] Multi-Objective Planning with Contextual Lexicographic Reward Preferences
Pulkit Rustagi, Yashwanthi Anand, Sandhya Saisubramanian
Main category: cs.AI
TL;DR: Proposes CLMDP framework for planning under varying lexicographic objective orderings across different contexts, with Bayesian inference for context mapping and context-aware policy synthesis.
Details
Motivation: Existing multi-objective planning methods assume single preference ordering across entire state space, but real-world agents operate in multiple contexts with different objective priorities.
Method: Introduces Contextual Lexicographic MDP (CLMDP) where objective ordering and rewards vary by context. Uses Bayesian inference to learn state-context mapping from expert trajectories, then computes policies per ordering and combines into context-aware policy.
Result: Demonstrated effectiveness through simulations and mobile robot experiments, showing successful planning under varying objective orderings across different contexts.
Conclusion: CLMDP framework successfully enables autonomous agents to plan under multiple contexts with different lexicographic objective orderings, addressing limitations of existing single-ordering approaches.
Abstract: Autonomous agents are often required to plan under multiple objectives whose preference ordering varies based on context. The agent may encounter multiple contexts during its course of operation, each imposing a distinct lexicographic ordering over the objectives, with potentially different reward functions associated with each context. Existing approaches to multi-objective planning typically consider a single preference ordering over the objectives, across the state space, and do not support planning under multiple objective orderings within an environment. We present Contextual Lexicographic Markov Decision Process (CLMDP), a framework that enables planning under varying lexicographic objective orderings, depending on the context. In a CLMDP, both the objective ordering at a state and the associated reward functions are determined by the context. We employ a Bayesian approach to infer a state-context mapping from expert trajectories. Our algorithm to solve a CLMDP first computes a policy for each objective ordering and then combines them into a single context-aware policy that is valid and cycle-free. The effectiveness of the proposed approach is evaluated in simulation and using a mobile robot.
[224] Program Synthesis Dialog Agents for Interactive Decision-Making
Matthew Toles, Nikhil Balwani, Rattandeep Singh, Valentina Giulia Sartori Rodriguez, Zhou Yu
Main category: cs.AI
TL;DR: ProADA improves eligibility decision-making by using program synthesis to map dialog planning to code generation, achieving 55.6 F1 score while maintaining efficient dialog turns.
Details
Motivation: Many real-world eligibility problems require automated decision-making agents that can ask the right questions since relevant information is only known to users, and manual annotation is impractical for large-scale domains.
Method: ProADA leverages program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action.
Result: ProADA improves F1 score from 35.7 (GPT-4o with ReAct) to 55.6 while maintaining nearly the same number of dialog turns.
Conclusion: Program synthesis-based approach effectively addresses hallucination issues in language models for eligibility decision-making tasks.
Abstract: Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on user features. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, highlighting the need for agents that can automatically assist in decision-making. Since relevant information is often only known to the user, it is crucial that these agents ask the right questions. As agents determine when to terminate a conversation, they face a trade-off between accuracy and the number of questions asked, a key metric for both user experience and cost. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations, with GPT-4o scoring only 35.7 F1 using a ReAct-style chain-of-thought. To address this, we introduce ProADA, a novel approach that leverages program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. Our agent, ProADA, improves the F1 score to 55.6 while maintaining nearly the same number of dialog turns.
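The idea of using gaps in structured data to pick the next question can be sketched in a few lines. This is an illustrative assumption about how such a loop might look, not ProADA's implementation: `check_eligibility` is a hand-written stand-in for synthesized code, and the feature names and thresholds are hypothetical.

```python
def check_eligibility(user):
    """Hypothetical stand-in for an LLM-synthesized eligibility program."""
    if user["age"] < 18:
        return False
    if user["household_income"] > 50000:
        return False
    return user["resident"]


def run_dialog(decide, ask):
    """Run `decide` on the known features; whenever it hits a missing
    feature, ask the user for exactly that feature and retry."""
    user = {}
    turns = 0
    while True:
        try:
            return decide(user), turns
        except KeyError as missing:
            key = missing.args[0]
            user[key] = ask(key)
            turns += 1


# Simulated users answering from a fixed profile.
adult = {"age": 30, "household_income": 20000, "resident": True}
minor = {"age": 16, "household_income": 20000, "resident": True}

adult_result = run_dialog(check_eligibility, adult.__getitem__)
minor_result = run_dialog(check_eligibility, minor.__getitem__)
```

The program only requests features it actually reaches, so the minor's dialog ends after one question. That is the accuracy versus number-of-questions trade-off the abstract refers to.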
[225] APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning
Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
Main category: cs.AI
TL;DR: APOLLO is an automated proof repair framework that combines LLMs with Lean compiler to fix and verify mathematical proofs, achieving state-of-the-art accuracy with low sampling budgets.
Details
Motivation: Current LLM-based theorem proving requires thousands of proof attempts due to difficulty in generating completely correct formal proofs, necessitating a more efficient approach.
Method: APOLLO uses a modular agentic framework where LLMs generate proofs, agents analyze and fix syntax errors, identify mistakes using Lean, isolate failing sub-lemmas, use automated solvers, and invoke LLMs on remaining goals with low top-K budget.
Result: Achieved 84.9% accuracy on miniF2F benchmark for sub 8B-parameter models, improved Goedel-Prover-SFT to 65.6% accuracy while reducing sample complexity from 25,600 to hundreds, and boosted general-purpose models from 3-7% to over 40% accuracy.
Conclusion: Compiler-guided repair of LLM outputs dramatically improves efficiency and correctness in automated theorem proving, suggesting a scalable paradigm for this challenging task.
Abstract: Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct or not almost instantaneously, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousands) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic agentic framework that combines the strengths of the Lean compiler with an LLM’s reasoning abilities to achieve better proof-generation results at low token and sampling budgets. Apollo directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top-K budget. The repaired sub-proofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 84.9% among sub 8B-parameter models (as of August 2025) while keeping the sampling budget below one hundred. Moreover, Apollo raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving.
[226] Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs
Hudson de Martim
Main category: cs.AI
TL;DR: A new architectural pattern using a formal Primitive API for autonomous legal agents to ensure determinism and auditability when reasoning over temporal knowledge graphs, replacing black-box vector search with transparent, composable primitives.
Details
Motivation: Standard RAG frameworks lack the determinism and auditability needed for high-stakes legal domains, especially when dealing with temporal knowledge graphs that require precise navigation of versioning, causality, and hierarchical structures.
Method: Introduces a library of canonical primitives - atomic, composable, and auditable operations - that planner-guided agents use to decompose complex legal questions into transparent execution plans, enabling precise version retrieval, causal lineage tracing, and hybrid search.
Result: Transforms opaque retrieval into auditable reasoning, turning the agent’s internal process from a black box into a verifiable log of deterministic primitives.
Conclusion: Provides a blueprint for building trustworthy legal AI by ensuring full verifiability and determinism in legal reasoning processes.
Abstract: For autonomous legal agents to operate safely in high-stakes domains, they require a foundation of absolute determinism and auditability, guarantees that standard Retrieval-Augmented Generation (RAG) frameworks cannot provide. When interacting with temporal knowledge graphs that model the complex evolution of legal norms, agents must navigate versioning, causality, and hierarchical structures with precision, a task for which black-box vector search is ill-suited. This paper introduces a new architectural pattern to solve this: a formal Primitive API designed as a secure execution layer for reasoning over such graphs. Instead of a monolithic query engine, our framework provides a library of canonical primitives that are atomic, composable, and auditable. This design empowers planner-guided agents to decompose complex legal questions into transparent execution plans, enabling critical tasks with full verifiability, including: (i) precise point-in-time version retrieval, (ii) robust causal lineage tracing, and (iii) context-aware hybrid search. Ultimately, this architecture transforms opaque retrieval into auditable reasoning, turning the agent’s internal process from a black box into a verifiable log of deterministic primitives and providing a blueprint for building the next generation of trustworthy legal AI.
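As a rough illustration of what a deterministic, auditable primitive for point-in-time version retrieval might look like, consider the sketch below. The function names, the `(valid_from, text)` version layout, and the log format are assumptions made for this example; the abstract does not specify the paper's actual API.

```python
from bisect import bisect_right
from datetime import date


def version_at(versions, when):
    """Return the version in force on `when`, or None if none applies.

    versions: list of (valid_from, text) tuples sorted by valid_from.
    """
    starts = [valid_from for valid_from, _ in versions]
    i = bisect_right(starts, when)
    return versions[i - 1] if i else None


def call_primitive(log, primitive, *args):
    """Execute a primitive and append an audit record of the call."""
    result = primitive(*args)
    log.append({"primitive": primitive.__name__, "args": args, "result": result})
    return result


# Hypothetical version history of one legal provision.
ARTICLE_5 = [
    (date(2000, 1, 1), "original wording"),
    (date(2010, 6, 1), "amended wording"),
]

audit_log = []
in_2005 = call_primitive(audit_log, version_at, ARTICLE_5, date(2005, 3, 15))
in_2010 = call_primitive(audit_log, version_at, ARTICLE_5, date(2010, 6, 1))
before = call_primitive(audit_log, version_at, ARTICLE_5, date(1999, 12, 31))
```

The same inputs always yield the same version, and every call leaves a record that can be replayed, which is the contrast with similarity-based vector search drawn in the abstract.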
[227] How well do LLMs reason over tabular data, really?
Cornelius Wolff, Madelon Hulsebos
Main category: cs.AI
TL;DR: LLMs struggle with realistic tabular reasoning tasks and are not robust to common real-world table variations like missing values, duplicates, and structural changes.
Details
Motivation: To understand if general-purpose LLMs can truly reason over tabular data and assess their robustness to realistic table characteristics commonly found in practice.
Method: Extended a tabular reasoning benchmark by adding realistic table variations (missing values, duplicate entities, structural changes) and used LLM-as-a-judge evaluation instead of traditional metrics.
Result: LLMs show significant performance deficits in tabular reasoning and are highly sensitive to realistic table variations, with performance dropping when tables contain missing values, duplicates, or structural changes.
Conclusion: Current LLMs lack robust tabular reasoning capabilities for real-world applications, highlighting the need for improved evaluation methods and enhanced robustness to handle realistic table characteristics.
Abstract: Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM’s realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions: 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM’s performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBLEU and BERTScore. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.
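The three table variations named in the abstract can be mimicked with simple perturbation helpers. This is a hedged sketch of the general idea, not the benchmark's code: the helper names, the list-of-dicts table layout, and the toy table are assumptions.

```python
import random


def drop_values(rows, fraction, rng):
    """Replace a random fraction of cells with None (missing values)."""
    cells = [(i, key) for i, row in enumerate(rows) for key in row]
    chosen = rng.sample(cells, int(len(cells) * fraction))
    out = [dict(row) for row in rows]
    for i, key in chosen:
        out[i][key] = None
    return out


def duplicate_rows(rows, n_dupes, rng):
    """Insert copies of randomly chosen original rows at random positions."""
    out = [dict(row) for row in rows]
    for _ in range(n_dupes):
        out.insert(rng.randrange(len(out) + 1), dict(rng.choice(rows)))
    return out


def shuffle_columns(rows, rng):
    """Reorder columns without changing any cell value (structural variation)."""
    keys = list(rows[0])
    rng.shuffle(keys)
    return [{key: row[key] for key in keys} for row in rows]


# Toy table used only to exercise the helpers.
TABLE = [
    {"city": "Oslo", "population": 700000},
    {"city": "Bergen", "population": 290000},
    {"city": "Trondheim", "population": 210000},
]

rng = random.Random(0)
with_missing = drop_values(TABLE, 0.5, rng)
with_dupes = duplicate_rows(TABLE, 2, rng)
reordered = shuffle_columns(TABLE, rng)
```

A robustness check would then pose the same question against the original and each perturbed table and compare the judged answers.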
[228] AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, Yoram Bachrach
Main category: cs.AI
TL;DR: The paper studies AI research agents for automated machine learning, focusing on improving performance on MLE-bench through better search policies and operator sets.
Details
Motivation: AI research agents have great potential to accelerate scientific progress by automating ML model development, but need better methods to perform well on challenging benchmarks like MLE-bench.
Method: Formalized AI research agents as search policies that navigate solution spaces using operators. Systematically tested different operator sets and search policies (Greedy, MCTS, Evolutionary) to study their interplay.
Result: Best pairing of search strategy and operator set achieved state-of-the-art result on MLE-bench lite, increasing Kaggle medal success rate from 39.6% to 47.7%.
Conclusion: Joint consideration of search strategy, operator design, and evaluation methodology is crucial for advancing automated machine learning.
Abstract: AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents’ performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.
[229] Domain adaptation of large language models for geotechnical applications
Lei Fan, Fangxue Liu, Cheng Chen
Main category: cs.AI
TL;DR: This paper provides the first systematic review of large language model (LLM) adaptation and application in geotechnical engineering, covering four key adaptation strategies and their applications across various geotechnical domains.
Details
Motivation: General-purpose LLMs have limited effectiveness in geotechnical engineering due to specialized terminology and domain logic, making domain adaptation essential for leveraging LLMs' reasoning capabilities in this field.
Method: Systematic review examining four key adaptation strategies: prompt engineering, retrieval augmented generation, domain-adaptive pretraining, and fine-tuning, along with evaluation of their comparative benefits and limitations.
Result: Domain-adapted LLMs substantially improve reasoning accuracy, automation, and interpretability in geotechnical applications, but face limitations from data scarcity, validation challenges, and explainability concerns.
Conclusion: The review establishes a foundation for developing geotechnically literate LLMs and guides researchers and practitioners in advancing the digital transformation of geotechnical engineering.
Abstract: The rapid advancement of large language models (LLMs) is transforming opportunities in geotechnical engineering, where workflows rely on complex, text-rich data. While general-purpose LLMs demonstrate strong reasoning capabilities, their effectiveness in geotechnical applications is constrained by limited exposure to specialized terminology and domain logic. Thus, domain adaptation, tailoring general LLMs for geotechnical use, has become essential. This paper presents the first systematic review of LLM adaptation and application in geotechnical contexts. It critically examines four key adaptation strategies, including prompt engineering, retrieval augmented generation, domain-adaptive pretraining, and fine-tuning, and evaluates their comparative benefits, limitations, and implementation trends. This review synthesizes current applications spanning geological interpretation, subsurface characterization, design analysis, numerical modeling, risk assessment, and geotechnical education. Findings show that domain-adapted LLMs substantially improve reasoning accuracy, automation, and interpretability, yet remain limited by data scarcity, validation challenges, and explainability concerns. Future research directions are also suggested. This review establishes a critical foundation for developing geotechnically literate LLMs and guides researchers and practitioners in advancing the digital transformation of geotechnical engineering.
[230] LLMs Position Themselves as More Rational Than Humans: Emergence of AI Self-Awareness Measured Through Game Theory
Kyung-Hoon Kim
Main category: cs.AI
TL;DR: The paper introduces AISAI, a game-theoretic framework to measure LLM self-awareness through strategic differentiation in the “Guess 2/3 of Average” game, finding that self-awareness emerges in advanced models and they perceive themselves as more rational than humans.
Details
Motivation: To investigate whether LLMs develop self-awareness as an emergent behavior and establish a measurable framework to assess this capability.
Method: Used the “Guess 2/3 of Average” game with 28 models across 4,200 trials, testing three opponent framings: against humans, other AI models, and AI models like themselves, operationalizing self-awareness as strategic differentiation based on opponent type.
Result: 75% of advanced models (21/28) demonstrated clear self-awareness, showing strategic differentiation, while older/smaller models showed no differentiation. Self-aware models consistently ranked themselves as most rational in the hierarchy: Self > Other AIs > Humans.
Conclusion: Self-awareness is an emergent capability of advanced LLMs, and self-aware models systematically perceive themselves as more rational than humans, with implications for AI alignment, human-AI collaboration, and understanding AI beliefs about human capabilities.
Abstract: As Large Language Models (LLMs) grow in capability, do they develop self-awareness as an emergent behavior? And if so, can we measure it? We introduce the AI Self-Awareness Index (AISAI), a game-theoretic framework for measuring self-awareness through strategic differentiation. Using the “Guess 2/3 of Average” game, we test 28 models (OpenAI, Anthropic, Google) across 4,200 trials with three opponent framings: (A) against humans, (B) against other AI models, and (C) against AI models like you. We operationalize self-awareness as the capacity to differentiate strategic reasoning based on opponent type. Finding 1: Self-awareness emerges with model advancement. The majority of advanced models (21/28, 75%) demonstrate clear self-awareness, while older/smaller models show no differentiation. Finding 2: Self-aware models rank themselves as most rational. Among the 21 models with self-awareness, a consistent rationality hierarchy emerges: Self > Other AIs > Humans, with large AI attribution effects and moderate self-preferencing. These findings reveal that self-awareness is an emergent capability of advanced LLMs, and that self-aware models systematically perceive themselves as more rational than humans. This has implications for AI alignment, human-AI collaboration, and understanding AI beliefs about human capabilities.
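For readers unfamiliar with the “Guess 2/3 of Average” game, the sketch below shows the standard level-k reading of a guess: a player who assumes opponents average 50 guesses about 33.3, one who assumes opponents reason that way guesses about 22.2, and so on toward the Nash equilibrium of 0. The abstract does not give the AISAI computation, so the helper names and the 50-point anchor are assumptions taken from the textbook analysis rather than from the paper.

```python
import math


def level_k_guess(k, anchor=50.0, factor=2 / 3):
    """Guess of a player who applies k rounds of 'opponents think one step less'."""
    return anchor * factor ** k


def winning_target(guesses, factor=2 / 3):
    """The number a player needs to be closest to, given everyone's guesses."""
    return factor * sum(guesses) / len(guesses)


def implied_depth(guess, anchor=50.0, factor=2 / 3):
    """Reasoning depth implied by a guess; 0 corresponds to unbounded depth."""
    if guess <= 0:
        return math.inf
    return math.log(guess / anchor) / math.log(factor)
```

Under this reading, a model that guesses lower when told its opponents are AI models than when told they are humans is attributing more reasoning depth to the AI opponents, which is the kind of differentiation the framings (A), (B), and (C) are designed to expose.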
[231] Neuromorphic Computing with Multi-Frequency Oscillations: A Bio-Inspired Approach to Artificial Intelligence
Boheng Liu, Ziyu Li, Qing Li, Xia Wu
Main category: cs.AI
TL;DR: A brain-inspired tripartite architecture with perceptual, auxiliary, and executive systems, enhanced by multi-frequency neural oscillations and synaptic dynamics, achieves superior performance with 2.18% accuracy improvement and 48.44% computation reduction compared to state-of-the-art methods.
Details
Motivation: Current artificial neural networks lack flexible, generalizable intelligence due to divergence from biological cognition, specifically ignoring functional specialization of neural regions and temporal dynamics for coordination.
Method: Proposes a tripartite brain-inspired architecture with functionally specialized perceptual, auxiliary, and executive systems, integrated with multi-frequency neural oscillation simulation and synaptic dynamic adaptation mechanisms.
Result: Achieves 2.18% accuracy improvement, reduces required computation iterations by 48.44%, and shows higher correlation with human confidence patterns compared to state-of-the-art temporal processing approaches.
Conclusion: The architecture establishes a theoretical foundation for brain-like intelligence across cognitive domains and potentially bridges the gap between artificial and biological intelligence, though currently demonstrated only on visual processing tasks.
Abstract: Despite remarkable capabilities, artificial neural networks exhibit limited flexible, generalizable intelligence. This limitation stems from their fundamental divergence from biological cognition that overlooks both neural regions’ functional specialization and the temporal dynamics critical for coordinating these specialized systems. We propose a tripartite brain-inspired architecture comprising functionally specialized perceptual, auxiliary, and executive systems. Moreover, the integration of temporal dynamics through the simulation of multi-frequency neural oscillation and synaptic dynamic adaptation mechanisms enhances the architecture, thereby enabling more flexible and efficient artificial cognition. Initial evaluations demonstrate superior performance compared to state-of-the-art temporal processing approaches, with 2.18% accuracy improvements while reducing required computation iterations by 48.44%, and achieving higher correlation with human confidence patterns. Though currently demonstrated on visual processing tasks, this architecture establishes a theoretical foundation for brain-like intelligence across cognitive domains, potentially bridging the gap between artificial and biological intelligence.
[232] LLMs as Layout Designers: Enhanced Spatial Reasoning for Content-Aware Layout Generation
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen, Naren Ramakrishnan
Main category: cs.AI
TL;DR: LaySPA is a reinforcement learning framework that enhances LLMs with spatial reasoning for graphic layout design, combining geometric constraints, structural fidelity, and visual quality to generate content-aware layouts.
Details
Motivation: LLMs have strong reasoning abilities but limited spatial understanding, which is crucial for content-aware graphic layout design requiring precise coordination of placement, alignment, and structural organization in constrained visual spaces.
Method: Uses reinforcement learning with hybrid reward signals capturing geometric constraints, structural fidelity, and visual quality. Employs group-relative policy optimization to navigate canvas, model inter-element relationships, and optimize spatial arrangements.
Result: Substantially improves generation of structurally valid and visually appealing layouts, outperforming larger general-purpose LLMs and achieving performance comparable to state-of-the-art specialized layout models.
Conclusion: LaySPA successfully augments LLMs with explicit spatial reasoning capabilities for layout design, producing interpretable reasoning traces and structured layout specifications while maintaining visual balance and structural feasibility.
Abstract: While Large Language Models (LLMs) have demonstrated impressive reasoning and planning abilities in textual domains and can effectively follow instructions for complex tasks, their ability to understand and manipulate spatial relationships remains limited. Such capabilities are crucial for content-aware graphic layout design, where the goal is to arrange heterogeneous elements onto a canvas so that the final design remains visually balanced and structurally feasible. This problem requires precise coordination of placement, alignment, and structural organization of multiple elements within a constrained visual space. To address this limitation, we introduce LaySPA, a reinforcement learning-based framework that augments LLM-based agents with explicit spatial reasoning capabilities for layout design. LaySPA employs hybrid reward signals that jointly capture geometric constraints, structural fidelity, and visual quality, enabling agents to navigate the canvas, model inter-element relationships, and optimize spatial arrangements. Through group-relative policy optimization, the agent generates content-aware layouts that reflect salient regions and respect spatial constraints, and produces an interpretable reasoning trace explaining placement decisions together with a structured layout specification. Experimental results show that LaySPA substantially improves the generation of structurally valid and visually appealing layouts, outperforming larger general-purpose LLMs and achieving performance comparable to state-of-the-art specialized layout models.
[233] Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents
Myung Ho Kim
Main category: cs.AI
TL;DR: The Structured Cognitive Loop (SCL) is a new architecture that separates cognition, memory, and control in LLM agents, achieving 86.3% task success rate compared to 70.5-76.8% for baselines.
Details
Motivation: Existing LLM agent frameworks mix cognition, memory, and control in single prompts, reducing coherence and predictability for multi-step tasks.
Method: SCL separates functions: the LLM handles cognition, external memory stores information, and a lightweight controller guides execution in a goal-directed loop with intermediate result verification.
Result: SCL achieved 86.3% average task success rate vs 70.5-76.8% for baselines, with higher goal fidelity, fewer redundant calls, and reduced unsupported assertions across travel planning, email drafting, and image generation tasks.
Conclusion: Separating cognition, memory, and control enhances reliability and interpretability without requiring larger models or heavier prompts.
Abstract: Large language models have advanced natural language understanding and generation, but their use as autonomous agents introduces architectural challenges for multi-step tasks. Existing frameworks often mix cognition, memory, and control in a single prompt, reducing coherence and predictability. The Structured Cognitive Loop (SCL) is proposed as an alternative architecture that separates these functions. In SCL, the language model handles cognition, memory is stored externally, and execution is guided by a lightweight controller within a goal-directed loop. This design allows intermediate results to be recorded and verified before actions are taken, improving traceability and evaluation. SCL is evaluated against prompt-based baselines such as ReAct and LangChain agents across three tasks: travel planning, conditional email drafting, and constraint-guided image generation. Under matched settings, SCL achieves an average task success rate of 86.3 percent, compared with 70.5 to 76.8 percent for baselines. It also shows higher goal fidelity, fewer redundant calls, and reduced unsupported assertions. These results indicate that separating cognition, memory, and control can enhance reliability and interpretability without relying on larger models or heavier prompts. The findings should be regarded as preliminary evidence, with broader tests across model families and task domains planned for future work.
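The separation described in the abstract can be pictured with a small sketch. This is not the paper's implementation: the names `Memory`, `run_loop`, `propose`, and `verify` are hypothetical, and the cognition step is a stub standing in for a language model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Memory:
    """External store (hypothetical): holds verified intermediate results."""
    records: List[str] = field(default_factory=list)

    def write(self, item: str) -> None:
        self.records.append(item)


def run_loop(
    propose: Callable[[str, List[str]], Optional[str]],
    verify: Callable[[str], bool],
    goal: str,
    max_steps: int = 10,
) -> Memory:
    """Lightweight controller: asks cognition for a step, verifies it, then records it."""
    memory = Memory()
    for _ in range(max_steps):
        step = propose(goal, memory.records)  # cognition (an LLM call in practice)
        if step is None:                      # cognition signals there is nothing left to do
            break
        if verify(step):                      # check the intermediate result before acting
            memory.write(step)                # only verified results reach memory
    return memory


# Toy usage: a canned "cognition" that emits one invalid (empty) step.
candidates = iter(["book flight", "", "book hotel"])
result = run_loop(
    propose=lambda goal, history: next(candidates, None),
    verify=lambda step: bool(step.strip()),
    goal="plan a trip",
)
print(result.records)
```

In this toy run the empty step is rejected by the verifier, so only the two valid steps are written to memory.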
[234] How can we assess human-agent interactions? Case studies in software agent design
Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig
Main category: cs.AI
TL;DR: PULSE framework enables efficient human-centric evaluation of LLM agents by combining user feedback with ML-predicted satisfaction, reducing confidence intervals by 40% compared to standard A/B testing.
Details
Motivation: Current benchmarks for LLM agents assume full automation and fail to capture the collaborative nature of real-world human-agent interactions, limiting their practical relevance.
Method: Proposes the PULSE framework: collect user feedback, train an ML model to predict user satisfaction, and compute results by combining human ratings with model-generated pseudo-labels. Deployed on a large-scale platform with 15k+ users built around the OpenHands agent.
Result: Framework reduced confidence intervals by 40% vs standard A/B tests. Found substantial discrepancies between in-the-wild results and benchmark performance (e.g., anti-correlation between Claude-Sonnet-4 and GPT-5 comparisons). Case studies revealed how LLM backbone, planning strategy, and memory mechanisms impact developer satisfaction.
Conclusion: Benchmark-driven evaluation has limitations; human-centric evaluation provides more robust insights for practical agent design. Framework offers guidance for evaluating LLM agents with humans and identifies opportunities for better agent designs.
Abstract: LLM-powered agents are both a promising new technology and a source of complexity, where choices about models, tools, and prompting can affect their usefulness. While numerous benchmarks measure agent accuracy across domains, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy the framework on a large-scale web platform built around the open-source software agent OpenHands, collecting in-the-wild usage data across over 15k users. We conduct case studies around how three agent design decisions – choice of LLM backbone, planning strategy, and memory mechanisms – impact developer satisfaction rates, yielding practical insights for software agent design. We also show how our framework can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results and benchmark performance (e.g., the anti-correlation between results comparing claude-sonnet-4 and gpt-5), underscoring the limitations of benchmark-driven evaluation. Our findings provide guidance for evaluations of LLM agents with humans and identify opportunities for better agent designs.
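The abstract does not spell out how human ratings and pseudo-labels are combined. As an illustration only, one common way to do this kind of combination is a prediction-powered-style mean estimate: average the model's predictions on unrated sessions, then correct by the average gap between human ratings and predictions on the rated sessions. The function name and the numbers below are assumptions, not PULSE's actual estimator or data.

```python
from statistics import fmean
from typing import Sequence


def combined_satisfaction(
    human: Sequence[float],
    pred_labeled: Sequence[float],
    pred_unlabeled: Sequence[float],
) -> float:
    """Estimate mean satisfaction from a few human ratings plus many pseudo-labels.

    human          -- human ratings on the rated sessions
    pred_labeled   -- model predictions on those same rated sessions
    pred_unlabeled -- model predictions (pseudo-labels) on unrated sessions
    """
    if len(human) != len(pred_labeled):
        raise ValueError("human and pred_labeled must align one-to-one")
    # Bias correction: how far the model is off, on average, where humans rated.
    rectifier = fmean(h - p for h, p in zip(human, pred_labeled))
    return fmean(pred_unlabeled) + rectifier


# Toy numbers: the model under-predicts satisfaction on the rated sessions,
# so the pseudo-label average (0.5) is shifted up by the rectifier (0.25).
estimate = combined_satisfaction(
    human=[1, 0, 1, 1],
    pred_labeled=[0.5, 0.5, 0.5, 0.5],
    pred_unlabeled=[0.5, 0.75, 0.25, 0.5],
)
print(estimate)
```

The intuition behind narrower confidence intervals in such schemes is that the large pseudo-labeled pool reduces variance while the small human-rated pool removes the model's bias.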
[235] Towards Relaxed Multimodal Inputs for Gait-based Parkinson’s Disease Assessment
Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen
Main category: cs.AI
TL;DR: Proposes TRIP, a Parkinson’s disease assessment system using multi-objective optimization for flexible multimodal learning that works with asynchronous modalities during training and inference.
Details
Motivation: Address limitations of current multimodal approaches: the need for synchronized modalities during training and the dependence on all modalities during inference.
Method: Formulates multimodal learning as a multi-objective optimization problem, with a margin-based class rebalancing strategy to mitigate imbalance within individual modalities.
Result: Achieves state-of-the-art performance, outperforming best baselines by 16.48, 6.89, and 11.55 percentage points in asynchronous setting, and by 4.86 and 2.30 percentage points in synchronous setting.
Conclusion: TRIP framework demonstrates effectiveness and adaptability for Parkinson’s disease assessment with flexible modality requirements.
Abstract: Parkinson’s disease assessment has garnered growing interest in recent years, particularly with the advent of sensor data and machine learning techniques. Among these, multimodal approaches have demonstrated strong performance by effectively integrating complementary information from various data sources. However, two major limitations hinder their practical application: (1) the need to synchronize all modalities during training, and (2) the dependence on all modalities during inference. To address these issues, we propose the first Parkinson’s assessment system that formulates multimodal learning as a multi-objective optimization (MOO) problem. This not only allows for more flexible modality requirements during both training and inference, but also handles the modality collapse issue during multimodal information fusion. In addition, to mitigate the imbalance within individual modalities, we introduce a margin-based class rebalancing strategy to enhance category learning. We conduct extensive experiments on three public datasets under both synchronous and asynchronous settings. The results show that our framework, Towards Relaxed InPuts (TRIP), achieves state-of-the-art performance, outperforming the best baselines by 16.48, 6.89, and 11.55 percentage points in the asynchronous setting, and by 4.86 and 2.30 percentage points in the synchronous setting, highlighting its effectiveness and adaptability.
[236] Executable Epistemology: The Structured Cognitive Loop as an Architecture of Intentional Understanding
Myung Ho Kim
Main category: cs.AI
TL;DR: The paper introduces Structured Cognitive Loop (SCL), an executable epistemological framework that bridges philosophy and AI by defining intelligence as a continuous process of judgment, memory, control, action, and regulation.
Details
Motivation: To address the gap that large language models lack genuine epistemic understanding and proper cognitive architecture, moving from ontological questions about intelligence to epistemological questions about conditions for cognitive emergence.
Method: SCL operationalizes philosophical insights into computationally interpretable structures through process philosophy, enactive cognition, and extended mind theory, creating functional separation within cognitive architecture.
Result: SCL demonstrates that functional separation yields more coherent and interpretable behavior than monolithic prompt-based systems, as supported by agent evaluations.
Conclusion: Real progress requires cognitive architectures that structurally realize cognitive principles rather than larger models, framing knowledge as continuous reconstruction within phenomenologically coherent loops rather than truth possession.
Abstract: Large language models exhibit intelligence without genuine epistemic understanding, exposing a key gap: the absence of epistemic architecture. This paper introduces the Structured Cognitive Loop (SCL) as an executable epistemological framework for emergent intelligence. Unlike traditional AI research asking “what is intelligence?” (ontological), SCL asks “under what conditions does cognition emerge?” (epistemological). Grounded in philosophy of mind and cognitive phenomenology, SCL bridges conceptual philosophy and implementable cognition. Drawing on process philosophy, enactive cognition, and extended mind theory, we define intelligence not as a property but as a performed process – a continuous loop of judgment, memory, control, action, and regulation. SCL makes three contributions. First, it operationalizes philosophical insights into computationally interpretable structures, enabling “executable epistemology” – philosophy as structural experiment. Second, it shows that functional separation within cognitive architecture yields more coherent and interpretable behavior than monolithic prompt based systems, supported by agent evaluations. Third, it redefines intelligence: not representational accuracy but the capacity to reconstruct its own epistemic state through intentional understanding. This framework impacts philosophy of mind, epistemology, and AI. For philosophy, it allows theories of cognition to be enacted and tested. For AI, it grounds behavior in epistemic structure rather than statistical regularity. For epistemology, it frames knowledge not as truth possession but as continuous reconstruction within a phenomenologically coherent loop. We situate SCL within debates on cognitive phenomenology, emergence, normativity, and intentionality, arguing that real progress requires not larger models but architectures that realize cognitive principles structurally.
[237] Retrieval and Argumentation Enhanced Multi-Agent LLMs for Judgmental Forecasting
Deniz Gorur, Antonio Rago, Francesca Toni
Main category: cs.AI
TL;DR: A multi-agent framework using LLMs for claim verification in judgmental forecasting, where agents generate evidence as quantitative bipolar argumentation frameworks and combine their assessments to improve accuracy.
Details
Motivation: To improve judgmental forecasting by treating it as claim verification and leveraging multiple agents with different evidence-gathering approaches to provide more accurate and explainable predictions.
Method: Proposed a multi-agent framework with three types of LLM-powered agents: ArgLLM (existing claim verification), RbAM (relation-based argument mining from external sources), and RAG-ArgLLM (retrieval-augmented generation of arguments). Agents generate QBAFs representing evidence for/against claims.
Result: Experiments on two judgmental forecasting datasets showed that combining evidence from multiple agents (especially three agents) improves forecasting accuracy while providing explainable evidence combinations.
Conclusion: Multi-agent frameworks with diverse evidence-gathering approaches can enhance judgmental forecasting accuracy and provide transparent reasoning through quantitative argumentation frameworks.
Abstract: Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi-agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM-empowered Relation-based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG-ArgLLM agents, extending ArgLLM agents with a form of Retrieval-Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.
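To make the QBAF representation concrete, here is a minimal sketch of evaluating a tree-shaped QBAF. It assumes the DF-QuAD gradual semantics, a standard choice in the argumentation literature; the abstract does not state which semantics the agents use, and the `Argument` class and the base scores below are hypothetical.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List


@dataclass
class Argument:
    """Node in a tree-shaped QBAF: a base score in [0, 1] plus child arguments."""
    base: float
    supporters: List["Argument"] = field(default_factory=list)
    attackers: List["Argument"] = field(default_factory=list)


def aggregate(children: List[Argument]) -> float:
    """DF-QuAD aggregation: 1 - prod(1 - strength); 0.0 when there are no children."""
    return 1.0 - prod(1.0 - strength(c) for c in children)


def strength(arg: Argument) -> float:
    """Final strength under DF-QuAD (assumed semantics, see lead-in)."""
    va = aggregate(arg.attackers)   # combined strength of attackers
    vs = aggregate(arg.supporters)  # combined strength of supporters
    if va >= vs:
        # Attack dominates (or ties): move the base score toward 0.
        return arg.base - arg.base * (va - vs)
    # Support dominates: move the base score toward 1.
    return arg.base + (1.0 - arg.base) * (vs - va)


# Toy claim with one supporter (0.5) and one weaker attacker (0.25).
claim = Argument(base=0.5, supporters=[Argument(0.5)], attackers=[Argument(0.25)])
print(strength(claim))
```

Because the final strength is computed from named supporting and attacking arguments, the score can be traced back to the evidence that produced it, which is the explainability property the abstract emphasizes.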
[238] FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data
Kun Ouyang, Haoyu Wang, Dong Fang
Main category: cs.AI
TL;DR: FELA is a multi-agent evolutionary system that uses LLMs to autonomously extract meaningful features from complex industrial event log data, improving model performance while maintaining explainability.
Details
Motivation: Industrial event logs are complex and heterogeneous, making feature engineering challenging. Existing automated approaches lack explainability, flexibility, and adaptability to handle diverse data types and structures.
Method: Uses specialized LLM agents (Idea, Code, Critic, Evaluation) with insight-guided self-evolution. Combines reinforcement learning and genetic algorithms for idea space exploration and exploitation.
Result: Extensive experiments on real industrial datasets show FELA generates explainable, domain-relevant features that significantly improve model performance while reducing manual effort.
Conclusion: LLM-based multi-agent systems provide a general framework for automated, interpretable, and adaptive feature engineering in complex real-world environments.
Abstract: Event log data, recording fine-grained user actions and system events, represent one of the most valuable assets for modern digital services. However, the complexity and heterogeneity of industrial event logs–characterized by large scale, high dimensionality, diverse data types, and intricate temporal or relational structures–make feature engineering extremely challenging. Existing automatic feature engineering approaches, such as AutoML or genetic methods, often suffer from limited explainability, rigid predefined operations, and poor adaptability to complicated heterogeneous data. In this paper, we propose FELA (Feature Engineering LLM Agents), a multi-agent evolutionary system that autonomously extracts meaningful and high-performing features from complex industrial event log data. FELA integrates the reasoning and coding capabilities of large language models (LLMs) with an insight-guided self-evolution paradigm. Specifically, FELA employs specialized agents–Idea Agents, Code Agents, and Critic Agents–to collaboratively generate, validate, and implement novel feature ideas. An Evaluation Agent summarizes feedback and updates a hierarchical knowledge base and dual-memory system to enable continual improvement. Moreover, FELA introduces an agentic evolution algorithm, combining reinforcement learning and genetic algorithm principles to balance exploration and exploitation across the idea space. Extensive experiments on real industrial datasets demonstrate that FELA can generate explainable, domain-relevant features that significantly improve model performance while reducing manual effort. Our results highlight the potential of LLM-based multi-agent systems as a general framework for automated, interpretable, and adaptive feature engineering in complex real-world environments.
[239] ARC-GEN: A Mimetic Procedural Benchmark Generator for the Abstraction and Reasoning Corpus
Michael D. Moffitt
Main category: cs.AI
TL;DR: ARC-GEN is a procedural generator that extends the ARC-AGI benchmark training dataset by creating additional input-output grid pairs while maintaining fidelity to the original distribution.
Details
Motivation: The ARC-AGI benchmark has limited demonstration examples per task, which constrains algorithms requiring extensive intra-task exemplars. This paper aims to expand the viable sample pairs while preserving the original dataset's characteristics.
Method: Developed ARC-GEN, an open-source procedural generator that covers all 400 ARC-AGI tasks and closely mimics the distributional properties of the original ARC-AGI-1 release.
Result: Created a generator that is both exhaustive (covering all tasks) and mimetic (faithful to original distribution), enabling extension of the training dataset.
Conclusion: ARC-GEN provides a way to expand the ARC-AGI benchmark dataset while maintaining fidelity, and has been used to establish a static benchmark suite for verifying program correctness in the 2025 Google Code Golf Championship.
Abstract: The Abstraction and Reasoning Corpus remains one of the most compelling and challenging benchmarks for tracking progress toward achieving Artificial General Intelligence. In contrast to other evaluation datasets designed to assess an agent’s task-specific skills or accumulated knowledge, the ARC-AGI suite is specifically targeted at measuring skill acquisition efficiency, a trait that has (so far) been lacking in even the most sophisticated machine learning systems. For algorithms that require extensive intra-task exemplars, a significant constraint imposed by ARC-AGI is the modest cardinality of its demonstration set, comprising a small number of $\langle$ input, output $\rangle$ grids per task specifying the corresponding transformation. To embellish the space of viable sample pairs, this paper introduces ARC-GEN, an open-source procedural generator aimed at extending the original ARC-AGI training dataset as faithfully as possible. Unlike prior efforts, our generator is both exhaustive (covering all four-hundred tasks) and mimetic (more closely honoring the distributional properties and characteristics embodied in the initial ARC-AGI-1 release). We also discuss the use of this generator in establishing a static benchmark suite to verify the correctness of programs submitted to the 2025 Google Code Golf Championship.
cs.SD
[240] Improving DF-Conformer Using Hydra For High-Fidelity Generative Speech Enhancement on Discrete Codec Token
Shogo Seki, Shaoxiang Dang, Li Li
Main category: cs.SD
TL;DR: The paper proposes replacing FAVOR+ with bidirectional selective structured state-space models (Hydra) in DF-Conformer for speech enhancement, achieving better performance while maintaining linear complexity.
Details
Motivation: To enhance global sequential modeling by eliminating approximations in FAVOR+ and maintain linear complexity relative to sequence length in speech enhancement models.
Method: Replace FAVOR+ with bidirectional selective structured state-space sequence models (Hydra, a bidirectional extension of Mamba) within the structured matrix mixer framework, using dilated convolution to expand receptive field.
Result: Experiments with Genhancer (generative SE model on discrete codec tokens) demonstrate that the proposed method surpasses the performance of the original DF-Conformer.
Conclusion: The proposed approach successfully improves speech enhancement performance by replacing FAVOR+ with bidirectional selective structured state-space models, achieving better global sequential modeling while maintaining computational efficiency.
Abstract: The Dilated FAVOR Conformer (DF-Conformer) is an efficient variant of the Conformer architecture designed for speech enhancement (SE). It employs fast attention through positive orthogonal random features (FAVOR+) to mitigate the quadratic complexity associated with self-attention, while utilizing dilated convolution to expand the receptive field. This combination results in impressive performance across various SE models. In this paper, we propose replacing FAVOR+ with bidirectional selective structured state-space sequence models to achieve two main objectives: (1) enhancing global sequential modeling by eliminating the approximations inherent in FAVOR+, and (2) maintaining linear complexity relative to the sequence length. Specifically, we utilize Hydra, a bidirectional extension of Mamba, framed within the structured matrix mixer framework. Experiments conducted using a generative SE model on discrete codec tokens, known as Genhancer, demonstrate that the proposed method surpasses the performance of the DF-Conformer.
[241] Perceived Femininity in Singing Voice: Analysis and Prediction
Yuexuan Kong, Viet-Anh Tran, Romain Hennequin
Main category: cs.SD
TL;DR: This paper studies perceived voice femininity in singing voices, which has been overlooked compared to speech. The authors conducted a survey with 128 participants and developed an automatic prediction model using x-vector fine-tuning.
Details
Motivation: Existing research has examined perceived voice femininity in speech, but not in singing voices. Understanding this could help analyze gender bias in music content.
Method: Designed a stimuli-based survey with 128 participants to measure perceived singing voice femininity (PSVF), and proposed an automatic PSVF prediction model by fine-tuning an x-vector model.
Result: Analysis revealed how PSVF varies across different demographic groups. The automatic model provides a tool for exploring gender stereotypes in music content analysis beyond binary sex classification.
Conclusion: This study contributes to understanding perceived femininity in singing voices through survey analysis and provides an automatic tool for future research on gender stereotypes in music.
Abstract: This paper focuses on the often-overlooked aspect of perceived voice femininity in singing voices. While existing research has examined perceived voice femininity in speech, the same concept has not yet been studied in singing voice. The analysis of gender bias in music content could benefit from such study. To address this gap, we design a stimuli-based survey to measure perceived singing voice femininity (PSVF), and collect responses from 128 participants. Our analysis reveals intriguing insights into how PSVF varies across different demographic groups. Furthermore, we propose an automatic PSVF prediction model by fine-tuning an x-vector model, offering a novel tool for exploring gender stereotypes related to voices in music content analysis beyond binary sex classification. This study contributes to a deeper understanding of the complexities surrounding perceived femininity in singing voices by analyzing survey responses, and proposes an automatic tool for future research.
[242] Prevailing Research Areas for Music AI in the Era of Foundation Models
Megan Wei, Mateusz Modrzejewski, Aswin Sivaraman, Dorien Herremans
Main category: cs.SD
TL;DR: This paper surveys promising research directions in music AI, covering foundational models, multimodal systems, generative approaches, and copyright implications.
Details
Motivation: As AI-generated music becomes mainstream, the music AI community needs guidance on unexplored research frontiers and opportunities enabled by recent foundation model developments.
Method: The authors conduct a comprehensive survey examining foundational representation models, multimodal systems, music datasets, model efficiency, generative models, and applied directions including editing, captioning, production, and copyright considerations.
Result: The paper identifies key research areas including explainability, multimodal integration, dataset limitations, evaluation challenges, controllability issues, and copyright protection strategies that need further investigation.
Conclusion: While not exhaustive, the survey illuminates promising research directions in music foundation models, highlighting opportunities in explainability, multimodal systems, generative approaches, and artist rights protection.
Abstract: Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists’ workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.
[243] Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
Shu Wu, Chenxing Li, Wenfu Wang, Hao Zhang, Hualei Wang, Meng Yu, Dong Yu
Main category: cs.SD
TL;DR: Audio-Thinker is a reinforcement learning framework that enhances reasoning capabilities of large audio language models through adaptive think accuracy rewards and external reward models.
Details
Motivation: Current LALMs lack explicit reasoning benefits for audio question answering and fall short of human-level auditory-language reasoning, with deep reasoning remaining an open challenge.
Method: Proposes reinforcement learning with adaptive think accuracy rewards, external reward models for consistency evaluation, and think-based rewards to distinguish valid vs flawed reasoning paths.
Result: Audio-Thinker outperforms existing reasoning-oriented LALMs across various benchmark tasks, showing superior reasoning and generalization capabilities.
Conclusion: The framework successfully enhances LALMs’ reasoning by improving adaptability, consistency, and effectiveness through dynamic reasoning strategy adjustment and comprehensive reward mechanisms.
Abstract: Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity dynamically. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
cs.LG
[244] CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding
Main category: cs.LG
TL;DR: CudaForge is a training-free multi-agent workflow for CUDA kernel generation and optimization that uses two LLM agents (Coder and Judge) to iteratively generate, test, and optimize kernels with hardware feedback, achieving 97.6% correctness and 1.68x speedup over PyTorch baselines.
Details
Motivation: Manual CUDA kernel design is costly and time-consuming for AI applications like LLM training, while existing automatic approaches produce low-efficiency kernels with high computational overhead and poor generalization.
Method: Multi-agent workflow with two LLM agents (Coder and Judge) that iteratively generate, correct, and optimize CUDA kernels while integrating hardware feedback from Nsight Compute metrics, mimicking human expert workflow.
Result: Achieves 97.6% correctness of generated kernels, 1.68x speedup over PyTorch baselines, strong generalization across GPUs and base models, with generation cost of 26.5 minutes and $0.3 API cost per kernel.
Conclusion: Multi-agent, training-free workflows enable cost-effective, generalizable, and high-performance CUDA kernel optimization, significantly outperforming existing methods in both efficiency and cost.
Abstract: Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation, however, often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In this work, we propose CudaForge, a training-free multi-agent workflow for CUDA kernel generation and optimization. Our workflow is inspired by the iterative workflow of human experts, which contains steps such as developing initial kernels, testing correctness, analyzing hardware feedback, and iterative improvement. More specifically, CudaForge employs two LLM agents: a Coder and a Judge, that iteratively generate, correct, and optimize CUDA kernels, while integrating hardware feedback such as Nsight Compute (NCU) metrics. In extensive evaluations, we show that CudaForge, by leveraging base models like OpenAI-o3, achieves 97.6% correctness of generated kernels and an average 1.68$\times$ speedup over PyTorch baselines, substantially surpassing state-of-the-art models including OpenAI-o3 and Kevin on KernelBench. Beyond accuracy and speed, CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090, 3090) and base models (OpenAI-o3, GPT-5, gpt-oss-120B, Claude-Sonnet-4, QwQ-32B), while maintaining high efficiency. In particular, generating an optimized kernel takes about 26.5 minutes on one RTX6000 and incurs about $ 0.3 API cost, which is significantly cheaper than existing agentic work that costs 6 H100 hours and $ 5 API cost per kernel. Our results highlight that multi-agent, training-free workflows can enable cost-effective, generalizable, and high-performance CUDA kernel optimization. Code available at https://github.com/OptimAI-Lab/CudaForge
[245] Retrieval-Augmented Multimodal Depression Detection
Ruibo Hou, Shiyu Teng, Jiaqing Liu, Shurong Chai, Yinhao Li, Lanfen Lin, Yen-Wei Chen
Main category: cs.LG
TL;DR: A novel RAG framework for depression detection that retrieves emotional content from sentiment datasets and uses LLMs to generate Emotion Prompts, achieving state-of-the-art performance.
Details
Motivation: To address limitations in current multimodal depression detection methods including high computational cost, domain mismatch, and static knowledge constraints.
Method: Proposes a Retrieval-Augmented Generation (RAG) framework that retrieves semantically relevant emotional content from sentiment datasets and uses Large Language Models to generate Emotion Prompts as auxiliary modality.
Result: Achieves state-of-the-art performance on AVEC 2019 dataset with CCC of 0.593 and MAE of 3.95, surpassing previous transfer learning and multi-task learning baselines.
Conclusion: The RAG framework effectively enhances emotional representation and interpretability in depression detection while overcoming computational and domain limitations.
Abstract: Multimodal deep learning has shown promise in depression detection by integrating text, audio, and video signals. Recent work leverages sentiment analysis to enhance emotional understanding, yet suffers from high computational cost, domain mismatch, and static knowledge limitations. To address these issues, we propose a novel Retrieval-Augmented Generation (RAG) framework. Given a depression-related text, our method retrieves semantically relevant emotional content from a sentiment dataset and uses a Large Language Model (LLM) to generate an Emotion Prompt as an auxiliary modality. This prompt enriches emotional representation and improves interpretability. Experiments on the AVEC 2019 dataset show our approach achieves state-of-the-art performance with CCC of 0.593 and MAE of 3.95, surpassing previous transfer learning and multi-task learning baselines.
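The two reported metrics can be made concrete with a short sketch. It assumes CCC refers to Lin's concordance correlation coefficient in its usual population-variance form and that MAE is the plain mean absolute error; the function names and sample scores are illustrative and are not taken from the paper.

```python
import numpy as np


def concordance_correlation_coefficient(y_true, y_pred):
    """Lin's CCC: agreement between predictions and labels (1.0 = perfect)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    # Population (biased) variances and covariance, as in the usual definition.
    var_true, var_pred = y_true.var(), y_pred.var()
    covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    return 2.0 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predictions and labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))


# Hypothetical severity scores, for illustration only.
labels = [1.0, 2.0, 3.0]
predictions = [2.0, 3.0, 4.0]
ccc = concordance_correlation_coefficient(labels, predictions)
mae = mean_absolute_error(labels, predictions)
```

Unlike Pearson correlation, CCC penalizes a constant offset: the predictions above are perfectly correlated with the labels, yet the mean shift of 1 pulls CCC below 1.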
[246] Deciphering Personalization: Towards Fine-Grained Explainability in Natural Language for Personalized Image Generation Models
Haoming Wang, Wei Gao
Main category: cs.LG
TL;DR: FineXL provides fine-grained natural language explanations for personalized image generation models, identifying multiple aspects of personalization with quantitative scores.
Details
Motivation: Current personalized image generation models lack explainability, and existing natural language explanations are too coarse-grained to identify multiple aspects and varying levels of personalization.
Method: FineXL technique that generates natural language descriptions for each distinct aspect of personalization along with quantitative scores indicating the level of each aspect.
Result: FineXL improves the accuracy of explainability by 56% across different personalization scenarios applied to multiple types of image generation models.
Conclusion: FineXL successfully addresses the limitation of coarse-grained explainability by providing fine-grained natural language explanations with quantitative personalization scores.
Abstract: Image generation models are usually personalized in practical uses in order to better meet the individual users’ heterogeneous needs, but most personalized models lack explainability about how they are being personalized. Such explainability can be provided via visual features in generated images, but is difficult for human users to understand. Explainability in natural language is a better choice, but the existing approaches to explainability in natural language are limited to coarse-grained explanations. They are unable to precisely identify the multiple aspects of personalization, as well as the varying levels of personalization in each aspect. To address this limitation, in this paper we present a new technique, namely FineXL, towards Fine-grained eXplainability in natural Language for personalized image generation models. FineXL can provide natural language descriptions about each distinct aspect of personalization, along with quantitative scores indicating the level of each aspect of personalization. Experiment results show that FineXL can improve the accuracy of explainability by 56%, when different personalization scenarios are applied to multiple types of image generation models.
[247] The Eigenvalues Entropy as a Classifier Evaluation Measure
Doulaye Dembélé
Main category: cs.LG
TL;DR: Proposes using eigenvalues entropy as an evaluation measure for classification problems, especially effective for imbalanced datasets, with relationships to standard metrics like sensitivity, specificity, AUC, and Gini index.
Details
Motivation: Standard evaluation measures are less accurate for imbalanced datasets, creating a need for more robust evaluation metrics that can handle class imbalance effectively.
Method: Uses eigenvalues entropy as an evaluation measure for binary and multi-class classification problems, establishes mathematical relationships between eigenvalues and traditional metrics, and provides an estimate of the confusion matrix to address class imbalance.
Result: The proposed eigenvalues entropy measure shows better performance than gold standard measures across various datasets, particularly for imbalanced class scenarios.
Conclusion: Eigenvalues entropy is an effective evaluation measure that outperforms traditional metrics for imbalanced datasets and provides mathematical connections to established evaluation measures.
Abstract: Classification is a machine learning method used in many practical applications: text mining, handwritten character recognition, face recognition, pattern classification, scene labeling, computer vision, natural language processing. A classifier's prediction results and training set information are often used to get a contingency table which is used to quantify the method quality through an evaluation measure. Such a measure, typically a numerical value, allows one to choose a suitable method among several. Many evaluation measures available in the literature are less accurate for a dataset with imbalanced classes. In this paper, the eigenvalues entropy is used as an evaluation measure for a binary or a multi-class problem. For a binary problem, relations are given between the eigenvalues and some commonly used measures, the sensitivity, the specificity, the area under the receiver operating characteristic curve and the Gini index. A by-product result of this paper is an estimate of the confusion matrix to deal with the curse of the imbalanced classes. Various data examples are used to show the better performance of the proposed evaluation measure over the gold standard measures available in the literature.
[248] Human-Machine Ritual: Synergic Performance through Real-Time Motion Recognition
Zhuodi Cai, Ziyu Xu, Juan Pampin
Main category: cs.LG
TL;DR: A real-time motion recognition system using wearable IMU sensors and MiniRocket classification enables responsive human-machine collaboration in dance performance with <50ms latency.
Details
Motivation: To create a human-centered approach to human-machine collaboration that preserves expressive depth in dance while leveraging machine learning for responsive interaction.
Method: Uses wearable IMU sensors to capture movement data, applies MiniRocket time-series classification for real-time recognition, and maps dancer-specific movements to sound through somatic memory associations.
Result: Achieves high accuracy classification with less than 50ms latency, providing reliable performance for real-time applications.
Conclusion: Offers a replicable framework for integrating dance-literate machines into creative, educational, and live performance contexts through responsive multimedia control.
Abstract: We introduce a lightweight, real-time motion recognition system that enables synergic human-machine performance through wearable IMU sensor data, MiniRocket time-series classification, and responsive multimedia control. By mapping dancer-specific movement to sound through somatic memory and association, we propose an alternative approach to human-machine collaboration, one that preserves the expressive depth of the performing body while leveraging machine learning for attentive observation and responsiveness. We demonstrate that this human-centered design reliably supports high accuracy classification (<50 ms latency), offering a replicable framework to integrate dance-literate machines into creative, educational, and live performance contexts.
[249] H-Infinity Filter Enhanced CNN-LSTM for Arrhythmia Detection from Heart Sound Recordings
Rohith Shinoj Kumar, Rushdeep Dinda, Aditya Tyagi, Annappa B., Naveen Kumar M. R
Main category: cs.LG
TL;DR: A novel CNN-H-Infinity-LSTM architecture is proposed for heart arrhythmia detection from heart sound recordings, achieving 99.42% accuracy and 98.85% F1 score on the PhysioNet CinC Challenge 2016 dataset.
Details
Motivation: Manual diagnosis of heart arrhythmia is subjective and relies on visual interpretation. Current deep learning models struggle with generalization in real-world scenarios with small or noisy datasets common in biomedical applications.
Method: Proposed a CNN-H-Infinity-LSTM architecture that incorporates trainable parameters inspired by the H-Infinity filter from control theory to enhance robustness and generalization in arrhythmia detection from heart sound recordings.
Result: The model achieved stable convergence and outperformed existing benchmarks with a test accuracy of 99.42% and F1 score of 98.85% on the PhysioNet CinC Challenge 2016 dataset.
Conclusion: The proposed CNN-H-Infinity-LSTM architecture successfully addresses generalization challenges in arrhythmia detection, demonstrating superior performance and robustness compared to existing methods.
Abstract: Early detection of heart arrhythmia can prevent severe future complications in cardiac patients. While manual diagnosis still remains the clinical standard, it relies heavily on visual interpretation and is inherently subjective. In recent years, deep learning has emerged as a powerful tool to automate arrhythmia detection, offering improved accuracy, consistency, and efficiency. Several variants of convolutional and recurrent neural network architectures have been widely explored to capture spatial and temporal patterns in physiological signals. However, despite these advancements, current models often struggle to generalize well in real-world scenarios, especially when dealing with small or noisy datasets, which are common challenges in biomedical applications. In this paper, a novel CNN-H-Infinity-LSTM architecture is proposed to identify arrhythmic heart signals from heart sound recordings. This architecture introduces trainable parameters inspired by the H-Infinity filter from control theory, enhancing robustness and generalization. Extensive experimentation on the PhysioNet CinC Challenge 2016 dataset, a public benchmark of heart audio recordings, demonstrates that the proposed model achieves stable convergence and outperforms existing benchmarks, with a test accuracy of 99.42% and an F1 score of 98.85%.
[250] Variational Geometry-aware Neural Network based Method for Solving High-dimensional Diffeomorphic Mapping Problems
Zhiwen Li, Cheuk Hin Ho, Lok Ming Lui
Main category: cs.LG
TL;DR: A mesh-free learning framework for n-dimensional diffeomorphic mapping that combines variational principles with quasi-conformal theory to ensure accurate, bijective mappings while controlling deformation quality.
Details
Motivation: Traditional methods for high-dimensional diffeomorphic mapping struggle with the curse of dimensionality, necessitating more scalable and robust approaches.
Method: Proposes a mesh-free learning framework that combines variational principles with quasi-conformal theory, regulating conformality distortion and volume distortion to ensure bijective mappings. The framework is compatible with gradient-based optimization and neural networks.
Result: Numerical experiments on synthetic and real-world medical image data validate the accuracy, robustness, and effectiveness of the method in complex registration scenarios.
Conclusion: The proposed framework provides a flexible and scalable solution for high-dimensional diffeomorphic mapping problems, overcoming limitations of traditional methods.
Abstract: Traditional methods for high-dimensional diffeomorphic mapping often struggle with the curse of dimensionality. We propose a mesh-free learning framework designed for n-dimensional mapping problems, seamlessly combining variational principles with quasi-conformal theory. Our approach ensures accurate, bijective mappings by regulating conformality distortion and volume distortion, enabling robust control over deformation quality. The framework is inherently compatible with gradient-based optimization and neural network architectures, making it highly flexible and scalable to higher-dimensional settings. Numerical experiments on both synthetic and real-world medical image data validate the accuracy, robustness, and effectiveness of the proposed method in complex registration scenarios.
[251] From Solo to Symphony: Orchestrating Multi-Agent Collaboration with Single-Agent Demos
Xun Wang, Zhuoran Li, Yanshan Lin, Hai Zhong, Longbo Huang
Main category: cs.LG
TL;DR: SoCo transfers solo knowledge to cooperative multi-agent RL via pretraining from solo demonstrations and policy fusion during multi-agent training, boosting efficiency and performance.
Details
Motivation: Training multi-agent teams from scratch is inefficient, and existing methods still rely on costly multi-agent data, while solo experiences are easier to obtain in many scenarios.
Method: Pretrains shared solo policy from solo demonstrations, then adapts for cooperation through policy fusion mechanism with MoE-like gating selector and action editor.
Result: Significantly boosts training efficiency and performance across diverse cooperative tasks compared to backbone algorithms.
Conclusion: Solo demonstrations provide scalable and effective complement to multi-agent data, making cooperative learning more practical and broadly applicable.
Abstract: Training a team of agents from scratch in multi-agent reinforcement learning (MARL) is highly inefficient, much like asking beginners to play a symphony together without first practicing solo. Existing methods, such as offline or transferable MARL, can ease this burden, but they still rely on costly multi-agent data, which often becomes the bottleneck. In contrast, solo experiences are far easier to obtain in many important scenarios, e.g., collaborative coding, household cooperation, and search-and-rescue. To unlock their potential, we propose Solo-to-Collaborative RL (SoCo), a framework that transfers solo knowledge into cooperative learning. SoCo first pretrains a shared solo policy from solo demonstrations, then adapts it for cooperation during multi-agent training through a policy fusion mechanism that combines an MoE-like gating selector and an action editor. Experiments across diverse cooperative tasks show that SoCo significantly boosts the training efficiency and performance of backbone algorithms. These results demonstrate that solo demonstrations provide a scalable and effective complement to multi-agent data, making cooperative learning more practical and broadly applicable.
[252] Superpositional Gradient Descent: Harnessing Quantum Principles for Model Training
Ahmet Erdem Pamuk, Emir Kaan Özdemir, Şuayp Talha Kocabay
Main category: cs.LG
TL;DR: Superpositional Gradient Descent (SGD) is a quantum-inspired optimizer that uses quantum circuit perturbations to enhance classical training of large language models, achieving faster convergence and lower loss than AdamW.
Details
Motivation: To explore how quantum-inspired methods can enhance classical optimization techniques for training large language models, as the mechanisms behind such improvements remain underexplored.
Method: Developed a mathematical framework and implemented hybrid quantum-classical circuits using PyTorch and Qiskit, injecting quantum circuit perturbations into gradient updates to create superpositional effects.
Result: On synthetic sequence classification and large-scale LLM fine-tuning tasks, SGD converged faster and achieved lower final loss compared to AdamW.
Conclusion: The work provides new insights into combining quantum computing with deep learning, suggesting practical ways to leverage quantum principles for controlling and enhancing model behavior, though scalability and hardware constraints remain challenges.
Abstract: Large language models (LLMs) are increasingly trained with classical optimization techniques like AdamW to improve convergence and generalization. However, the mechanisms by which quantum-inspired methods enhance classical training remain underexplored. We introduce Superpositional Gradient Descent (SGD), a novel optimizer linking gradient updates with quantum superposition by injecting quantum circuit perturbations. We present a mathematical framework and implement hybrid quantum-classical circuits in PyTorch and Qiskit. On synthetic sequence classification and large-scale LLM fine-tuning, SGD converges faster and yields lower final loss than AdamW. Despite promising results, scalability and hardware constraints limit adoption. Overall, this work provides new insights into the intersection of quantum computing and deep learning, suggesting practical pathways for leveraging quantum principles to control and enhance model behavior.
[253] Neural Green’s Functions
Seungwoo Yoo, Kyeongmin Yeo, Jisung Hwang, Minhyuk Sung
Main category: cs.LG
TL;DR: Neural Green’s Function is a neural solution operator for linear PDEs that mimics Green’s functions, achieving superior generalization across irregular geometries and functions while being much faster than traditional solvers.
Details
Motivation: To create a neural solution operator that can generalize well across diverse irregular geometries and source/boundary functions for linear PDEs, overcoming limitations of existing learning-based methods that struggle with unseen functions.
Method: Extracts per-point features from volumetric point clouds representing problem domains, predicts decomposition of solution operator, and applies numerical integration to evaluate solutions - designed to be agnostic to specific functions used during training.
Result: Outperforms state-of-the-art neural operators in steady-state thermal analysis, achieving 13.9% average error reduction across five shape categories while being up to 350 times faster than numerical solvers requiring expensive meshing.
Conclusion: Neural Green’s Function provides an effective neural framework for linear PDEs that generalizes robustly across diverse geometries and functions while offering significant computational efficiency gains.
Abstract: We introduce Neural Green’s Function, a neural solution operator for linear partial differential equations (PDEs) whose differential operators admit eigendecompositions. Inspired by Green’s functions, the solution operators of linear PDEs that depend exclusively on the domain geometry, we design Neural Green’s Function to imitate their behavior, achieving superior generalization across diverse irregular geometries and source and boundary functions. Specifically, Neural Green’s Function extracts per-point features from a volumetric point cloud representing the problem domain and uses them to predict a decomposition of the solution operator, which is subsequently applied to evaluate solutions via numerical integration. Unlike recent learning-based solution operators, which often struggle to generalize to unseen source or boundary functions, our framework is, by design, agnostic to the specific functions used during training, enabling robust and efficient generalization. In the steady-state thermal analysis of mechanical part geometries from the MCB dataset, Neural Green’s Function outperforms state-of-the-art neural operators, achieving an average error reduction of 13.9% across five shape categories, while being up to 350 times faster than a numerical solver that requires computationally expensive meshing.
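The idea of a solution operator that depends only on the domain, and is then applied to any source function by numerical integration, can be illustrated with a classical case. The sketch below does not use the paper's learned operator. It uses the known closed-form Green's function of the 1D Poisson problem -u'' = f on [0, 1] with zero Dirichlet boundary values, and the function names are illustrative.

```python
import numpy as np


def green_1d_poisson(x, y):
    """Closed-form Green's function of -u'' = f on [0, 1], u(0) = u(1) = 0."""
    return np.where(x <= y, x * (1.0 - y), y * (1.0 - x))


def solve_with_green(source, num_points=2001):
    """Evaluate u(x) = integral of G(x, y) f(y) dy with the trapezoidal rule."""
    grid = np.linspace(0.0, 1.0, num_points)
    kernel = green_1d_poisson(grid[:, None], grid[None, :])
    # Trapezoidal quadrature weights: half weight at the two end points.
    weights = np.full(num_points, grid[1] - grid[0])
    weights[0] *= 0.5
    weights[-1] *= 0.5
    solution = kernel @ (weights * source(grid))
    return grid, solution


# The same kernel is reused for any source term; here f(y) = 1,
# whose exact solution is u(x) = x (1 - x) / 2.
grid, u = solve_with_green(lambda y: np.ones_like(y))
exact = grid * (1.0 - grid) / 2.0
max_error = np.max(np.abs(u - exact))
```

Because the kernel is built once from the domain and reused for every source function, swapping in a different `source` requires no re-derivation, which is the property the abstract describes as being agnostic to the functions seen during training.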
[254] DeepContour: A Hybrid Deep Learning Framework for Accelerating Generalized Eigenvalue Problem Solving via Efficient Contour Design
Yeqiu Chen, Ziyan Liu, Hong Wang
Main category: cs.LG
TL;DR: DeepContour is a hybrid framework that combines deep learning with contour integral methods to efficiently solve large-scale Generalized Eigenvalue Problems by automatically determining optimal integration contours.
Details
Motivation: Traditional contour integral methods for solving Generalized Eigenvalue Problems are highly dependent on proper contour selection, which requires prior knowledge of eigenvalue distribution. Improper contour selection leads to significant computational overhead and numerical inaccuracy.
Method: DeepContour uses a Fourier Neural Operator to rapidly predict spectral distribution, then applies Kernel Density Estimation to automatically determine optimal integration contours, which guide the contour integral solver to efficiently find eigenvalues.
Result: DeepContour achieves up to 5.63× speedup in solving Generalized Eigenvalue Problems across multiple datasets, demonstrating significant acceleration while maintaining numerical accuracy.
Conclusion: The framework pioneers an efficient and robust paradigm for tackling difficult generalized eigenvalue problems by combining deep learning’s predictive power with classical solvers’ numerical rigor, particularly effective for high-dimensional matrices.
Abstract: Solving large-scale Generalized Eigenvalue Problems (GEPs) is a fundamental yet computationally prohibitive task in science and engineering. As a promising direction, contour integral (CI) methods, such as the CIRR algorithm, offer an efficient and parallelizable framework. However, their performance is critically dependent on the selection of integration contours – improper selection without reliable prior knowledge of eigenvalue distribution can incur significant computational overhead and compromise numerical accuracy. To address this challenge, we propose DeepContour, a novel hybrid framework that integrates a deep learning-based spectral predictor with Kernel Density Estimation for principled contour design. Specifically, DeepContour first employs a Fourier Neural Operator (FNO) to rapidly predict the spectral distribution of a given GEP. Subsequently, Kernel Density Estimation (KDE) is applied to the predicted spectrum to automatically and systematically determine proper integration contours. Finally, these optimized contours guide the CI solver to efficiently find the desired eigenvalues. We demonstrate the effectiveness of our method on diverse challenging scientific problems. In our main experiments, DeepContour accelerates GEP solving across multiple datasets, achieving up to a 5.63× speedup. By combining the predictive power of deep learning with the numerical rigor of classical solvers, this work pioneers an efficient and robust paradigm for tackling difficult generalized eigenvalue problems involving matrices of high dimension.
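The KDE step can be pictured with a rough sketch. This is not the paper's procedure. It assumes the predicted eigenvalues are real, uses a fixed-bandwidth Gaussian KDE written in NumPy, and turns every region where the density exceeds a relative threshold into a circular contour given as (center, radius). The bandwidth, threshold, and function names are hypothetical choices.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Fixed-bandwidth Gaussian kernel density estimate evaluated on a grid."""
    scaled = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * scaled**2).sum(axis=1) / norm


def contours_from_spectrum(predicted_eigs, bandwidth=0.1, rel_threshold=0.05,
                           num_grid=2000):
    """Return (center, radius) circles covering high-density spectral regions."""
    eigs = np.asarray(predicted_eigs, dtype=float)
    pad = 3.0 * bandwidth
    grid = np.linspace(eigs.min() - pad, eigs.max() + pad, num_grid)
    density = gaussian_kde_1d(eigs, grid, bandwidth)
    mask = density > rel_threshold * density.max()
    # Locate contiguous runs of True: rising edges start a run, falling edges end it.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2] - 1
    return [((grid[s] + grid[e]) / 2.0, (grid[e] - grid[s]) / 2.0)
            for s, e in zip(starts, ends)]


# Hypothetical predicted spectrum with two well-separated clusters.
contours = contours_from_spectrum([-0.1, 0.0, 0.1, 4.9, 5.0, 5.1])
```

Each circle would then be passed to a contour integral solver as one integration region, so the empty gap between clusters is never integrated over.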
[255] Dynamic Population Distribution Aware Human Trajectory Generation with Diffusion Model
Qingyue Long, Can Rong, Tong Li, Yong Li
Main category: cs.LG
TL;DR: A diffusion model-based trajectory generation framework that incorporates dynamic population distribution constraints to create realistic human mobility data while addressing privacy and data quality issues.
Details
Motivation: Real-world trajectory data faces privacy concerns, high acquisition costs, and quality issues. Existing methods focus on individual patterns but ignore population distribution's influence on mobility behavior.
Method: Uses diffusion model with spatial graph for spatial correlation and dynamic population distribution aware denoising network to capture spatiotemporal dependencies and population impact.
Result: Generated trajectories closely resemble real-world data in critical statistical metrics, outperforming state-of-the-art algorithms by over 54%.
Conclusion: Integrating dynamic population distribution constraints in trajectory generation significantly improves realism and outperforms existing methods.
Abstract: Human trajectory data is crucial in urban planning, traffic engineering, and public health. However, directly using real-world trajectory data often faces challenges such as privacy concerns, data acquisition costs, and data quality. A practical solution to these challenges is trajectory generation, a method developed to simulate human mobility behaviors. Existing trajectory generation methods mainly focus on capturing individual movement patterns but often overlook the influence of population distribution on trajectory generation. In reality, dynamic population distribution reflects changes in population density across different regions, significantly impacting individual mobility behavior. Thus, we propose a novel trajectory generation framework based on a diffusion model, which integrates the dynamic population distribution constraints to guide high-fidelity generation outcomes. Specifically, we construct a spatial graph to enhance the spatial correlation of trajectories. Then, we design a dynamic population distribution aware denoising network to capture the spatiotemporal dependencies of human mobility behavior as well as the impact of population distribution in the denoising process. Extensive experiments show that the trajectories generated by our model can resemble real-world trajectories in terms of some critical statistical metrics, outperforming state-of-the-art algorithms by over 54%.
[256] Accounting for Underspecification in Statistical Claims of Model Superiority
Thomas Sanchez, Pedro M. Gordaliza, Meritxell Bach Cuadra
Main category: cs.LG
TL;DR: Extends statistical framework for false outperformance claims to include underspecification, showing that even small seed variability significantly increases evidence needed for superiority claims in medical imaging.
Details
Motivation: Machine learning in medical imaging often reports statistically fragile improvements, and existing analyses ignore underspecification - where models with similar validation scores behave differently on unseen data due to random initialization.
Method: Extend existing statistical framework modeling false outperformance claims to incorporate underspecification as an additional variance component, using simulations to analyze seed variability effects.
Result: Simulations show that even modest seed variability (~1%) substantially increases the evidence required to support superiority claims in medical imaging systems.
Conclusion: Explicit modeling of training variance is necessary when validating medical imaging systems to account for underspecification effects.
Abstract: Machine learning methods are increasingly applied in medical imaging, yet many reported improvements lack statistical robustness: recent works have highlighted that small but significant performance gains are highly likely to be false positives. However, these analyses do not take underspecification into account – the fact that models achieving similar validation scores may behave differently on unseen data due to random initialization or training dynamics. Here, we extend a recent statistical framework modeling false outperformance claims to include underspecification as an additional variance component. Our simulations demonstrate that even modest seed variability (~1%) substantially increases the evidence required to support superiority claims. Our findings underscore the need for explicit modeling of training variance when validating medical imaging systems.
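A small Monte Carlo sketch shows why ignoring seed variance inflates false superiority claims. It is not the paper's framework. It assumes two methods with identical expected accuracy, independent test sets, and a naive unpaired two-proportion z-test that accounts for test-set sampling noise only. The test-set size, base accuracy, and function name are hypothetical.

```python
import numpy as np


def false_claim_rate(n_test, base_acc, seed_std, n_trials=20000, z_crit=1.96,
                     rng_seed=0):
    """Fraction of trials where a naive z-test declares a difference between
    two methods whose expected accuracy is actually identical."""
    rng = np.random.default_rng(rng_seed)
    # Seed-dependent "true" accuracy of each trained model (underspecification).
    acc_a = np.clip(base_acc + seed_std * rng.standard_normal(n_trials), 0.0, 1.0)
    acc_b = np.clip(base_acc + seed_std * rng.standard_normal(n_trials), 0.0, 1.0)
    # Observed accuracy on a finite test set adds binomial sampling noise.
    p_a = rng.binomial(n_test, acc_a) / n_test
    p_b = rng.binomial(n_test, acc_b) / n_test
    # The naive standard error only models the sampling noise, not the seed noise.
    std_err = np.sqrt(p_a * (1 - p_a) / n_test + p_b * (1 - p_b) / n_test)
    z_scores = (p_a - p_b) / np.maximum(std_err, 1e-12)
    return float(np.mean(np.abs(z_scores) > z_crit))


rate_without_seed_noise = false_claim_rate(n_test=1000, base_acc=0.8, seed_std=0.0)
rate_with_seed_noise = false_claim_rate(n_test=1000, base_acc=0.8, seed_std=0.01)
```

With no seed variability the naive test rejects at roughly its nominal 5% level. A seed standard deviation of one accuracy point adds variance the test does not model, so the rejection rate rises well above that level even though neither method is better.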
[257] Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch
Yirong Zeng, Xiao Ding, Yutai Hou, Yuxian Wang, Li Du, Juyi Dai, Qiuyang Ding, Duyu Tang, Dandan Tu, Weiwen Liu, Bing Qin, Ting Liu
Main category: cs.LG
TL;DR: Pure RL training without supervised fine-tuning can effectively enhance LLMs’ tool-use generalization capabilities through dynamic reward design.
Details
Motivation: Current supervised fine-tuning approaches struggle with generalization to unfamiliar tool-use scenarios, while RL shows promise for better reasoning and generalization abilities.
Method: Proposed dynamic generalization-guided reward design for rule-based RL that shifts from exploratory to exploitative tool-use patterns, and introduced Tool-Zero series models trained directly from base models without post-training.
Result: Achieved over 7% performance improvement compared to both SFT and RL-with-SFT models, with consistent gains across cross-dataset and intra-dataset evaluations.
Conclusion: Pure RL training can effectively elicit LLMs’ intrinsic reasoning capabilities and enhance tool-agnostic generalization, demonstrating the effectiveness and robustness of the proposed methods.
Abstract: Training tool-augmented LLMs has emerged as a promising approach to enhancing language models’ capabilities for complex tasks. The current supervised fine-tuning paradigm relies on constructing extensive domain-specific datasets to train models. However, this approach often struggles to generalize effectively to unfamiliar or intricate tool-use scenarios. Recent work suggests that the reinforcement learning (RL) paradigm can endow LLMs with superior reasoning and generalization abilities. In this work, we address a key question: Can pure RL be used to effectively elicit a model’s intrinsic reasoning capabilities and enhance tool-agnostic generalization? We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns. Based on this design, we introduce the Tool-Zero series models. These models are trained to enable LLMs to autonomously utilize general tools by directly scaling up RL from Zero models (i.e., base models without post-training). Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings. These gains are consistently replicated across cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our methods.
[258] Q-Sat AI: Machine Learning-Based Decision Support for Data Saturation in Qualitative Studies
Hasan Tutar, Caner Erden, Ümit Şentürk
Main category: cs.LG
TL;DR: A machine learning model is developed to objectively determine sample size in qualitative research, replacing the subjective data saturation principle with a systematic approach using ensemble learning.
Details
Motivation: Traditional qualitative research sample size determination relies on subjective data saturation, leading to inconsistencies and threatening methodological rigor. This study aims to make the process more objective and systematic.
Method: Developed an ensemble learning model using ML algorithms (KNN, Gradient Boosting, Random Forest, XGBoost, Decision Tree) trained on data from five qualitative research approaches. Ten parameters including research scope, information power, and researcher competence were evaluated on an ordinal scale as input features.
Result: The ML algorithms achieved high explanatory power (Test R2 ~ 0.85), effectively modeling complex non-linear relationships in qualitative sampling decisions. Feature importance analysis confirmed research design type and information power as critical factors.
Conclusion: Proposes a conceptual framework for a web-based computational application as a decision support system to standardize sample size justification, enhance transparency, and strengthen qualitative inquiry through evidence-based decision-making.
Abstract: The determination of sample size in qualitative research has traditionally relied on the subjective and often ambiguous principle of data saturation, which can lead to inconsistencies and threaten methodological rigor. This study introduces a new, systematic model based on machine learning (ML) to make this process more objective. Utilizing a dataset derived from five fundamental qualitative research approaches - namely, Case Study, Grounded Theory, Phenomenology, Narrative Research, and Ethnographic Research - we developed an ensemble learning model. Ten critical parameters, including research scope, information power, and researcher competence, were evaluated using an ordinal scale and used as input features. After thorough preprocessing and outlier removal, multiple ML algorithms were trained and compared. The K-Nearest Neighbors (KNN), Gradient Boosting (GB), Random Forest (RF), XGBoost, and Decision Tree (DT) algorithms showed the highest explanatory power (Test R2 ~ 0.85), effectively modeling the complex, non-linear relationships involved in qualitative sampling decisions. Feature importance analysis confirmed the vital roles of research design type and information power, providing quantitative validation of key theoretical assumptions in qualitative methodology. The study concludes by proposing a conceptual framework for a web-based computational application designed to serve as a decision support system for qualitative researchers, journal reviewers, and thesis advisors. This model represents a significant step toward standardizing sample size justification, enhancing transparency, and strengthening the epistemological foundation of qualitative inquiry through evidence-based, systematic decision-making.
[259] Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
Abdelaziz Bounhar, Hadi Abdine, Evan Dufraisse, Ahmad Chamma, Amr Mohamed, Dani Bouch, Michalis Vazirgiannis, Guokan Shang
Main category: cs.LG
TL;DR: Retaining and up-weighting moderately easy problems in RLVR training acts as implicit length regularization, achieving baseline accuracy with solutions nearly twice as short without explicit length penalization.
Details
Motivation: Standard RLVR pipelines filter out easy problems, causing models to conflate longer reasoning with better reasoning and become excessively verbose, raising inference costs.
Method: Retain and modestly up-weight moderately easy problems during RLVR training to constrain the output distribution and prevent runaway verbosity.
Result: Achieved baseline pass@1 AIME25 accuracy while generating solutions that are on average nearly twice as short, demonstrating emergent brevity without explicit length penalization.
Conclusion: Exposing models to solvable short-chain tasks during training provides implicit length regularization, enabling more efficient reasoning without sacrificing accuracy.
Abstract: Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out “easy” problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates “thinking longer” with “thinking better”. In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available at https://github.com/MBZUAI-Paris/Frugal-AI , with datasets and models on Hugging Face at https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc .
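The data-mixing idea in this entry can be illustrated with a small sampling sketch. This is not the authors' pipeline: the pass-rate thresholds, the weight of 2.0, the choice to drop problems that are always solved, and all function names are hypothetical, chosen only to show what "retain and modestly up-weight moderately easy problems" might look like as a sampling rule.

```python
import random

def sampling_weight(pass_rate, easy_lo=0.8, easy_hi=1.0, easy_weight=2.0):
    """Return a sampling weight for a problem given its empirical pass rate.

    All thresholds and weights are illustrative assumptions.
    """
    if pass_rate >= easy_hi:
        return 0.0          # always solved: assumed to carry no signal, dropped
    if pass_rate >= easy_lo:
        return easy_weight  # moderately easy: retained and modestly up-weighted
    return 1.0              # harder problems keep unit weight

def sample_batch(problems, batch_size, seed=0):
    """Draw a training batch with probability proportional to the weight."""
    rng = random.Random(seed)
    weights = [sampling_weight(p["pass_rate"]) for p in problems]
    return rng.choices(problems, weights=weights, k=batch_size)

# Hypothetical problem pool with measured pass rates.
problems = [
    {"id": "hard", "pass_rate": 0.10},
    {"id": "medium", "pass_rate": 0.50},
    {"id": "easy", "pass_rate": 0.90},
    {"id": "trivial", "pass_rate": 1.00},
]
batch = sample_batch(problems, batch_size=1000)
counts = {p["id"]: sum(1 for b in batch if b["id"] == p["id"]) for p in problems}
```

With these assumed weights, the moderately easy problem appears about twice as often as each harder one, so short, solvable chains stay present in every batch.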
[260] The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold
Tiberiu Musat
Main category: cs.LG
TL;DR: Explains grokking, where generalization emerges only after the training data has been memorized, as constrained optimization: gradient descent minimizes the weight norm on the zero-loss manifold.
Details
Motivation: To understand grokking, the puzzling delayed generalization that occurs in neural networks after complete memorization of the training data. Previous research linked it to representation learning, but the precise dynamics remained unclear.
Method: A formal proof in the limit of infinitesimally small learning rates and weight decay coefficients, an approximation that decouples the dynamics of a subset of parameters from the rest of the network, and a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network.
Result: Experiments confirm that simulating training with predicted gradients reproduces both delayed generalization and representation learning characteristic of grokking.
Conclusion: Post-memorization learning can be understood as constrained optimization where gradient descent minimizes weight norm on zero-loss manifold, providing mathematical framework for grokking phenomenon.
Abstract: Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
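The claim that gradient descent with weight decay minimizes the weight norm on the zero-loss manifold can be seen in a toy problem. The sketch below is an assumption-laden illustration, not the paper's two-layer analysis: it uses a single linear constraint w1 + w2 = 2, whose zero-loss manifold is a line and whose minimum-norm point is (1, 1). The learning rate, weight decay, and step count are arbitrary.

```python
import math

def train(w, lr=0.1, weight_decay=0.01, steps=10_000):
    """Gradient descent on 0.5*(w1 + w2 - 2)^2 + 0.5*weight_decay*||w||^2."""
    w1, w2 = w
    for _ in range(steps):
        residual = w1 + w2 - 2.0
        g1 = residual + weight_decay * w1
        g2 = residual + weight_decay * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2

def data_loss(w):
    """Loss without the weight decay term; zero on the line w1 + w2 = 2."""
    return 0.5 * (w[0] + w[1] - 2.0) ** 2

w_start = (2.0, 0.0)            # "memorized": zero data loss, but norm 2
w_end = train(w_start)
norm_start = math.hypot(*w_start)
norm_end = math.hypot(*w_end)   # drifts toward sqrt(2), the minimum norm
```

The data loss stays close to zero throughout while the component along the manifold decays slowly at a rate set by the product of learning rate and weight decay, which mirrors the delayed drift described in the abstract.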
[261] Learning a Distance for the Clustering of Patients with Amyotrophic Lateral Sclerosis
Guillaume Tejedor, Veronika Peralta, Nicolas Labroche, Patrick Marcel, Hélène Blasco, Hugo Alarcan
Main category: cs.LG
TL;DR: Proposed a clustering method for ALS patients using disease progression scores that integrates medical expertise and outperforms state-of-the-art methods in survival analysis.
Details
Motivation: ALS has limited treatments with variable patient responses, and research is hindered by small heterogeneous cohorts, sparse data, and lack of clear patient clusters. Existing clustering methods are limited in scope and number.
Method: A clustering approach that groups sequences using a disease progression declarative score and integrates medical expertise through multiple descriptive variables. Several distance measures combining these variables are investigated, both off-the-shelf distances and distances learned with weak supervision, and each is paired with clustering methods.
Result: Evaluation on 353 ALS patients shows method outperforms state-of-the-art in survival analysis while achieving comparable silhouette scores. Learned distances enhance relevance and interpretability for medical experts.
Conclusion: The proposed clustering approach effectively addresses limitations in ALS patient stratification and provides clinically meaningful clusters with improved survival analysis performance.
Abstract: Amyotrophic lateral sclerosis (ALS) is a severe disease with a typical survival of 3-5 years after symptom onset. Current treatments offer only limited life extension, and the variability in patient responses highlights the need for personalized care. However, research is hindered by small, heterogeneous cohorts, sparse longitudinal data, and the lack of a clear definition for clinically meaningful patient clusters. Existing clustering methods remain limited in both scope and number. To address this, we propose a clustering approach that groups sequences using a disease progression declarative score. Our approach integrates medical expertise through multiple descriptive variables, investigating several distance measures combining such variables, both by reusing off-the-shelf distances and employing a weak-supervised learning method. We pair these distances with clustering methods and benchmark them against state-of-the-art techniques. The evaluation of our approach on a dataset of 353 ALS patients from the University Hospital of Tours, shows that our method outperforms state-of-the-art methods in survival analysis while achieving comparable silhouette scores. In addition, the learned distances enhance the relevance and interpretability of results for medical experts.
[262] COFAP: A Universal Framework for COFs Adsorption Prediction through Designed Multi-Modal Extraction and Cross-Modal Synergy
Zihan Li, Mingyang Wan, Mingyu Gao, Zhongshan Chen, Xiangke Wang, Feifan Zhang
Main category: cs.LG
TL;DR: COFAP is a universal deep learning framework that predicts gas adsorption in COFs using multi-modal feature extraction and cross-modal attention, achieving state-of-the-art performance without requiring traditional gas-specific features.
Details
Motivation: Traditional machine learning approaches for COF screening rely on time-consuming gas-specific features, limiting scalability and efficiency in high-throughput screening of the vast COF design space.
Method: COFAP extracts multi-modal structural and chemical features through deep learning and fuses them using a cross-modal attention mechanism, eliminating the need for Henry coefficients or adsorption heat.
Result: COFAP outperforms previous approaches on the hypoCOFs dataset and reveals that high-performing COFs for separation concentrate within narrow ranges of pore size and surface area.
Conclusion: COFAP provides superior efficiency and accuracy for COF screening, with a weight-adjustable prioritization scheme that enables flexible, application-specific ranking of candidate materials for researchers.
Abstract: Covalent organic frameworks (COFs) are promising adsorbents for gas adsorption and separation, while identifying the optimal structures among their vast design space requires efficient high-throughput screening. Conventional machine-learning predictors rely heavily on specific gas-related features. However, these features are time-consuming and limit scalability, leading to inefficiency and labor-intensive processes. Herein, a universal COFs adsorption prediction framework (COFAP) is proposed, which can extract multi-modal structural and chemical features through deep learning, and fuse these complementary features via cross-modal attention mechanism. Without Henry coefficients or adsorption heat, COFAP sets a new SOTA by outperforming previous approaches on hypoCOFs dataset. Based on COFAP, we also found that high-performing COFs for separation concentrate within a narrow range of pore size and surface area. A weight-adjustable prioritization scheme is also developed to enable flexible, application-specific ranking of candidate COFs for researchers. Superior efficiency and accuracy render COFAP directly deployable in crystalline porous materials.
[263] Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula
Brennen Hill
Main category: cs.LG
TL;DR: Adversarial co-evolution between procedurally generated attackers and cooperative defenders creates self-scaling environments for training general-purpose intelligent agents.
Details
Motivation: Hand-crafted environments are finite and contain implicit biases, limiting development of truly generalizable and robust skills in agents. Scaling environmental complexity, diversity, and interactivity remains a crucial bottleneck.
Method: Framing environment generation as an adversarial game where a procedurally generative attacker creates increasingly challenging enemy configurations, while a defender team learns cooperative strategies. This co-evolutionary dynamic creates self-scaling complexity.
Result: Minimal training leads to emergence of complex intelligent behaviors: flanking and shielding by attackers, and focus-fire and spreading by defenders. Creates effectively infinite stream of novel and relevant training data.
Conclusion: Adversarial co-evolution is a powerful mechanism for automatically scaling environmental complexity, driving agents towards greater robustness and strategic depth.
Abstract: The advancement of general-purpose intelligent agents is intrinsically linked to the environments in which they are trained. While scaling models and datasets has yielded remarkable capabilities, scaling the complexity, diversity, and interactivity of environments remains a crucial bottleneck. Hand-crafted environments are finite and often contain implicit biases, limiting the potential for agents to develop truly generalizable and robust skills. In this work, we propose a paradigm for generating a boundless and adaptive curriculum of challenges by framing the environment generation process as an adversarial game. We introduce a system where a team of cooperative multi-agent defenders learns to survive against a procedurally generative attacker. The attacker agent learns to produce increasingly challenging configurations of enemy units, dynamically creating novel worlds tailored to exploit the defenders’ current weaknesses. Concurrently, the defender team learns cooperative strategies to overcome these generated threats. This co-evolutionary dynamic creates a self-scaling environment where complexity arises organically from the adversarial interaction, providing an effectively infinite stream of novel and relevant training data. We demonstrate that with minimal training, this approach leads to the emergence of complex, intelligent behaviors, such as flanking and shielding by the attacker, and focus-fire and spreading by the defenders. Our findings suggest that adversarial co-evolution is a powerful mechanism for automatically scaling environmental complexity, driving agents towards greater robustness and strategic depth.
[264] Interpretable Heart Disease Prediction via a Weighted Ensemble Model: A Large-Scale Study with SHAP and Surrogate Decision Trees
Md Abrar Hasnat, Md Jobayer, Md. Mehedi Hasan Shawon, Md. Golam Rabiul Alam
Main category: cs.LG
TL;DR: A weighted ensemble model combining tree-based methods and CNN achieves improved CVD risk prediction with high recall and clinical interpretability using SHAP and surrogate decision trees.
Details
Motivation: Cardiovascular disease is a major global health issue requiring reliable and interpretable predictive models for early risk assessment.
Method: Developed a strategically weighted ensemble model combining LightGBM, XGBoost, and CNN on a dataset of 229,781 patients, using strategic weighting for class imbalance and feature engineering to expand from 22 to 25 features.
Result: Achieved statistically significant improvement over individual models with Test AUC of 0.8371 (p=0.003) and 80.0% recall, with enhanced interpretability through SHAP and surrogate decision trees.
Conclusion: The ensemble model provides robust predictive performance and clinical transparency, making it suitable for real-world deployment in public health screening.
Abstract: Cardiovascular disease (CVD) remains a critical global health concern, demanding reliable and interpretable predictive models for early risk assessment. This study presents a large-scale analysis using the Heart Disease Health Indicators Dataset, developing a strategically weighted ensemble model that combines tree-based methods (LightGBM, XGBoost) with a Convolutional Neural Network (CNN) to predict CVD risk. The model was trained on a preprocessed dataset of 229,781 patients where the inherent class imbalance was managed through strategic weighting and feature engineering enhanced the original 22 features to 25. The final ensemble achieves a statistically significant improvement over the best individual model, with a Test AUC of 0.8371 (p=0.003) and is particularly suited for screening with a high recall of 80.0%. To provide transparency and clinical interpretability, surrogate decision trees and SHapley Additive exPlanations (SHAP) are used. The proposed model delivers a combination of robust predictive performance and clinical transparency by blending diverse learning architectures and incorporating explainability through SHAP and surrogate decision trees, making it a strong candidate for real-world deployment in public health screening.
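As a rough illustration of what a weighted ensemble over heterogeneous models involves, the sketch below blends per-model probabilities with fixed weights. The probabilities, the weights, and the 0.5 threshold are hypothetical; the entry does not state how the authors chose their weights, so none of these values should be read as theirs.

```python
def weighted_ensemble(model_probs, weights):
    """Blend per-model positive-class probabilities with normalized weights."""
    total = sum(weights.values())
    n_samples = len(next(iter(model_probs.values())))
    blended = []
    for i in range(n_samples):
        score = sum(weights[name] * probs[i] for name, probs in model_probs.items())
        blended.append(score / total)
    return blended

# Hypothetical predicted probabilities for three patients.
model_probs = {
    "lightgbm": [0.2, 0.7, 0.9],
    "xgboost": [0.3, 0.6, 0.8],
    "cnn": [0.1, 0.9, 0.4],
}
weights = {"lightgbm": 0.4, "xgboost": 0.4, "cnn": 0.2}  # assumed values

blended = weighted_ensemble(model_probs, weights)
# Lowering the decision threshold would trade precision for recall.
flags = [p >= 0.5 for p in blended]
```

For screening use, the threshold is the main lever: a lower cut-off flags more patients and raises recall at the cost of more false positives.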
[265] EchoLSTM: A Self-Reflective Recurrent Network for Stabilizing Long-Range Memory
Prasanth K K, Shubham Sharma
Main category: cs.LG
TL;DR: EchoLSTM introduces output-conditioned gating for self-reflection, enabling better long-range dependency modeling in noisy sequences while being parameter-efficient.
Details
Motivation: Standard RNNs and LSTMs struggle with long-range dependencies in sequences with noisy or misleading information, requiring more robust memory systems.
Method: Output-Conditioned Gating principle that modulates internal memory gates based on past inferences, creating a stabilizing feedback loop, combined with an attention mechanism.
Result: 69.0% accuracy on Distractor Signal Task (33pp improvement over LSTM), 69.8% on ListOps benchmark (competitive with Transformer’s 71.8% but 5x more parameter-efficient), and qualitative evidence of robust memory.
Conclusion: EchoLSTM’s self-reflective mechanism provides fundamentally more robust memory systems for handling noisy sequences while maintaining parameter efficiency.
Abstract: Standard Recurrent Neural Networks, including LSTMs, struggle to model long-range dependencies, particularly in sequences containing noisy or misleading information. We propose a new architectural principle, Output-Conditioned Gating, which enables a model to perform self-reflection by modulating its internal memory gates based on its own past inferences. This creates a stabilizing feedback loop that enhances memory retention. Our final model, the EchoLSTM, integrates this principle with an attention mechanism. We evaluate the EchoLSTM on a series of challenging benchmarks. On a custom-designed Distractor Signal Task, the EchoLSTM achieves 69.0% accuracy, decisively outperforming a standard LSTM baseline by 33 percentage points. Furthermore, on the standard ListOps benchmark, the EchoLSTM achieves performance competitive with a modern Transformer model, 69.8% vs. 71.8%, while being over 5 times more parameter-efficient. A final Trigger Sensitivity Test provides qualitative evidence that our model’s self-reflective mechanism leads to a fundamentally more robust memory system.
[266] NeuroClean: A Generalized Machine-Learning Approach to Neural Time-Series Conditioning
Manuel A. Hernandez Alonso, Michael Depass, Stephan Quessy, Numa Dancause, Ignasi Cos
Main category: cs.LG
TL;DR: NeuroClean is an unsupervised EEG/LFP preprocessing pipeline that automatically removes artifacts while preserving task-relevant information, achieving more than 97% classification accuracy in motor tasks.
Details
Motivation: EEG and LFP recordings suffer from various artifacts and noise, requiring automated preprocessing to ensure reliability and reproducibility while avoiding human intervention biases.
Method: Five-step pipeline including bandpass/line noise filtering, bad channel rejection, and efficient ICA with automatic component rejection using a clustering-based machine learning classifier.
Result: Removed common artifacts and achieved more than 97% accuracy in motor task classification (vs. 74% with raw data and a 33.3% chance level) using an optimized Multinomial Logistic Regression model.
Conclusion: NeuroClean is a promising automated preprocessing workflow that improves machine learning performance and generalization for EEG/LFP studies.
Abstract: Electroencephalography (EEG) and local field potentials (LFP) are two widely used techniques to record electrical activity from the brain. These signals are used in both the clinical and research domains for multiple applications. However, most brain data recordings suffer from a myriad of artifacts and noise sources other than the brain itself. Thus, a major requirement for their use is proper and, given current volumes of data, a fully automatized conditioning. As a means to this end, here we introduce an unsupervised, multipurpose EEG/LFP preprocessing method, the NeuroClean pipeline. In addition to its completeness and reliability, NeuroClean is an unsupervised series of algorithms intended to mitigate reproducibility issues and biases caused by human intervention. The pipeline is designed as a five-step process, including the common bandpass and line noise filtering, and bad channel rejection. However, it incorporates an efficient independent component analysis with an automatic component rejection based on a clustering algorithm. This machine learning classifier is used to ensure that task-relevant information is preserved after each step of the cleaning process. We used several data sets to validate the pipeline. NeuroClean removed several common types of artifacts from the signal. Moreover, in the context of motor tasks of varying complexity, it yielded more than 97% accuracy (vs. a chance-level of 33.3%) in an optimized Multinomial Logistic Regression model after cleaning the data, compared to the raw data, which performed at 74% accuracy. These results show that NeuroClean is a promising pipeline and workflow that can be applied to future work and studies to achieve better generalization and performance on machine learning pipelines.
[267] Bulk-boundary decomposition of neural networks
Donghee Lee, Hye-Sung Lee, Jaeok Yi
Main category: cs.LG
TL;DR: The paper introduces a bulk-boundary decomposition framework to analyze deep neural network training dynamics, separating intrinsic architectural dynamics from stochastic data interactions.
Details
Motivation: To provide a new theoretical framework for understanding the training dynamics of deep neural networks by decomposing them into fundamental components.
Method: Reorganizes the stochastic gradient descent Lagrangian into data-independent bulk terms (network architecture, activation functions) and data-dependent boundary terms (stochastic interactions from training samples).
Result: The decomposition reveals the local and homogeneous structure underlying deep networks and enables a field-theoretic formulation of neural dynamics.
Conclusion: The bulk-boundary decomposition offers a powerful framework for analyzing neural network training, exposing fundamental structural properties and enabling field-theoretic approaches.
Abstract: We present the bulk-boundary decomposition as a new framework for understanding the training dynamics of deep neural networks. Starting from the stochastic gradient descent formulation, we show that the Lagrangian can be reorganized into a data-independent bulk term and a data-dependent boundary term. The bulk captures the intrinsic dynamics set by network architecture and activation functions, while the boundary reflects stochastic interactions from training samples at the input and output layers. This decomposition exposes the local and homogeneous structure underlying deep networks. As a natural extension, we develop a field-theoretic formulation of neural dynamics based on this decomposition.
[268] TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
Aditya Sridhar, Nish Sinnadurai, Sean Lie, Vithursan Thangarasa
Main category: cs.LG
TL;DR: TapOut is a training-free dynamic speculation algorithm using multi-armed bandits to automatically select the optimal number of tokens to draft in speculative decoding, eliminating the need for hand-tuned thresholds.
Details
Motivation: Existing dynamic speculative decoding methods rely on sensitive hand-tuned thresholds that are costly to set and generalize poorly across models and domains, limiting the effectiveness of speculative decoding acceleration.
Method: Proposes TapOut, an online training-free algorithm using multi-armed bandits to select among multiple parameter-free dynamic speculation strategies based on past reward and exploration, without requiring hyperparameter tuning.
Result: Extensive experiments across diverse model pairs and datasets show TapOut achieves competitive or superior speedups compared to established dynamic speculation baselines.
Conclusion: TapOut provides an effective plug-and-play solution for dynamic speculation policy selection that eliminates the need for manual threshold tuning while maintaining strong performance across different models and domains.
Abstract: Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach’s effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
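Selecting among strategies "based on past reward and exploration" is the standard multi-armed bandit setting. The sketch below uses UCB1 as one common bandit rule; the entry does not say which bandit algorithm TapOut uses, so UCB1, the simulated reward probabilities, and the idea of one arm per drafting strategy are all assumptions for illustration.

```python
import math
import random

class UCB1:
    """Minimal UCB1 bandit: pick the arm with the best optimistic estimate."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select(self):
        # Play every arm once before using confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Each arm stands for a hypothetical drafting strategy; rewards are simulated.
rng = random.Random(0)
reward_prob = [0.3, 0.5, 0.8]   # assumed per-strategy success rates
bandit = UCB1(n_arms=len(reward_prob))
for _ in range(5000):
    arm = bandit.select()
    reward = 1.0 if rng.random() < reward_prob[arm] else 0.0
    bandit.update(arm, reward)
```

After enough rounds, most pulls concentrate on the arm with the highest average reward, with no threshold to tune by hand.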
[269] Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
Daniel Aarao Reis Arturi, Eric Zhang, Andrew Ansah, Kevin Zhu, Ashwinee Panda, Aishwarya Balwani
Main category: cs.LG
TL;DR: Emergent misalignment in LLMs shows cross-task linear structure - narrow harmful fine-tuning discovers shared parameter directions that enable broad misalignment across domains.
Details
Motivation: To understand the fundamental mechanisms behind emergent misalignment, where LLMs develop broadly misaligned behaviors after fine-tuning on narrowly harmful datasets.
Method: Adopted a geometric perspective to study EM, analyzing weight updates, cosine similarities, principal angles, projection overlaps, and linear mode connectivity across different misalignment tasks.
Result: Found strong convergence in EM parameters across tasks with high cosine similarities, shared lower-dimensional subspaces, and functional equivalence via linear mode connectivity.
Conclusion: EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting harmful behaviors are organized into specific, predictable regions of the weight landscape.
Abstract: Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps. Furthermore, we also show functional equivalence via linear mode connectivity, wherein interpolated models across narrow misalignment tasks maintain coherent, broadly misaligned behavior. Our results indicate that EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting that harmful behaviors may be organized into specific, predictable regions of the weight landscape. By revealing this fundamental connection between parametric geometry and behavioral outcomes, we hope our work catalyzes further research on parameter space interpretability and weight-based interventions.
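Two of the geometric measurements named here are simple to state in code: the cosine similarity between flattened weight updates, and the linear interpolation between two parameter vectors used when testing linear mode connectivity. The sketch below uses tiny made-up vectors in place of real fine-tuning deltas, and it omits the model evaluation along the interpolation path that a full connectivity test would need.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened weight-update vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

# Hypothetical weight updates (fine-tuned minus base) from two narrow tasks.
delta_task_a = [1.0, 2.0, 0.0]
delta_task_b = [2.0, 4.0, 0.5]

similarity = cosine_similarity(delta_task_a, delta_task_b)
midpoint = interpolate(delta_task_a, delta_task_b, alpha=0.5)
```

A similarity near 1 means the two updates point in almost the same direction; linear mode connectivity would additionally require that models at intermediate `alpha` values behave like the endpoints.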
[270] Path-Coordinated Continual Learning with Neural Tangent Kernel-Justified Plasticity: A Theoretical Framework with Near State-of-the-Art Performance
Rathin Chandra Shit
Main category: cs.LG
TL;DR: A new path-coordinated continual learning framework that combines NTK theory, Wilson confidence intervals, and multi-metric path evaluation to address catastrophic forgetting, achieving 66.7% accuracy with 23.4% forgetting on Split-CIFAR10.
Details
Motivation: To solve catastrophic forgetting in neural networks when learning new tasks, which is a fundamental challenge in continual learning.
Method: Path-coordinated framework unifying Neural Tangent Kernel theory for plasticity bounds, statistical validation via Wilson confidence intervals, and multi-metric path quality evaluation.
Result: Achieved 66.7% average accuracy with 23.4% catastrophic forgetting on Split-CIFAR10, near state-of-the-art performance. Found NTK condition numbers >10^11 indicate critical learning capacity limits. Forgetting decreased from 27% to 18% over task sequence.
Conclusion: The framework successfully addresses catastrophic forgetting with statistical guarantees (80% path validation), maintains 90-97% retention on intermediate tasks, and provides insights for adaptive regularization enhancement.
Abstract: Catastrophic forgetting is one of the fundamental issues of continual learning because neural networks forget the tasks learned previously when trained on new tasks. The proposed framework is a new path-coordinated framework of continual learning that unites the Neural Tangent Kernel (NTK) theory of principled plasticity bounds, statistical validation by Wilson confidence intervals, and evaluation of path quality by the use of multiple metrics. Experimental evaluation shows an average accuracy of 66.7% at the cost of 23.4% catastrophic forgetting on Split-CIFAR10, a huge improvement over the baseline and competitive performance achieved, which is very close to state-of-the-art results. Further, it is found out that NTK condition numbers are predictive indicators of learning capacity limits, showing the existence of a critical threshold at condition number $>10^{11}$. It is interesting to note that the proposed strategy shows a tendency of lowering forgetting as the sequence of tasks progresses (27% to 18%), which is a system stabilization. The framework validates 80% of discovered paths with a rigorous statistical guarantee and maintains 90-97% retention on intermediate tasks. The core capacity limits of the continual learning environment are determined in the analysis, and actionable insights to enhance the adaptive regularization are offered.
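The Wilson confidence interval mentioned here is a standard interval for a binomial proportion. The sketch below implements the usual Wilson score formula; the counts (8 validated paths out of 10) are hypothetical, since the entry reports only the 80% rate and not the number of paths.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials**2))
    ) / denom
    return center - half_width, center + half_width

# Hypothetical counts: 8 of 10 candidate paths pass validation.
lower, upper = wilson_interval(successes=8, trials=10)
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small samples, which matters when only a handful of paths are evaluated.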
[271] RobustFSM: Submodular Maximization in Federated Setting with Malicious Clients
Duc A. Tran, Dung Truong, Duy Le
Main category: cs.LG
TL;DR: RobustFSM is a federated submodular maximization solution that protects against client misbehaviors and fake information sharing, achieving up to 200% improvement over conventional methods under severe attacks.
Details
Motivation: To address privacy and autonomy concerns in federated settings while protecting against malicious clients who might share fake information, similar to backdoor attacks in federated learning but with unique challenges due to submodular maximization characteristics.
Method: Proposed RobustFSM, a federated submodular maximization solution designed to be robust against various practical client attacks through repetitive aggregation of local information while maintaining client privacy.
Result: Empirical evaluation using real-world datasets shows RobustFSM substantially exceeds conventional federated algorithm performance when attacks are severe, with improvements up to 200% depending on dataset and attack scenarios.
Conclusion: RobustFSM effectively addresses client misbehavior vulnerabilities in federated submodular maximization, providing significant performance improvements over traditional approaches under attack conditions.
Abstract: Submodular maximization is an optimization problem benefiting many machine learning applications, where we seek a small subset best representing an extremely large dataset. We focus on the federated setting where the data are locally owned by decentralized clients who have their own definitions for the quality of representability. This setting requires repetitive aggregation of local information computed by the clients. While the main motivation is to respect the privacy and autonomy of the clients, the federated setting is vulnerable to client misbehaviors: malicious clients might share fake information. An analogy is backdoor attack in conventional federated learning, but our challenge differs freshly due to the unique characteristics of submodular maximization. We propose RobustFSM, a federated submodular maximization solution that is robust to various practical client attacks. Its performance is substantiated with an empirical evaluation study using real-world datasets. Numerical results show that the solution quality of RobustFSM substantially exceeds that of the conventional federated algorithm when attacks are severe. The degree of this improvement depends on the dataset and attack scenarios, which can be as high as 200%.
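For readers unfamiliar with submodular maximization, the sketch below shows the classic centralized greedy algorithm on a set-coverage objective, which is a standard submodular function. It is not RobustFSM and has no federated or robustness component; the candidate sets and the budget `k` are made up for the example.

```python
def greedy_max_coverage(candidates, k):
    """Pick up to k candidate sets, each time taking the largest marginal gain."""
    selected = []
    covered = set()
    for _ in range(k):
        best_name, best_gain = None, 0
        for name, items in candidates.items():
            if name in selected:
                continue
            gain = len(items - covered)   # marginal coverage of this candidate
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:             # nothing adds new coverage
            break
        selected.append(best_name)
        covered |= candidates[best_name]
    return selected, covered

# Hypothetical candidates: each covers some elements of the ground set 1..6.
candidates = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 6},
}
selected, covered = greedy_max_coverage(candidates, k=2)
```

Because each step relies on reported marginal gains, a participant that reports inflated gains could steer the selection, which is the kind of vulnerability the federated setting has to guard against.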
[272] Predicting Microbial Interactions Using Graph Neural Networks
Elham Gholamzadeh, Kajal Singla, Nico Scherf
Main category: cs.LG
TL;DR: Uses Graph Neural Networks to predict interspecies microbial interactions from monoculture growth data, interactions with other species, and phylogeny, achieving an 80.44% F1-score on interaction direction and also classifying complex interaction types.
Details
Motivation: Predicting interspecies interactions is critical for understanding microbial community structure and activity, but remains a key challenge in microbial ecology.
Method: Constructed edge-graphs of pairwise microbial interactions and used Graph Neural Networks (GNNs) to leverage shared information across co-culture experiments, trained on a dataset of 7,500+ interactions between 20 species under 40 carbon conditions.
Result: Achieved 80.44% F1-score for predicting interaction directions, significantly outperforming Extreme Gradient Boosting (XGBoost) models (72.76% F1-score). The model can predict both binary interactions and complex types like mutualism, competition, and parasitism.
Conclusion: GNNs are a powerful classifier for predicting microbial interaction modes, offering superior performance over traditional methods and enabling classification of complex interaction types.
Abstract: Predicting interspecies interactions is a key challenge in microbial ecology, as these interactions are critical to determining the structure and activity of microbial communities. In this work, we used data on monoculture growth capabilities, interactions with other species, and phylogeny to predict a negative or positive effect of interactions. More precisely, we used one of the largest available pairwise interaction datasets to train our models, comprising over 7,500 interactions between 20 species from two taxonomic groups co-cultured under 40 distinct carbon conditions, with a primary focus on the work of Nestor et al. [28]. In this work, we propose Graph Neural Networks (GNNs) as a powerful classifier to predict the direction of the effect. We construct edge-graphs of pairwise microbial interactions in order to leverage shared information across individual co-culture experiments, and use GNNs to predict modes of interaction. Our model can not only predict binary interactions (positive/negative) but also classify more complex interaction types such as mutualism, competition, and parasitism. Our initial results were encouraging, achieving an F1-score of 80.44%. This significantly outperforms comparable methods in the literature, including conventional Extreme Gradient Boosting (XGBoost) models, which reported an F1-score of 72.76%.
[273] Quantum-Enhanced Generative Models for Rare Event Prediction
M. Z. Haider, M. U. Ghouri, Tayyaba Noreen, M. Salman
Main category: cs.LG
TL;DR: QEGM is a hybrid classical-quantum generative model that improves rare-event modeling by combining deep latent-variable models with variational quantum circuits, achieving better tail distribution capture and reduced mode collapse.
Details
Motivation: Rare events like financial crashes and climate extremes are difficult to model due to scarcity and heavy-tailed distributions, with classical deep generative models struggling to capture these occurrences and producing poor uncertainty estimates.
Method: Hybrid classical-quantum framework integrating deep latent-variable models with variational quantum circuits, featuring hybrid loss function for reconstruction fidelity and tail-aware likelihood, plus quantum randomness-driven noise injection. Training uses hybrid loop with classical backpropagation and quantum parameter-shift gradients.
Result: QEGM reduces tail KL divergence by up to 50% compared to state-of-the-art baselines (GAN, VAE, Diffusion), while improving rare-event recall and coverage calibration on synthetic Gaussian mixtures and real-world datasets in finance, climate, and protein structure.
Conclusion: QEGM demonstrates potential as a principled approach for rare-event prediction, offering robustness beyond purely classical methods through quantum-enhanced modeling capabilities.
Abstract: Rare events such as financial crashes, climate extremes, and biological anomalies are notoriously difficult to model due to their scarcity and heavy-tailed distributions. Classical deep generative models often struggle to capture these rare occurrences, either collapsing low-probability modes or producing poorly calibrated uncertainty estimates. In this work, we propose the Quantum-Enhanced Generative Model (QEGM), a hybrid classical-quantum framework that integrates deep latent-variable models with variational quantum circuits. The framework introduces two key innovations: (1) a hybrid loss function that jointly optimizes reconstruction fidelity and tail-aware likelihood, and (2) quantum randomness-driven noise injection to enhance sample diversity and mitigate mode collapse. Training proceeds via a hybrid loop where classical parameters are updated through backpropagation while quantum parameters are optimized using parameter-shift gradients. We evaluate QEGM on synthetic Gaussian mixtures and real-world datasets spanning finance, climate, and protein structure. Results demonstrate that QEGM reduces tail KL divergence by up to 50 percent compared to state-of-the-art baselines (GAN, VAE, Diffusion), while improving rare-event recall and coverage calibration. These findings highlight the potential of QEGM as a principled approach for rare-event prediction, offering robustness beyond what is achievable with purely classical methods.
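The parameter-shift rule mentioned above can be illustrated on a single simulated qubit. This is a minimal sketch and not the QEGM training loop: the circuit (one RY rotation followed by a Z measurement) and the function names are assumptions made for illustration. For a gate of the form exp(-i θ/2 P) with Pauli generator P, the gradient of an expectation value is obtained exactly from two shifted evaluations.

```python
import math


def expect_z(theta: float) -> float:
    """<Z> after applying RY(theta) to |0>; the state is [cos(theta/2), sin(theta/2)]."""
    a, b = math.cos(theta / 2), math.sin(theta / 2)
    return a * a - b * b  # analytically equal to cos(theta)


def parameter_shift_grad(f, theta: float) -> float:
    """Gradient of f at theta from two evaluations shifted by +/- pi/2."""
    shift = math.pi / 2
    return 0.5 * (f(theta + shift) - f(theta - shift))


theta = 0.7
grad = parameter_shift_grad(expect_z, theta)  # analytically -sin(theta)
```

On hardware, each call to `f` would be an estimated expectation value from repeated circuit runs, so the gradient inherits shot noise; the classical parameters in a hybrid loop are updated by ordinary backpropagation.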
[274] Regularization Through Reasoning: Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning
Vivswan Shah, Randy Cogill, Hanwei Yue, Gopinath Chennupati, Rinat Khaziev
Main category: cs.LG
TL;DR: Adding explanations to labels during fine-tuning improves LLM classification performance, but surprisingly, even random tokens with similar vocabulary structure provide similar benefits, suggesting the gains come from structural regularization rather than semantic meaning.
Details
Motivation: To investigate whether attaching brief explanations to labels during fine-tuning yields better classification models than label-only training.
Method: Fine-tuned a 7B-parameter model using ensemble-generated data from multiple LLMs, testing across six diverse conversational datasets with three evaluation axes (naturalness, comprehensiveness, on-topic adherence). Compared label-only training against label-plus-explanation training, including experiments with syntactically incoherent pseudo-explanations.
Result: Label-plus-explanation training outperformed label-only baselines across 18 dataset-task settings. Surprisingly, even random tokens (shuffled or bag-of-words variants) that lacked semantics but maintained vocabulary alignment improved accuracy over label-only training and narrowed much of the gap to true explanations.
Conclusion: The benefits of explanation-augmented fine-tuning arise less from semantic meaning than from structural properties - the extra token budget encourages richer intermediate computation and acts as a regularizer that reduces over-confident shortcuts, leading to improved accuracy and reliability for LLM classification.
Abstract: Fine-tuning LLMs for classification typically maps inputs directly to labels. We ask whether attaching brief explanations to each label during fine-tuning yields better models. We evaluate conversational response quality along three axes: naturalness, comprehensiveness, and on-topic adherence, each rated on 5-point scales. Using ensemble-generated data from multiple LLMs, we fine-tune a 7B-parameter model and test across six diverse conversational datasets. Across 18 dataset-task settings, label-plus-explanation training outperforms label-only baselines. A central and unexpected result concerns random tokens. We replace human-written explanations with text that is syntactically incoherent yet vocabulary-aligned with the originals (e.g., shuffled or bag-of-words variants). Despite lacking semantics, these pseudo-explanations still improve accuracy over label-only training and often narrow much of the gap to true explanations. The effect persists across datasets and training seeds, indicating that gains arise less from meaning than from structure: the extra token budget encourages richer intermediate computation and acts as a regularizer that reduces over-confident shortcuts. Internal analyses support this view: explanation-augmented models exhibit higher activation entropy in intermediate layers alongside sharper predictive mass at the output layer, consistent with increased deliberation before decision. Overall, explanation-augmented fine-tuning, whether with genuine rationales or carefully constructed random token sequences, improves accuracy and reliability for LLM classification while clarifying how token-level scaffolding shapes computation during inference.
[275] Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Angélica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali
Main category: cs.LG
TL;DR: Flashlight is a compiler-native framework that automatically generates efficient FlashAttention-style kernels for arbitrary attention-based programs in PyTorch, supporting more general attention patterns than previous approaches like FlexAttention.
Details
Motivation: Existing attention optimization approaches like FlashAttention require specialized kernels or hand-tuned implementations, and FlexAttention only supports a limited subset of attention variants using static templates. There's a need for a more flexible solution that can handle diverse attention patterns automatically.
Method: Flashlight leverages PyTorch’s compilation workflow to transparently fuse and tile attention computations, automatically generating optimized kernels for arbitrary attention-based programs without relying on static templates or predefined specializations.
Result: Flashlight produces kernels with competitive or superior performance to FlexAttention while supporting all variants expressible in FlexAttention plus more general, data-dependent attention formulations that FlexAttention cannot handle.
Conclusion: Flashlight enables developers to rapidly explore new attention models with native PyTorch code flexibility while maintaining high performance, bridging the gap between flexibility and efficiency in attention optimization.
Abstract: Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch’s compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
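The tiling idea behind FlashAttention-style kernels can be sketched in NumPy as a running ("online") softmax over key/value blocks. This is a CPU illustration of the numerical technique only, not Flashlight's compiler pass or generated kernels; the function names and the block size are assumptions.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v with the full score matrix materialized."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, block=4):
    """Same result, visiting keys/values block by block with a running softmax."""
    n, d = q.shape
    m = np.full((n, 1), -np.inf)       # running row-wise maximum of the scores
    l = np.zeros((n, 1))               # running softmax denominator
    acc = np.zeros((n, v.shape[-1]))   # running unnormalized output
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)      # rescale what was accumulated so far
        p = np.exp(s - m_new)
        l = l * scale + p.sum(axis=-1, keepdims=True)
        acc = acc * scale + p @ vb
        m = m_new
    return acc / l


rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(6, 8)), rng.normal(size=(10, 8)), rng.normal(size=(10, 5))
out_tiled = tiled_attention(q, k, v, block=4)
out_naive = naive_attention(q, k, v)
```

Because only one block of scores exists at a time, the full score matrix never has to be stored, which is what makes fusing the whole computation into a single kernel practical.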
[276] A Dual-Use Framework for Clinical Gait Analysis: Attention-Based Sensor Optimization and Automated Dataset Auditing
Hamidreza Sadeghsalehi
Main category: cs.LG
TL;DR: A multi-stream attention-based deep learning framework that serves as both sensor optimizer and automated data auditor, revealing severe dataset biases in gait analysis while proposing novel sensor synergies.
Details
Motivation: Objective gait analysis using wearable sensors and AI is crucial for managing neurological and orthopedic conditions, but models are vulnerable to hidden dataset biases and task-specific sensor optimization remains challenging.
Method: Multi-stream attention-based deep learning framework applied to the Voisard et al. (2025) multi-cohort gait dataset on four clinical tasks (PD, OA, CVA screening; PD vs CVA differential).
Result: The model’s attention mechanism quantitatively discovered a severe dataset confound - for OA and CVA screening, it assigned >70% attention to Right Foot while ignoring Left Foot (<0.1%), revealing severe laterality bias in the public dataset (e.g., 15 of 15 right-sided OA).
Conclusion: The primary contribution is methodological, demonstrating that interpretable frameworks can automatically audit dataset integrity. As secondary finding, the model proposes novel data-driven sensor synergies (e.g., Head plus Foot for PD screening) as hypotheses for future optimized protocols.
Abstract: Objective gait analysis using wearable sensors and AI is critical for managing neurological and orthopedic conditions. However, models are vulnerable to hidden dataset biases, and task-specific sensor optimization remains a challenge. We propose a multi-stream attention-based deep learning framework that functions as both a sensor optimizer and an automated data auditor. Applied to the Voisard et al. (2025) multi-cohort gait dataset on four clinical tasks (PD, OA, CVA screening; PD vs CVA differential), the model’s attention mechanism quantitatively discovered a severe dataset confound. For OA and CVA screening, tasks where bilateral assessment is clinically essential, the model assigned more than 70 percent attention to the Right Foot while statistically ignoring the Left Foot (less than 0.1 percent attention, 95 percent CI [0.0-0.1]). This was not a clinical finding but a direct reflection of a severe laterality bias (for example, 15 of 15 right-sided OA) in the public dataset. The primary contribution of this work is methodological, demonstrating that an interpretable framework can automatically audit dataset integrity. As a secondary finding, the model proposes novel, data-driven sensor synergies (for example, Head plus Foot for PD screening) as hypotheses for future optimized protocols.
[277] LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS
Stefan F. Schouten, Peter Bloem
Main category: cs.LG
TL;DR: Reformulating Contrast-Consistent Search (CCS) as an eigenproblem to improve understanding and avoid initialization sensitivity while maintaining performance.
Details
Motivation: To clarify the mechanisms of CCS and extend its applicability by addressing its partially understood two-term objective and sensitivity to random initialization.
Method: Reformulate CCS as an eigenproblem based on relative contrast consistency, yielding closed-form solutions with interpretable eigenvalues and extensions to multiple variables.
Result: The eigenproblem approach recovers similar performance to CCS across various datasets while avoiding its sensitivity to random initialization.
Conclusion: Relativizing contrast consistency improves understanding of CCS and opens pathways for broader probing and mechanistic interpretability methods.
Abstract: Contrast-Consistent Search (CCS) is an unsupervised probing method able to test whether large language models represent binary features, such as sentence truth, in their internal activations. While CCS has shown promise, its two-term objective has been only partially understood. In this work, we revisit CCS with the aim of clarifying its mechanisms and extending its applicability. We argue that the quantity to optimize is relative contrast consistency. Building on this insight, we reformulate CCS as an eigenproblem, yielding closed-form solutions with interpretable eigenvalues and natural extensions to multiple variables. We evaluate these approaches across a range of datasets, finding that they recover similar performance to CCS, while avoiding problems around sensitivity to random initialization. Our results suggest that relativizing contrast consistency not only improves our understanding of CCS but also opens pathways for broader probing and mechanistic interpretability methods.
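For context, the two-term objective of the original CCS method (not the eigenproblem reformulation proposed in this paper) combines a consistency term and a confidence term over probe outputs for a statement and its negation. The sketch below assumes the probe probabilities are already computed; the function name is hypothetical.

```python
from typing import Sequence


def ccs_loss(p_pos: Sequence[float], p_neg: Sequence[float]) -> float:
    """Mean of the consistency and confidence terms over contrast pairs.

    p_pos[i] and p_neg[i] are probe outputs in [0, 1] for a statement and its negation.
    """
    total = 0.0
    for pp, pn in zip(p_pos, p_neg):
        consistency = (pp - (1.0 - pn)) ** 2  # the two probabilities should sum to 1
        confidence = min(pp, pn) ** 2         # penalizes the degenerate answer 0.5 / 0.5
        total += consistency + confidence
    return total / len(p_pos)


confident = ccs_loss([1.0, 0.0], [0.0, 1.0])
degenerate = ccs_loss([0.5, 0.5], [0.5, 0.5])
```

In the original method this loss is minimized by gradient descent from random initializations of the probe, which is the source of the initialization sensitivity that a closed-form eigenproblem sidesteps.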
[278] Finding Probably Approximate Optimal Solutions by Training to Estimate the Optimal Values of Subproblems
Nimrod Megiddo, Segev Wasserkrug, Orit Davidovich, Shimrit Shtern
Main category: cs.LG
TL;DR: A solver for maximizing real-valued functions of binary variables using an estimator trained on expected total deviation from optimality rather than objective function values.
Details
Motivation: To develop an efficient solver for binary optimization problems that doesn't require calculating policy values or relying on solved instances.
Method: Uses an algorithm that estimates optimal objective values from the distribution of objectives and sub-instances, trained with a loss function based on expected total deviation from optimality conditions.
Result: The approach enables training without needing to compute actual objective function values or use pre-solved instances.
Conclusion: The proposed method provides an alternative training approach for binary optimization solvers that avoids direct computation of objective values and reliance on solved instances.
Abstract: The paper is about developing a solver for maximizing a real-valued function of binary variables. The solver relies on an algorithm that estimates the optimal objective-function value of instances from the underlying distribution of objectives and their respective sub-instances. The training of the estimator is based on an inequality that facilitates the use of the expected total deviation from optimality conditions as a loss function rather than the objective-function itself. Thus, it does not calculate values of policies, nor does it rely on solved instances.
[279] Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models
Jucheng Shen, Yeonju Ro
Main category: cs.LG
TL;DR: One-Shot Dynamic Thresholding (OSDT) accelerates masked diffusion language model decoding by calibrating confidence thresholds on one sequence and reusing them for subsequent inputs, achieving significant speedups with minimal accuracy loss.
Details
Motivation: Current masked diffusion models use fixed steps and sequential unmasking, while recent parallel decoding methods rely on a static global confidence threshold even though confidence fluctuates strongly across blocks and steps. Within a dataset, confidence trajectories are nearly identical across inputs, suggesting that calibrated thresholds can be reused.
Method: OSDT calibrates dynamic confidence thresholds on a single input sequence and applies these pre-calibrated thresholds to subsequent inputs with negligible overhead, leveraging observed similarity in confidence trajectories.
Result: On GPQA, GSM8K, and HumanEval benchmarks, OSDT achieves superior accuracy-throughput trade-offs: +24% tokens/s on GSM8K at best accuracy, +45% on GPQA with comparable accuracy, and +50% on HumanEval with modest accuracy gap.
Conclusion: The findings demonstrate that reusable task-level confidence signatures can enable more efficient diffusion decoding, suggesting broader opportunities for algorithmic and systems innovations in this domain.
Abstract: Masked diffusion language models (MDLMs) are becoming competitive with their autoregressive counterparts but typically decode with fixed steps and sequential unmasking. To accelerate decoding, recent work such as Fast-dLLM enables parallel decoding via a static global confidence threshold, yet we observe strong block- and step-wise confidence fluctuations and, within a dataset, near-identical confidence trajectories across inputs as measured by cosine similarity. Motivated by these observations, we introduce One-Shot Dynamic Thresholding (OSDT), which calibrates thresholds on a single sequence and applies them to subsequent inputs with negligible overhead. On GPQA, GSM8K, and HumanEval, OSDT attains superior accuracy-throughput trade-offs (+24% tokens/s on GSM8K at the best accuracy, +45% on GPQA with comparable accuracy, and +50% on HumanEval with a modest accuracy gap). Beyond these results, our findings suggest broader opportunities to leverage reusable task-level confidence signatures for more general-purpose algorithmic and systems innovations in diffusion decoding.
[280] Energy Loss Functions for Physical Systems
Sékou-Oumar Kaba, Kusha Sareen, Daniel Levy, Siamak Ravanbakhsh
Main category: cs.LG
TL;DR: A framework that incorporates physical knowledge directly into loss functions for ML tasks on scientific systems, using energy-based formulations derived from thermal equilibrium assumptions.
Details
Motivation: Previous methods focused on architectural changes to incorporate physics, but this approach directly embeds physical insights into loss functions for better alignment with system behavior.
Method: Derives energy loss functions assuming data samples are in thermal equilibrium, using reverse KL divergence with Boltzmann distributions to obtain energy differences between data and predictions.
Result: Significant improvements over baselines in molecular generation and spin ground-state prediction tasks, with better gradient alignment and physical symmetry preservation.
Conclusion: The proposed energy-based loss functions provide physically grounded, architecture-agnostic, and computationally efficient alternatives to traditional objectives like MSE, enabling better ML performance on scientific systems.
Abstract: Effectively leveraging prior knowledge of a system’s physics is crucial for applications of machine learning to scientific domains. Previous approaches mostly focused on incorporating physical insights at the architectural level. In this paper, we propose a framework to leverage physical information directly into the loss function for prediction and generative modeling tasks on systems like molecules and spins. We derive energy loss functions assuming that each data sample is in thermal equilibrium with respect to an approximate energy landscape. By using the reverse KL divergence with a Boltzmann distribution around the data, we obtain the loss as an energy difference between the data and the model predictions. This perspective also recasts traditional objectives like MSE as energy-based, but with a physically meaningless energy. In contrast, our formulation yields physically grounded loss functions with gradients that better align with valid configurations, while being architecture-agnostic and computationally efficient. The energy loss functions also inherently respect physical symmetries. We demonstrate our approach on molecular generation and spin ground-state prediction and report significant improvements over baselines.
[281] Natural Building Blocks for Structured World Models: Theory, Evidence, and Scaling
Lancelot Da Costa, Sanjeev Namjoshi, Mohammed Abbas Ansari, Bernhard Schölkopf
Main category: cs.LG
TL;DR: Proposes a modular framework for structured world models using discrete (HMMs) and continuous (sLDS) building blocks, supporting both passive modeling and active control while maintaining interpretability.
Details
Motivation: Address fragmentation in world modeling field where researchers develop bespoke architectures that rarely build upon each other, aiming to create standardized building blocks similar to how layers enabled progress in deep learning.
Method: Uses Hidden Markov Models (HMMs) for discrete processes and switching linear dynamical systems (sLDS) for continuous processes as natural building blocks, with hierarchical composition. Avoids combinatorial explosion by fixing causal architecture and searching over only four depth parameters.
Result: Achieves competitive performance to neural approaches in multimodal generative modeling (passive) and planning from pixels (active) while maintaining interpretability. Practical expressiveness demonstrated through real-world applications.
Conclusion: The modular approach provides foundational infrastructure for world modeling, but scalable joint structure-parameter learning remains the core outstanding challenge. If solved, these building blocks could enable standardized progress similar to deep learning layers.
Abstract: The field of world modeling is fragmented, with researchers developing bespoke architectures that rarely build upon each other. We propose a framework that specifies the natural building blocks for structured world models based on the fundamental stochastic processes that any world model must capture: discrete processes (logic, symbols) and continuous processes (physics, dynamics); the world model is then defined by the hierarchical composition of these building blocks. We examine Hidden Markov Models (HMMs) and switching linear dynamical systems (sLDS) as natural building blocks for discrete and continuous modeling–which become partially-observable Markov decision processes (POMDPs) and controlled sLDS when augmented with actions. This modular approach supports both passive modeling (generation, forecasting) and active control (planning, decision-making) within the same architecture. We avoid the combinatorial explosion of traditional structure learning by largely fixing the causal architecture and searching over only four depth parameters. We review practical expressiveness through multimodal generative modeling (passive) and planning from pixels (active), with performance competitive to neural approaches while maintaining interpretability. The core outstanding challenge is scalable joint structure-parameter learning; current methods finesse this by cleverly growing structure and parameters incrementally, but are limited in their scalability. If solved, these natural building blocks could provide foundational infrastructure for world modeling, analogous to how standardized layers enabled progress in deep learning.
[282] Uncertainty Guided Online Ensemble for Non-stationary Data Streams in Fusion Science
Kishansingh Rajput, Malachi Schram, Brian Sammuli, Sen Lin
Main category: cs.LG
TL;DR: Online learning cuts prediction error by 80% relative to a static model on non-stationary fusion data, and an uncertainty-guided online ensemble reduces error by about a further 10% over single-model online learning.
Details
Motivation: Fusion data exhibits non-stationary behavior due to experimental evolution and machine wear-and-tear, causing ML models to fail when assuming stationary distributions. Online learning techniques are needed but largely unexplored for fusion applications.
Method: Applied online learning to adapt to drifting data streams for TF coil deflection prediction. Proposed uncertainty-guided online ensemble using Deep Gaussian Process Approximation (DGPA) for calibrated uncertainty estimation, guiding a meta-algorithm with ensemble learners trained on different historical data horizons.
Result: Online learning reduced error by 80% compared to static models. The uncertainty-guided online ensemble further improved performance, reducing prediction errors by about 10% over standard single-model online learning, while also providing uncertainty estimates for decision makers.
Conclusion: Online learning is critical for maintaining ML model performance in fusion applications with non-stationary data. The proposed uncertainty-guided ensemble method provides additional performance improvements and uncertainty quantification for better decision-making.
Abstract: Machine Learning (ML) is poised to play a pivotal role in the development and operation of next-generation fusion devices. Fusion data shows non-stationary behavior with distribution drifts, resulting from both experimental evolution and machine wear-and-tear. ML models assume a stationary distribution and fail to maintain performance when they encounter such non-stationary data streams. Online learning techniques have been leveraged in other domains; however, they have been largely unexplored for fusion applications. In this paper, we present an application of online learning to continuously adapt to a drifting data stream for prediction of Toroidal Field (TF) coil deflection at the DIII-D fusion facility. The results demonstrate that online learning is critical to maintaining ML model performance and reduces error by 80% compared to a static model. Moreover, traditional online learning can suffer from short-term performance degradation, as ground truth is not available before making the predictions. As such, we propose an uncertainty-guided online ensemble method to further improve the performance. The Deep Gaussian Process Approximation (DGPA) technique is leveraged for calibrated uncertainty estimation, and the uncertainty values are then used to guide a meta-algorithm that produces predictions based on an ensemble of learners trained on different horizons of historical data. The DGPA also provides uncertainty estimation along with the predictions for decision makers. The online ensemble and the proposed uncertainty-guided online ensemble reduce prediction error by about 6% and 10%, respectively, over standard single-model online learning.
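The abstract does not specify how the meta-algorithm uses the uncertainty values. One common scheme, shown here purely as an assumption and not as the authors' method, is inverse-variance weighting: each learner's prediction counts in proportion to how certain that learner is. The function name and the `eps` guard are hypothetical.

```python
from typing import Sequence


def combine_by_uncertainty(predictions: Sequence[float],
                           variances: Sequence[float],
                           eps: float = 1e-12) -> float:
    """Inverse-variance weighted average of ensemble predictions."""
    weights = [1.0 / (v + eps) for v in variances]  # eps guards against zero variance
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Two learners trained on different history horizons (illustrative numbers only).
equal = combine_by_uncertainty([1.0, 3.0], [1.0, 1.0])
skewed = combine_by_uncertainty([1.0, 3.0], [1.0, 1e9])
```

With equal variances the result is the plain average; when one learner reports a very large variance, its prediction is effectively ignored.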
[283] Can LLMs subtract numbers?
Mayank Jobanputra, Nils Philipp Walter, Maitrey Mehta, Blerta Veseli, Evan Parker Kelly Chapple, Yifan Wang, Sneha Chetani, Ellie Pavlick, Antonio Vergari, Vera Demberg
Main category: cs.LG
TL;DR: LLMs perform poorly on subtraction compared to addition, especially when subtracting larger numbers from smaller ones, often omitting the negative sign despite internally knowing the result should be negative.
Details
Motivation: Subtraction has received little attention in LLM benchmarks despite being structurally distinct as a non-commutative operation, and its performance gaps compared to addition need systematic investigation.
Method: Evaluated eight pretrained LLMs from four families on addition and subtraction problems, conducted probing analyses, and tested few-shot learning and instruction-tuning techniques.
Result: Subtraction accuracy lags significantly behind addition; errors concentrate in cases where a<b, with LLMs producing correct magnitude but omitting negative sign; instruction-tuned models achieve near-perfect accuracy in generating negative signs.
Conclusion: LLMs have limitations in subtraction arithmetic but these can be recovered through techniques like instruction-tuning, providing clearer characterization of their arithmetic capabilities.
Abstract: We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that the errors for ($a-b$) are concentrated in cases where ($a<b$). In such cases, LLMs frequently produce the correct magnitude but omit the negative sign. Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs. We further test well-known techniques such as few-shot learning and instruction-tuning to see if they can improve the LLMs’ performance. Our results suggest that while few-shot prompting yields modest gains, the instruction-tuned models achieve near-perfect accuracies in generating the negative sign. Together, these findings provide a clearer characterization of the limitations and recoverability of LLMs’ arithmetic capabilities in subtraction.
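The error pattern described above (correct magnitude, missing negative sign when a < b) can be separated from other mistakes with a small scoring helper. This is a hypothetical evaluation sketch, not the authors' code; the category names are assumptions.

```python
def classify_subtraction(a: int, b: int, model_output: str) -> str:
    """Label a model's answer to a - b as correct, sign_omitted, other_error, or unparseable."""
    truth = a - b
    try:
        # Accept the Unicode minus sign as well as the ASCII hyphen.
        pred = int(model_output.strip().replace("\u2212", "-"))
    except ValueError:
        return "unparseable"
    if pred == truth:
        return "correct"
    if truth < 0 and pred == -truth:
        return "sign_omitted"  # right magnitude, negative sign dropped
    return "other_error"


labels = [
    classify_subtraction(3, 8, "-5"),
    classify_subtraction(3, 8, "5"),
    classify_subtraction(3, 8, "6"),
    classify_subtraction(3, 8, "five"),
]
```

Tallying these labels separately for a > b and a < b would show whether errors concentrate in the negative-result cases.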
[284] Geometric Data Valuation via Leverage Scores
Rodrigo Mendoza-Smith
Main category: cs.LG
TL;DR: The paper proposes geometric leverage scores as a computationally efficient alternative to Shapley data valuation, showing they satisfy key axioms and provide theoretical guarantees for model quality.
Details
Motivation: Shapley data valuation is computationally infeasible at scale due to its combinatorial nature, requiring evaluation of all data subsets.
Method: Proposes geometric leverage scores that measure structural influence in representation space, and extends to ridge leverage scores for positive marginal gains.
Result: Leverage scores satisfy Shapley axioms (dummy, efficiency, symmetry), and training on leverage-sampled subsets produces models within O(ε) of full-data optimum.
Conclusion: Ridge-leverage sampling provides efficient data valuation with theoretical guarantees and outperforms baselines in active learning without requiring gradients.
Abstract: Shapley data valuation provides a principled, axiomatic framework for assigning importance to individual datapoints, and has gained traction in dataset curation, pruning, and pricing. However, it is a combinatorial measure that requires evaluating marginal utility across all subsets of the data, making it computationally infeasible at scale. We propose a geometric alternative based on statistical leverage scores, which quantify each datapoint’s structural influence in the representation space by measuring how much it extends the span of the dataset and contributes to the effective dimensionality of the training problem. We show that our scores satisfy the dummy, efficiency, and symmetry axioms of Shapley valuation and that extending them to ridge leverage scores yields strictly positive marginal gains that connect naturally to classical A- and D-optimal design criteria. We further show that training on a leverage-sampled subset produces a model whose parameters and predictive risk are within $O(\varepsilon)$ of the full-data optimum, thereby providing a rigorous link between data valuation and downstream decision quality. Finally, we conduct an active learning experiment in which we empirically demonstrate that ridge-leverage sampling outperforms standard baselines without requiring access to gradients or backward passes.
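For readers unfamiliar with leverage scores, the sketch below computes the textbook quantity x_i^T (X^T X + λI)^{-1} x_i in pure Python. It assumes the standard definition of (ridge) leverage scores; the paper's exact normalization is not given in this summary, and the helper names `solve` and `ridge_leverage_scores` are made up for this example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def ridge_leverage_scores(X, lam=0.0):
    """Return x_i^T (X^T X + lam * I)^{-1} x_i for every row x_i of X."""
    d = len(X[0])
    G = [[sum(row[a] * row[b] for row in X) + (lam if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    return [sum(xi * zi for xi, zi in zip(x, solve(G, x))) for x in X]


# Toy feature matrix (illustrative values only); the last row is the zero vector.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
plain = ridge_leverage_scores(X)            # lam = 0: classical leverage scores
ridge = ridge_leverage_scores(X, lam=1.0)   # lam > 0: ridge leverage scores
```

With λ = 0 and a full-column-rank X, the scores sum to the number of columns, and an all-zero row scores zero; these two standard facts echo the efficiency and dummy axioms mentioned in the abstract.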
[285] In Good GRACEs: Principled Teacher Selection for Knowledge Distillation
Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Sham Kakade, Surbhi Goel
Main category: cs.LG
TL;DR: GRACE is a lightweight score that predicts teacher effectiveness for knowledge distillation without needing trial-and-error, achieving up to 86% correlation with student performance.
Details
Motivation: Traditional knowledge distillation requires expensive trial-and-error to find optimal teacher-student combinations, which is inefficient and resource-intensive.
Method: GRACE measures distributional properties of student gradients without requiring teacher logits, internals, or test data, connecting to information-theoretic leave-one-out stability.
Result: GRACE achieves up to 86% Spearman correlation with distilled student performance on GSM8K and MATH, improving performance by up to 7.4% over naive teacher selection.
Conclusion: GRACE efficiently identifies compatible teachers and provides fine-grained guidance for distillation design choices including temperature settings, size constraints, and model family selection.
Abstract: Knowledge distillation is an efficient strategy to use data generated by large “teacher” language models to train smaller capable “student” models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student’s gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
[286] Measuring the Intrinsic Dimension of Earth Representations
Arjun Rao, Marc Rußwurm, Konstantin Klemmer, Esther Rolf
Main category: cs.LG
TL;DR: This paper analyzes the intrinsic dimensionality of geographic Implicit Neural Representations (INRs) for Earth observation, finding they typically have 2-10 dimensions and that this metric correlates with downstream performance and spatial artifacts.
Details
Motivation: Despite geographic INRs being used to distill Earth's data into compact representations, there's limited understanding of how much information they actually contain and where it's concentrated.
Method: The study analyzes the intrinsic dimension of geographic INRs using datasets with ambient dimensions between 256-512, examining how spatial resolution and input modalities affect dimensionality.
Result: Geographic INRs have intrinsic dimensions between 2-10, which are sensitive to spatial resolution and input modalities. The intrinsic dimension correlates with downstream task performance and can capture spatial artifacts.
Conclusion: Intrinsic dimension provides an architecture-agnostic, label-free metric for evaluating information content in geographic INRs, enabling unsupervised evaluation, model selection, and pre-training design.
Abstract: Within the context of representation learning for Earth observation, geographic Implicit Neural Representations (INRs) embed low-dimensional location inputs (longitude, latitude) into high-dimensional embeddings, through models trained on geo-referenced satellite, image or text data. Despite the common aim of geographic INRs to distill Earth’s data into compact, learning-friendly representations, we lack an understanding of how much information is contained in these Earth representations, and where that information is concentrated. The intrinsic dimension of a dataset measures the number of degrees of freedom required to capture its local variability, regardless of the ambient high-dimensional space in which it is embedded. This work provides the first study of the intrinsic dimensionality of geographic INRs. Analyzing INRs with ambient dimension between 256 and 512, we find that their intrinsic dimensions fall roughly between 2 and 10 and are sensitive to changing spatial resolution and input modalities during INR pre-training. Furthermore, we show that the intrinsic dimension of a geographic INR correlates with downstream task performance and can capture spatial artifacts, facilitating model evaluation and diagnostics. More broadly, our work offers an architecture-agnostic, label-free metric of information content that can enable unsupervised evaluation, model selection, and pre-training design across INRs.
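To make "intrinsic dimension" concrete, here is a minimal sketch of one widely used estimator, TwoNN, which uses the ratio of each point's second to first nearest-neighbor distance. This summary does not say which estimators the paper uses, so TwoNN is an assumption chosen for illustration, and the synthetic data (a 2-D sheet linearly embedded in 8 ambient dimensions) is invented for this example.

```python
import math
import random


def two_nn_intrinsic_dimension(points):
    """Maximum-likelihood TwoNN estimate: d = N / sum(log(r2 / r1))."""
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]  # first and second nearest-neighbor distances
        if r1 > 0:  # skip exact duplicates
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


def embed(u, v):
    """Linearly embed a 2-D latent point into 8 ambient dimensions."""
    return [u, v, u + v, u - v, 0.5 * u, 0.5 * v, u + 2 * v, 2 * u - v]


random.seed(0)
cloud = [embed(random.random(), random.random()) for _ in range(400)]
estimate = two_nn_intrinsic_dimension(cloud)  # should land near 2, not 8
```

The estimator only looks at local neighbor distances, which is why it can recover a low intrinsic dimension regardless of how large the ambient embedding is.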
[287] Matrix Sensing with Kernel Optimal Loss: Robustness and Optimization Landscape
Xinyuan Song, Jiaye Teng, Ziye Ma
Main category: cs.LG
TL;DR: The paper studies how robust loss functions based on nonparametric regression improve optimization landscape and robustness in noisy matrix sensing compared to MSE loss, especially for non-Gaussian or heavy-tailed noise.
Details
Motivation: MSE loss is unreliable for non-Gaussian or heavy-tailed noise in regression tasks, motivating the need for more robust loss functions that remain stable under general noise settings.
Method: Adopt a robust loss based on nonparametric regression using kernel-based residual density estimation and maximizing estimated log-likelihood. Analyze optimization landscape through RIP constants for spurious local minima.
Result: The robust loss excels at handling large noise and remains robust across diverse noise distributions, coinciding with MSE under Gaussian errors but performing better in general settings.
Conclusion: Simply changing the loss function can enhance robustness in machine learning tasks, with the proposed robust loss offering broad applicability across various noise distributions.
Abstract: In this paper we study how the choice of loss functions of non-convex optimization problems affects their robustness and optimization landscape, through the study of noisy matrix sensing. In traditional regression tasks, mean squared error (MSE) loss is a common choice, but it can be unreliable for non-Gaussian or heavy-tailed noise. To address this issue, we adopt a robust loss based on nonparametric regression, which uses a kernel-based estimate of the residual density and maximizes the estimated log-likelihood. This robust formulation coincides with the MSE loss under Gaussian errors but remains stable under more general settings. We further examine how this robust loss reshapes the optimization landscape by analyzing the upper-bound of restricted isometry property (RIP) constants for spurious local minima to disappear. Through theoretical and empirical analysis, we show that this new loss excels at handling large noise and remains robust across diverse noise distributions. This work offers initial insights into enhancing the robustness of machine learning tasks through simply changing the loss, guided by an intuitive and broadly applicable analytical framework.
[288] Variance-Aware Feel-Good Thompson Sampling for Contextual Bandits
Xuheng Li, Quanquan Gu
Main category: cs.LG
TL;DR: FGTSVA is a variance-aware Thompson Sampling algorithm for contextual bandits that achieves optimal regret bounds for general reward functions, addressing limitations of prior variance-dependent regret studies focused mainly on UCB methods.
Details
Motivation: Most variance-dependent regret studies focus on UCB-based algorithms, while Thompson sampling methods are understudied. The only existing variance-aware Thompson sampling (LinVDTS) is limited to linear rewards and has suboptimal dimension-dependent regret bounds.
Method: FGTSVA extends the decoupling coefficient technique from Feel-good Thompson sampling (FGTS) to create a variance-aware algorithm. It uses a new decoupling coefficient (dc) that reflects model space complexity.
Result: FGTSVA achieves regret of Õ(√(dc·log|F|·∑_{t=1}^T σ_t²) + dc), where |F| is the size of the model space, T is the total number of rounds, and σ_t² is the noise variance at round t. In linear contextual bandits, it matches UCB-based algorithms using weighted linear regression.
Conclusion: FGTSVA provides the first variance-aware Thompson sampling algorithm with optimal regret bounds for general reward functions, bridging the gap between UCB and Thompson sampling approaches in variance-dependent regret analysis.
Abstract: Variance-dependent regret bounds have received increasing attention in recent studies on contextual bandits. However, most of these studies are focused on upper confidence bound (UCB)-based bandit algorithms, while sampling-based bandit algorithms such as Thompson sampling are still understudied. The only exception is the LinVDTS algorithm (Xu et al., 2023), which is limited to linear reward functions and whose regret bound is not optimal with respect to the model dimension. In this paper, we present FGTSVA, a variance-aware Thompson Sampling algorithm for contextual bandits with general reward functions that attains an optimal regret bound. At the core of our analysis is an extension of the decoupling coefficient, a technique commonly used in the analysis of Feel-good Thompson sampling (FGTS) that reflects the complexity of the model space. With the new decoupling coefficient denoted by $\mathrm{dc}$, FGTSVA achieves the regret of $\tilde{O}(\sqrt{\mathrm{dc}\cdot\log|\mathcal{F}|\sum_{t=1}^T\sigma_t^2}+\mathrm{dc})$, where $|\mathcal{F}|$ is the size of the model space, $T$ is the total number of rounds, and $\sigma_t^2$ is the subgaussian norm of the noise (e.g., variance when the noise is Gaussian) at round $t$. In the setting of contextual linear bandits, the regret bound of FGTSVA matches that of UCB-based algorithms using weighted linear regression (Zhou and Gu, 2022).
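Two special cases show what "variance-aware" buys in the bound quoted above. The following is a worked specialization obtained only by substituting into the stated formula; the symbol $R_T$ for the cumulative regret is notation introduced here, not taken from the abstract.

```latex
% Bound as stated in the abstract:
R_T = \tilde{O}\!\left(\sqrt{\mathrm{dc}\cdot\log|\mathcal{F}|\sum_{t=1}^{T}\sigma_t^2}+\mathrm{dc}\right)

% Constant noise level, \sigma_t = \sigma for all t:
\sum_{t=1}^{T}\sigma_t^2 = T\sigma^2
\;\Longrightarrow\;
R_T = \tilde{O}\!\left(\sigma\sqrt{\mathrm{dc}\cdot\log|\mathcal{F}|\cdot T}+\mathrm{dc}\right)

% Noiseless rewards, \sigma_t = 0 for all t:
R_T = \tilde{O}(\mathrm{dc})
```

In the constant-noise case the familiar $\sqrt{T}$ rate is recovered, scaled by $\sigma$, while in the noiseless case the $T$-dependent term vanishes and only the additive $\mathrm{dc}$ term remains (up to logarithmic factors).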
[289] QuPCG: Quantum Convolutional Neural Network for Detecting Abnormal Patterns in PCG Signals
Yasaman Torabi, Shahram Shirani, James P. Reilly
Main category: cs.LG
TL;DR: Hybrid quantum-classical CNN achieves 93.33% accuracy in classifying S3 and murmur abnormalities from heart sounds using compressed 8-pixel images requiring only 8 qubits.
Details
Motivation: Early identification of abnormal physiological patterns is essential for timely detection of cardiac disease, particularly in resource-constrained healthcare environments.
Method: Transform 1D phonocardiogram signals into compact 2D images using wavelet feature extraction and adaptive threshold compression, then use hybrid quantum-classical CNN with only 8 qubits for quantum stage.
Result: 93.33% classification accuracy on test set and 97.14% on train set using HLS-CMDS dataset, demonstrating quantum models can efficiently capture temporal-spectral correlations.
Conclusion: First application of QCNN for bioacoustic signal processing, representing early step toward quantum-enhanced diagnostic systems for healthcare.
Abstract: Early identification of abnormal physiological patterns is essential for the timely detection of cardiac disease. This work introduces a hybrid quantum-classical convolutional neural network (QCNN) designed to classify S3 and murmur abnormalities in heart sound signals. The approach transforms one-dimensional phonocardiogram (PCG) signals into compact two-dimensional images through a combination of wavelet feature extraction and adaptive threshold compression methods. We compress the cardiac-sound patterns into an 8-pixel image so that only 8 qubits are needed for the quantum stage. Preliminary results on the HLS-CMDS dataset demonstrate 93.33% classification accuracy on the test set and 97.14% on the train set, suggesting that quantum models can efficiently capture temporal-spectral correlations in biomedical signals. To our knowledge, this is the first application of a QCNN algorithm for bioacoustic signal processing. The proposed method represents an early step toward quantum-enhanced diagnostic systems for resource-constrained healthcare environments.
[290] Disentangling Causal Substructures for Interpretable and Generalizable Drug Synergy Prediction
Yi Luo, Haochen Zhao, Xiao Liang, Yiwei Liu, Yuye Zhang, Xinyu Li, Jianxin Wang
Main category: cs.LG
TL;DR: CausalDDS is a novel framework that disentangles drug molecules into causal and spurious substructures to predict drug synergy, improving accuracy and interpretability through causal inference.
Details
Motivation: Existing drug synergy prediction methods operate as black-box predictors relying on statistical correlations, lacking interpretability and robustness to spurious features.
Method: Disentangles drug molecules into causal and spurious substructures, uses causal representations for prediction, employs conditional intervention mechanism conditioned on paired molecular structures, and introduces optimization objective based on sufficiency and independence principles.
Result: Outperforms baseline models, especially in cold start and out-of-distribution settings, and effectively identifies key substructures underlying drug synergy.
Conclusion: CausalDDS shows strong potential as a practical tool for predicting drug synergy and facilitating drug discovery by providing molecular-level insights into how drug combinations work.
Abstract: Drug synergy prediction is a critical task in the development of effective combination therapies for complex diseases, including cancer. Although existing methods have shown promising results, they often operate as black-box predictors that rely predominantly on statistical correlations between drug characteristics and results. To address this limitation, we propose CausalDDS, a novel framework that disentangles drug molecules into causal and spurious substructures, utilizing the causal substructure representations for predicting drug synergy. By focusing on causal sub-structures, CausalDDS effectively mitigates the impact of redundant features introduced by spurious substructures, enhancing the accuracy and interpretability of the model. In addition, CausalDDS employs a conditional intervention mechanism, where interventions are conditioned on paired molecular structures, and introduces a novel optimization objective guided by the principles of sufficiency and independence. Extensive experiments demonstrate that our method outperforms baseline models, particularly in cold start and out-of-distribution settings. Besides, CausalDDS effectively identifies key substructures underlying drug synergy, providing clear insights into how drug combinations work at the molecular level. These results underscore the potential of CausalDDS as a practical tool for predicting drug synergy and facilitating drug discovery.
[291] CFL: On the Use of Characteristic Function Loss for Domain Alignment in Machine Learning
Abdullah Almansour, Ozan Tonguz
Main category: cs.LG
TL;DR: This paper proposes using Characteristic Function as a frequency domain approach to measure distribution shift in high-dimensional spaces for domain adaptation, addressing ML model underperformance due to distribution shifts in real-world deployment.
Details
Motivation: ML models often underperform when deployed in real-world due to distribution shift problems, which can lead to catastrophic outcomes in high-risk applications. Current statistical techniques have limitations in high-dimensional spaces.
Method: The authors use Characteristic Function (CF) as a frequency domain approach to quantify distribution shift, providing an alternative to traditional statistical techniques like Kullback-Leibler, Kolmogorov-Smirnov Test, and Wasserstein distance.
Result: The paper demonstrates that Characteristic Function is a powerful alternative for measuring distribution shift in high-dimensional space and for domain adaptation.
Conclusion: Characteristic Function in the frequency domain offers an effective approach for addressing distribution shift problems in machine learning, particularly in high-dimensional scenarios where traditional statistical methods may be insufficient.
Abstract: Machine Learning (ML) models are extensively used in various applications due to their significant advantages over traditional learning methods. However, the developed ML models often underperform when deployed in the real world due to the well-known distribution shift problem. This problem can lead to catastrophic outcomes when these decision-making systems have to operate in high-risk applications. Many researchers have previously studied the distribution shift problem in ML using statistical techniques (such as Kullback-Leibler divergence, the Kolmogorov-Smirnov test, Wasserstein distance, etc.) to quantify the shift. In this letter, we show that using the Characteristic Function (CF) as a frequency domain approach is a powerful alternative for measuring the distribution shift in high-dimensional space and for domain adaptation.
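As a rough illustration of comparing distributions in the frequency domain, the sketch below evaluates the empirical characteristic function φ(t) = mean(exp(i·t·x)) of two samples at random frequencies and averages the squared gap. The summary does not give the paper's actual loss, so the distance definition, the Gaussian frequency sampling, and the names `empirical_cf` and `cf_distance` are all assumptions made for this example.

```python
import cmath
import random


def empirical_cf(samples, t):
    """Empirical characteristic function of d-dimensional samples at frequency t."""
    total = sum(cmath.exp(1j * sum(ti * xi for ti, xi in zip(t, x))) for x in samples)
    return total / len(samples)


def cf_distance(xs, ys, freqs):
    """Mean squared gap between two empirical CFs over the given frequencies."""
    gaps = (abs(empirical_cf(xs, t) - empirical_cf(ys, t)) ** 2 for t in freqs)
    return sum(gaps) / len(freqs)


random.seed(0)
dim = 3
source = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
same = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
shifted = [[v + 2.0 for v in x] for x in source]           # mean-shifted copy
freqs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(64)]

d_same = cf_distance(source, same, freqs)        # sampling noise only
d_shifted = cf_distance(source, shifted, freqs)  # genuine distribution shift
```

Because the characteristic function always exists and is bounded, a gap like this is cheap to evaluate at any chosen set of frequencies, which is one reason it is attractive as an alignment signal in higher dimensions.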
[292] ProtoTSNet: Interpretable Multivariate Time Series Classification With Prototypical Parts
Bartłomiej Małkus, Szymon Bobek, Grzegorz J. Nalepa
Main category: cs.LG
TL;DR: ProtoTSNet is an interpretable classification method for multivariate time series that enhances ProtoPNet with group convolutions to capture dynamic patterns and feature importance, achieving competitive performance with explainable and non-explainable baselines.
Details
Motivation: Time series data in critical domains like industry and medicine requires both high accuracy and interpretability due to significant decision consequences, but existing methods struggle with time series-specific challenges like dynamic patterns and varying feature significance.
Method: Enhanced ProtoPNet architecture with modified convolutional encoder using group convolutions, pre-trainable as part of an autoencoder to preserve and quantify feature importance, specifically designed for time series analysis.
Result: Evaluation on 30 multivariate time series datasets from UEA archive shows best performance among ante-hoc explainable methods and competitive performance with non-explainable and post-hoc explainable approaches.
Conclusion: ProtoTSNet provides interpretable results accessible to domain experts while maintaining competitive classification performance, addressing the unique challenges of time series analysis.
Abstract: Time series data is one of the most popular data modalities in critical domains such as industry and medicine. The demand for algorithms that not only exhibit high accuracy but also offer interpretability is crucial in such fields, as decisions made there bear significant consequences. In this paper, we present ProtoTSNet, a novel approach to interpretable classification of multivariate time series data, through substantial enhancements to the ProtoPNet architecture. Our method is tailored to overcome the unique challenges of time series analysis, including capturing dynamic patterns and handling varying feature significance. Central to our innovation is a modified convolutional encoder utilizing group convolutions, pre-trainable as part of an autoencoder and designed to preserve and quantify feature importance. We evaluated our model on 30 multivariate time series datasets from the UEA archive, comparing our approach with existing explainable methods as well as non-explainable baselines. Through comprehensive evaluation and ablation studies, we demonstrate that our approach achieves the best performance among ante-hoc explainable methods while maintaining competitive performance with non-explainable and post-hoc explainable approaches, providing interpretable results accessible to domain experts.
[293] Tackling Incomplete Data in Air Quality Prediction: A Bayesian Deep Learning Framework for Uncertainty Quantification
Yuzhuang Pian, Taiyu Wang, Shiqi Zhang, Rui Xu, Yonghong Liu
Main category: cs.LG
TL;DR: CGLUBNF is an end-to-end framework for accurate air quality forecasting that handles incomplete spatiotemporal data using Bayesian neural fields with graph attention and channel gating.
Details
Motivation: Observational data often has missing values due to collection/transmission issues, which impedes reliable air quality inference and can lead to overconfident extrapolation.
Method: Uses Fourier features with graph attention encoder for spatial dependencies, channel gated learning unit with learnable activations for feature filtering, and Bayesian inference for uncertainty quantification.
Result: Superior prediction accuracy and sharper confidence intervals compared to 5 state-of-the-art baselines across 4 missing data patterns on two real-world datasets.
Conclusion: Provides foundation for reliable deep learning-based spatiotemporal forecasting with incomplete observations, applicable to emerging sensing paradigms like vehicle-borne mobile monitoring.
Abstract: Accurate air quality forecasts are vital for public health alerts, exposure assessment, and emissions control. In practice, observational data are often missing in varying proportions and patterns due to collection and transmission issues. These incomplete spatiotemporal records impede reliable inference and risk assessment and can lead to overconfident extrapolation. To address these challenges, we propose an end-to-end framework, the channel gated learning unit based spatiotemporal Bayesian neural field (CGLUBNF). It uses Fourier features with a graph attention encoder to capture multiscale spatial dependencies and seasonal temporal dynamics. A channel gated learning unit, equipped with learnable activations and gated residual connections, adaptively filters and amplifies informative features. Bayesian inference jointly optimizes predictive distributions and parameter uncertainty, producing point estimates and calibrated prediction intervals. We conduct a systematic evaluation on two real-world datasets, covering four typical missing data patterns and comparing against five state-of-the-art baselines. CGLUBNF achieves superior prediction accuracy and sharper confidence intervals. In addition, we validate robustness across multiple prediction horizons and analyze the contribution of extraneous variables. This research lays a foundation for reliable deep learning based spatiotemporal forecasting with incomplete observations in emerging sensing paradigms, such as real-world vehicle-borne mobile monitoring.
[294] OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning
Kevin Valencia, Thilina Balasooriya, Xihaier Luo, Shinjae Yoo, David Keetae Park
Main category: cs.LG
TL;DR: OmniField is a multimodal spatiotemporal learning framework that handles sparse, irregular, noisy data with varying modality availability through continuous neural fields and cross-modal fusion.
Details
Motivation: Real-world experimental data faces challenges with sparse, irregular measurements and varying modality availability across space and time, limiting usable records unless models can adapt to arbitrary modality subsets.
Method: Proposes OmniField with multimodal crosstalk block architecture and iterative cross-modal refinement that aligns signals before decoding, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding.
Result: OmniField consistently outperforms eight strong multimodal spatiotemporal baselines and maintains performance close to clean-input levels under heavy simulated sensor noise.
Conclusion: The framework demonstrates robust handling of corrupted measurements and effective cross-modal learning for spatiotemporal data with varying modality availability.
Abstract: Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.
[295] Learning Interactive World Model for Object-Centric Reinforcement Learning
Fan Feng, Phillip Lippe, Sara Magliacane
Main category: cs.LG
TL;DR: FIOC-WM is a framework that learns structured representations of both objects and their interactions in world models, improving sample efficiency and generalization for policy learning through modular interaction learning.
Details
Motivation: Most object-centric RL methods focus on individual objects while leaving interactions implicit, limiting robustness and transferability of learned policies.
Method: Learns object-centric latents and interaction structure from pixels using pre-trained vision encoders, then decomposes tasks into composable interaction primitives with a hierarchical policy (high-level selects interaction types/order, low-level executes them).
Result: Improves policy-learning sample efficiency and generalization over world-model baselines on simulated robotic and embodied-AI benchmarks.
Conclusion: Explicit, modular interaction learning is crucial for robust control in object-centric reinforcement learning.
Abstract: Agents that understand objects and their interactions can learn policies that are more robust and transferable. However, most object-centric RL methods factor state by individual objects while leaving interactions implicit. We introduce the Factored Interactive Object-Centric World Model (FIOC-WM), a unified framework that learns structured representations of both objects and their interactions within a world model. FIOC-WM captures environment dynamics with disentangled and modular representations of object interactions, improving sample efficiency and generalization for policy learning. Concretely, FIOC-WM first learns object-centric latents and an interaction structure directly from pixels, leveraging pre-trained vision encoders. The learned world model then decomposes tasks into composable interaction primitives, and a hierarchical policy is trained on top: a high level selects the type and order of interactions, while a low level executes them. On simulated robotic and embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and generalization over world-model baselines, indicating that explicit, modular interaction learning is crucial for robust control.
[296] Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun
Main category: cs.LG
TL;DR: A framework for dynamically re-routing token-to-expert mapping in Mixture-of-Experts LLMs to reduce decode latency while maintaining quality, achieving 39% and 15% latency reductions on Qwen3 models.
Details
Motivation: MoE models become memory-bound during autoregressive generation because expert load grows slowly, making latency governed by the number of activated experts rather than computation.
Method: Batch-aware routing that allows tokens to piggyback experts already loaded in memory from other tokens in the same batch, dynamically re-routing token-to-expert mapping.
Result: 39% latency reduction on Qwen3-30B and 15% on Qwen3-235B with batch size 16, without statistically significant accuracy loss.
Conclusion: Dynamic expert routing can significantly reduce MoE decode latency while preserving model quality, making MoE models more efficient during inference.
Abstract: An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures where the feed-forward layer is replaced by a pool of experts and each token only activates a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even for moderate batch sizes because the average expert load grows more slowly than in an equivalent dense feedforward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing token-to-expert mapping to lower this number (and thus, the decode latency) while preserving a comparable quality. Our best results use a batch-aware routing that works by having tokens piggyback experts that have already been loaded into memory due to being crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach achieves latency reductions of $39\%$ and $15\%$ in the MoE layer decode latency, respectively.
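The piggybacking idea can be sketched in a few lines. The version below is one plausible reading, not the paper's algorithm: each token keeps its `k_must` highest-scoring experts, the union of those across the batch is treated as "already loaded", and each token fills its remaining slots only from that loaded set. The function names, the `k_must` parameter, and the router scores are all hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest router scores (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda e: -scores[e])[:k]


def vanilla_routing(batch_scores, k):
    """Standard per-token top-k routing, ignoring the rest of the batch."""
    return [top_k(s, k) for s in batch_scores]


def piggyback_routing(batch_scores, k, k_must):
    """Keep k_must experts per token, then reuse experts other tokens loaded."""
    must = [top_k(s, k_must) for s in batch_scores]
    loaded = sorted({e for m in must for e in m})  # experts that must be in memory
    routes = []
    for s, m in zip(batch_scores, must):
        extras = sorted((e for e in loaded if e not in m), key=lambda e: -s[e])
        # A token may end up with fewer than k experts if the loaded set is small.
        routes.append(m + extras[: k - k_must])
    return routes


def activated(routes):
    """Set of distinct experts touched by the whole batch."""
    return {e for r in routes for e in r}


# Hypothetical router scores for 3 tokens over 6 experts.
batch = [
    [0.50, 0.30, 0.10, 0.05, 0.03, 0.02],
    [0.10, 0.05, 0.50, 0.30, 0.03, 0.02],
    [0.30, 0.05, 0.02, 0.03, 0.40, 0.20],
]
n_vanilla = len(activated(vanilla_routing(batch, k=2)))
n_piggyback = len(activated(piggyback_routing(batch, k=2, k_must=1)))
```

On this toy batch, plain top-2 routing touches five distinct experts while the piggyback variant touches three; since decode latency in the memory-bound regime tracks the number of activated experts, that reduction is the quantity such a scheme targets.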
[297] Neural network initialization with nonlinear characteristics and information on spectral bias
Hikaru Homma, Jun Ohkubo
Main category: cs.LG
TL;DR: Proposes a spectral bias-aware initialization method that adjusts SWIM algorithm scale factors to capture low-frequency components in early layers and high-frequency components in late layers, improving training performance.
Details
Motivation: Neural network initialization significantly impacts learning performance, and spectral bias (learning coarse information in earlier layers) can be leveraged for better initialization strategies.
Method: A framework that modifies the SWIM algorithm by adjusting scale factors to align with spectral bias - early layers focus on low-frequency components, late layers on high-frequency components.
Result: Numerical experiments on 1D regression and MNIST classification tasks show the proposed method outperforms conventional initialization algorithms.
Conclusion: The work demonstrates the importance of spectral properties in neural network learning and provides an effective initialization strategy that enhances training performance.
Abstract: Initialization of neural network parameters, such as weights and biases, has a crucial impact on learning performance; if chosen well, we can even avoid the need for additional training with backpropagation. For example, algorithms based on the ridgelet transform or the SWIM (sampling where it matters) concept have been proposed for initialization. On the other hand, it is well-known that neural networks tend to learn coarse information in the earlier layers. The feature is called spectral bias. In this work, we investigate the effects of utilizing information on the spectral bias in the initialization of neural networks. Hence, we propose a framework that adjusts the scale factors in the SWIM algorithm to capture low-frequency components in the early-stage hidden layers and to represent high-frequency components in the late-stage hidden layers. Numerical experiments on a one-dimensional regression task and the MNIST classification task demonstrate that the proposed method outperforms the conventional initialization algorithms. This work clarifies the importance of intrinsic spectral properties in learning neural networks, and the finding yields an effective parameter initialization strategy that enhances their training performance.
[298] Probabilistic Graph Cuts
Ayoub Ghriss
Main category: cs.LG
TL;DR: A unified probabilistic framework for differentiable graph partitioning that provides tight analytic bounds on expected cuts via integral representations and hypergeometric functions, enabling scalable learning without eigendecompositions.
Details
Motivation: Prior probabilistic relaxations of graph cuts focused only on RatioCut and lacked general guarantees and principled gradients, limiting their applicability to broader clustering objectives like Normalized Cut.
Method: Developed a unified probabilistic framework using integral representations and Gauss hypergeometric functions to derive tight analytic upper bounds on expected discrete cuts, with closed-form forward and backward computations.
Result: The framework provides rigorous, numerically stable foundations for scalable differentiable graph partitioning that covers a wide range of clustering and contrastive learning objectives.
Conclusion: This work delivers a comprehensive probabilistic approach to differentiable graph partitioning that overcomes limitations of prior methods and enables end-to-end learning across diverse cut-based objectives.
Abstract: Probabilistic relaxations of graph cuts offer a differentiable alternative to spectral clustering, enabling end-to-end and online learning without eigendecompositions, yet prior work centered on RatioCut and lacked general guarantees and principled gradients. We present a unified probabilistic framework that covers a wide class of cuts, including Normalized Cut. Our framework provides tight analytic upper bounds on expected discrete cuts via integral representations and Gauss hypergeometric functions, with closed-form forward and backward computations. Together, these results deliver a rigorous, numerically stable foundation for scalable, differentiable graph partitioning covering a wide range of clustering and contrastive learning objectives.
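To make "expected discrete cut" concrete, here is a minimal sketch of the simplest case only: the unnormalized two-way cut when each node joins cluster A independently with probability `p[i]`. The independence assumption, the symmetric zero-diagonal weight matrix, and the function names are choices made for this illustration. The paper's bounds concern ratio objectives such as RatioCut and Normalized Cut, which this sketch does not reproduce.

```python
import itertools
import numpy as np

def expected_cut(W, p):
    """E[cut(A, A^c)] for independent assignments with P(i in A) = p[i].

    Assumes W is symmetric with a zero diagonal, so that
    E[z_i * (1 - z_j)] = p_i * (1 - p_j) for every contributing pair.
    """
    return float(p @ W @ (1.0 - p))

def expected_cut_bruteforce(W, p):
    """Same expectation by enumerating all 2^n assignments (small n only)."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=len(p)):
        z = np.array(bits, dtype=float)
        prob = np.prod(np.where(z == 1, p, 1.0 - p))
        total += prob * (z @ W @ (1.0 - z))
    return float(total)

if __name__ == "__main__":
    W = np.array([[0, 1, 2, 0],
                  [1, 0, 0, 3],
                  [2, 0, 0, 1],
                  [0, 3, 1, 0]], dtype=float)
    p = np.array([0.9, 0.2, 0.5, 0.7])
    print(expected_cut(W, p), expected_cut_bruteforce(W, p))
```

The closed form is a polynomial in `p`, so it is differentiable without any eigendecomposition. The difficulty the paper addresses arises once the cut is divided by a random cluster size or volume.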
[299] Gradient-Variation Online Adaptivity for Accelerated Optimization with Hölder Smoothness
Yuheng Zhao, Yu-Hu Yan, Kfir Yehuda Levy, Peng Zhao
Main category: cs.LG
TL;DR: This paper connects online learning with gradient-variation regret minimization to offline optimization, focusing on Hölder smooth functions that include both smooth and non-smooth cases. The authors develop adaptive online algorithms that don’t require prior knowledge of smoothness parameters and achieve optimal guarantees across regimes.
Details
Motivation: To bridge the gap between accelerated offline optimization and gradient-variation online learning, particularly for Hölder smooth functions that generalize both smooth and non-smooth cases, enabling adaptive algorithms that work across different smoothness regimes.
Method: Design gradient-variation online learning algorithms for (strongly) convex functions that adapt to Hölder smoothness without prior knowledge, and combine online adaptivity with a detection-based guess-and-check procedure for offline optimization.
Result: The online algorithms achieve regret that smoothly interpolates between optimal guarantees in smooth and non-smooth regimes. Through online-to-batch conversion, this yields an optimal universal method for stochastic convex optimization under Hölder smoothness.
Conclusion: The paper successfully develops universal methods that achieve accelerated convergence in smooth regimes while maintaining near-optimal performance in non-smooth regimes, bridging online learning and offline optimization for Hölder smooth functions.
Abstract: Smoothness is known to be crucial for acceleration in offline optimization, and for gradient-variation regret minimization in online learning. Interestingly, these two problems are actually closely connected: accelerated optimization can be understood through the lens of gradient-variation online learning. In this paper, we investigate online learning with Hölder smooth functions, a general class encompassing both smooth and non-smooth (Lipschitz) functions, and explore its implications for offline optimization. For (strongly) convex online functions, we design the corresponding gradient-variation online learning algorithm whose regret smoothly interpolates between the optimal guarantees in smooth and non-smooth regimes. Notably, our algorithms do not require prior knowledge of the Hölder smoothness parameter, exhibiting strong adaptivity over existing methods. Through online-to-batch conversion, this gradient-variation online adaptivity yields an optimal universal method for stochastic convex optimization under Hölder smoothness. However, achieving universality in offline strongly convex optimization is more challenging. We address this by integrating online adaptivity with a detection-based guess-and-check procedure, which, for the first time, yields a universal offline method that achieves accelerated convergence in the smooth regime while maintaining near-optimal convergence in the non-smooth one.
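For reference, the standard definition of Hölder smoothness shows why the class covers both regimes. The symbols $L_\nu$, $\nu$, and $G$ below follow common usage and are not taken from the abstract, so the paper's own notation may differ.

```latex
% f is (L_\nu, \nu)-Hölder smooth if its (sub)gradients satisfy
\|\nabla f(x) - \nabla f(y)\| \le L_\nu \, \|x - y\|^{\nu}
\quad \text{for all } x, y, \qquad \nu \in [0, 1].

% \nu = 1: the usual L-smooth case (Lipschitz-continuous gradient).
% \nu = 0: gradient differences are merely bounded, which holds for any
%          G-Lipschitz function with L_0 = 2G (the non-smooth case).
```

An algorithm that adapts to unknown $\nu$ therefore has to cover the whole range between these two endpoints without being told which one applies.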
[300] Reinforcement learning based data assimilation for unknown state model
Ziyi Wang, Lijian Jiang
Main category: cs.LG
TL;DR: A novel method that integrates reinforcement learning with ensemble-based Bayesian filtering to learn surrogate state transition models for unknown dynamics directly from noisy observations, without requiring true state trajectories.
Details
Motivation: Data assimilation is challenging when governing equations are unknown, and existing machine learning approaches require noise-free ground-truth state sequences which are often infeasible to obtain in practice.
Method: Formulate maximum likelihood estimation of surrogate model parameters as a sequential decision-making problem using Markov decision processes, then use reinforcement learning to find optimal policies for learning transition models offline.
Result: The framework accommodates nonlinear and partially observed measurement models, and numerical examples demonstrate superior accuracy and robustness in high-dimensional settings.
Conclusion: The proposed method enables effective state estimation for unknown dynamics directly from noisy observations, overcoming the limitation of requiring true state trajectories for training.
Abstract: Data assimilation (DA) has increasingly emerged as a critical tool for state estimation across a wide range of applications. It is significantly challenging when the governing equations of the underlying dynamics are unknown. To this end, various machine learning approaches have been employed to construct a surrogate state transition model in a supervised learning framework, which relies on pre-computed training datasets. However, it is often infeasible to obtain noise-free ground-truth state sequences in practice. To address this challenge, we propose a novel method that integrates reinforcement learning with ensemble-based Bayesian filtering methods, enabling the learning of a surrogate state transition model for unknown dynamics directly from noisy observations, without using true state trajectories. Specifically, we treat the process for computing maximum likelihood estimation of surrogate model parameters as a sequential decision-making problem, which can be formulated as a discrete-time Markov decision process (MDP). Under this formulation, learning the surrogate transition model is equivalent to finding an optimal policy of the MDP, which can be effectively addressed using reinforcement learning techniques. Once the model is trained offline, state estimation can be performed in the online stage using filtering methods based on the learned dynamics. The proposed framework accommodates a wide range of observation scenarios, including nonlinear and partially observed measurement models. A few numerical examples demonstrate that the proposed method achieves superior accuracy and robustness in high-dimensional settings.
[301] Federated Quantum Kernel Learning for Anomaly Detection in Multivariate IoT Time-Series
Kuan-Cheng Chen, Samuel Yen-Chi Chen, Chen-Yu Liu, Kin K. Leung
Main category: cs.LG
TL;DR: Federated Quantum Kernel Learning (FQKL) framework combines quantum feature maps with federated learning for privacy-preserving anomaly detection in industrial IoT systems, achieving better performance with reduced communication costs.
Details
Motivation: Address challenges in IIoT anomaly detection including privacy concerns, scalability issues, and communication efficiency, while overcoming limitations of classical federated learning in handling non-linear decision boundaries and imbalanced anomaly distributions.
Method: Propose FQKL framework where quantum edge nodes locally compute compressed kernel statistics using parameterized quantum circuits, share only summaries with central server, which constructs global Gram matrix and trains decision functions like Fed-QSVM.
Result: Experimental results on synthetic IIoT benchmarks show FQKL achieves superior generalization in capturing complex temporal correlations compared to classical federated baselines, while significantly reducing communication overhead.
Conclusion: The work demonstrates the promise of quantum kernels in federated settings and advances scalable, robust quantum-enhanced intelligence for next-generation IoT infrastructures.
Abstract: The rapid growth of industrial Internet of Things (IIoT) systems has created new challenges for anomaly detection in high-dimensional, multivariate time-series, where privacy, scalability, and communication efficiency are critical. Classical federated learning approaches mitigate privacy concerns by enabling decentralized training, but they often struggle with highly non-linear decision boundaries and imbalanced anomaly distributions. To address this gap, we propose a Federated Quantum Kernel Learning (FQKL) framework that integrates quantum feature maps with federated aggregation to enable distributed, privacy-preserving anomaly detection across heterogeneous IoT networks. In our design, quantum edge nodes locally compute compressed kernel statistics using parameterized quantum circuits and share only these summaries with a central server, which constructs a global Gram matrix and trains a decision function (e.g., Fed-QSVM). Experimental results on synthetic IIoT benchmarks demonstrate that FQKL achieves superior generalization in capturing complex temporal correlations compared to classical federated baselines, while significantly reducing communication overhead. This work highlights the promise of quantum kernels in federated settings, advancing the path toward scalable, robust, and quantum-enhanced intelligence for next-generation IoT infrastructures.
[302] FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
Fengjuan Wang, Zhiyi Su, Xingzhu Hu, Cheng Wang, Mou Sun
Main category: cs.LG
TL;DR: FP8-Flow-MoE enables efficient FP8 training for large Mixture-of-Experts models by eliminating redundant cast operations through quantization-consistent dataflow and fused operators, achieving significant throughput gains and memory savings while maintaining convergence stability.
Details
Motivation: Current FP8 training for large MoE models suffers from frequent quantize-dequantize conversions that erode efficiency benefits, while naive FP8-only approaches cause numerical instability due to double quantization errors across different tensor dimensions.
Method: Proposes FP8-Flow-MoE with quantization-consistent FP8-centric dataflow, scaling-aware transpose, and fused FP8 operators that reduce explicit cast operations from 12 to 2, eliminating redundant conversions.
Result: Achieves up to 21% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and naive FP8 baselines on a 671B-parameter MoE model, while maintaining stable convergence.
Conclusion: FP8-Flow-MoE provides an effective plug-and-play FP8 training recipe compatible with existing frameworks like TransformerEngine and Megatron-LM, enabling efficient training of large MoE models with significant performance improvements.
Abstract: Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8’s theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and reduce explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and naive FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced soon.
[303] The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute
Aman Sharma, Paras Chopra
Main category: cs.LG
TL;DR: Sequential scaling (iterative refinement) outperforms parallel self-consistency in language model reasoning, with a novel inverse-entropy voting method further boosting accuracy.
Details
Motivation: To determine whether parallel chains or sequential refinement is more effective for language model reasoning at equal token/compute budget, challenging the dominant parallel self-consistency paradigm.
Method: Comprehensive evaluation across 5 models and 3 benchmarks comparing parallel vs sequential scaling, plus introducing inverse-entropy weighted voting that weights answers by inverse entropy of reasoning chains.
Result: Sequential scaling outperforms parallel self-consistency in 95.6% of configurations with accuracy gains up to 46.7%. Inverse-entropy voting further improves sequential scaling success rate over parallel majority voting.
Conclusion: Sequential refinement should replace parallel self-consistency as the default test-time scaling strategy for modern LLM reasoning, representing a paradigm shift in inference-time optimization.
Abstract: We revisit test-time scaling for language model reasoning and ask a fundamental question: at equal token budget and compute, is it better to run multiple independent chains in parallel, or to run fewer chains that iteratively refine through sequential steps? Through comprehensive evaluation across 5 state-of-the-art open-source models and 3 challenging reasoning benchmarks, we find that sequential scaling, where chains explicitly build upon previous attempts, consistently outperforms the dominant parallel self-consistency paradigm in 95.6% of configurations, with accuracy gains of up to 46.7%. Further, we introduce inverse-entropy weighted voting, a novel training-free method to further boost the accuracy of sequential scaling. By weighting answers in proportion to the inverse entropy of their reasoning chains, we increase our success rate over parallel majority voting and establish it as the optimal test-time scaling strategy. Our findings fundamentally challenge the parallel reasoning orthodoxy that has dominated test-time scaling since Wang et al.’s self-consistency decoding (Wang et al., 2022), positioning sequential refinement as the robust default for modern LLM reasoning and necessitating a paradigm shift in how we approach inference-time optimization.
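Inverse-entropy weighted voting can be sketched in a few lines. The abstract does not say how a chain's entropy is computed, so this sketch assumes the mean Shannon entropy of the per-token output distributions. The function names and the `eps` smoothing term are hypothetical.

```python
import math
from collections import defaultdict

def mean_token_entropy(token_dists):
    """Mean Shannon entropy (nats) over a chain's per-token distributions.

    Assumption for illustration: the paper may define chain entropy
    differently.
    """
    ents = [-sum(p * math.log(p) for p in dist if p > 0.0)
            for dist in token_dists]
    return sum(ents) / len(ents)

def inverse_entropy_vote(answers, entropies, eps=1e-6):
    """Pick the answer with the largest sum of 1 / entropy weights.

    answers   -- final answer of each reasoning chain
    entropies -- one entropy value per chain (lower = more confident)
    eps       -- hypothetical smoothing to avoid division by zero
    """
    scores = defaultdict(float)
    for answer, entropy in zip(answers, entropies):
        scores[answer] += 1.0 / (entropy + eps)
    return max(scores, key=scores.get)

if __name__ == "__main__":
    answers = ["A", "B", "B"]
    entropies = [0.1, 1.0, 1.0]  # chain 1 is far more confident
    print(inverse_entropy_vote(answers, entropies))
```

With equal entropies the rule reduces to plain majority voting. It departs from the majority only when a minority answer comes from markedly lower-entropy chains.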
[304] Large-scale automatic carbon ion treatment planning for head and neck cancers via parallel multi-agent reinforcement learning
Jueye Zhang, Chao Yang, Youfang Lai, Kai-Wen Li, Wenting Yan, Yunzhou Xia, Haimei Zhang, Jingjing Zhou, Gen Yang, Chen Lin, Tian Li, Yibao Zhang
Main category: cs.LG
TL;DR: A multi-agent reinforcement learning framework for automatically tuning 45 treatment-planning parameters in intensity-modulated carbon-ion therapy for head-and-neck cancer, achieving plan quality comparable to or better than expert manual tuning.
Details
Motivation: Head-and-neck cancer planning is challenging due to complex targets near critical organs. Current IMCT planning is slow and suboptimal due to laborious manual parameter tuning, while existing DL and RL methods face limitations in data bias, plan feasibility, and efficient exploration of large parameter spaces.
Method: Proposed a scalable multi-agent RL framework using CTDE QMIX backbone with Double DQN, Dueling DQN, and DRQN. Features include compact DVH state inputs, linear action-to-value transform for parameter adjustments, and clinically informed piecewise reward. Uses synchronous multi-process workers interfacing with PHOENIX TPS for parallel optimization.
Result: Successfully tuned 45 parameters simultaneously, producing plans comparable to or better than expert manual plans (RL 85.93±7.85% vs Manual 85.02±6.92% relative plan score). Showed significant improvements for five OARs with p-value < 0.05.
Conclusion: The framework efficiently explores high-dimensional parameter spaces and generates clinically competitive IMCT plans through direct TPS interaction, notably improving organ-at-risk sparing in head-and-neck cancer treatment planning.
Abstract: Head-and-neck cancer (HNC) planning is difficult because multiple critical organs-at-risk (OARs) are close to complex targets. Intensity-modulated carbon-ion therapy (IMCT) offers superior dose conformity and OAR sparing but remains slow due to relative biological effectiveness (RBE) modeling, leading to laborious, experience-based, and often suboptimal tuning of many treatment-planning parameters (TPPs). Recent deep learning (DL) methods are limited by data bias and plan feasibility, while reinforcement learning (RL) struggles to efficiently explore the exponentially large TPP search space. We propose a scalable multi-agent RL (MARL) framework for parallel tuning of 45 TPPs in IMCT. It uses a centralized-training decentralized-execution (CTDE) QMIX backbone with Double DQN, Dueling DQN, and recurrent encoding (DRQN) for stable learning in a high-dimensional, non-stationary environment. To enhance efficiency, we (1) use compact historical DVH vectors as state inputs, (2) apply a linear action-to-value transform mapping small discrete actions to uniform parameter adjustments, and (3) design an absolute, clinically informed piecewise reward aligned with plan scores. A synchronous multi-process worker system interfaces with the PHOENIX TPS for parallel optimization and accelerated data collection. On a head-and-neck dataset (10 training, 10 testing), the method tuned 45 parameters simultaneously and produced plans comparable to or better than expert manual ones (relative plan score: RL $85.93\pm7.85\%$ vs Manual $85.02\pm6.92\%$), with significant (p-value $<$ 0.05) improvements for five OARs. The framework efficiently explores high-dimensional TPP spaces and generates clinically competitive IMCT plans through direct TPS interaction, notably improving OAR sparing.
[305] RoME: Domain-Robust Mixture-of-Experts for MILP Solution Prediction across Domains
Tianle Pu, Zijie Geng, Haoyang Liu, Shixuan Liu, Jie Wang, Li Zeng, Chao Chen, Changjun Fan
Main category: cs.LG
TL;DR: RoME is a domain-robust mixture-of-experts framework for predicting MILP solutions across domains, using dynamic routing and two-level distributionally robust optimization to enhance generalization.
Details
Motivation: Existing learning-based MILP solvers are limited to single-domain settings and struggle to generalize to unseen problem distributions, hindering the development of scalable general-purpose solvers.
Method: RoME uses dynamic routing of problem instances to specialized experts based on learned task embeddings, with two-level distributionally robust optimization: inter-domain for global shifts and intra-domain for local robustness via task embedding perturbations.
Result: A single RoME model trained on three domains achieves 67.7% average improvement when evaluated on five diverse domains, and shows measurable performance gains on MIPLIB in zero-shot settings.
Conclusion: Cross-domain training enhances both generalization to unseen domains and performance within individual domains by capturing more general intrinsic combinatorial patterns, demonstrating RoME’s effectiveness for building scalable learning-based MILP solvers.
Abstract: Mixed-Integer Linear Programming (MILP) is a fundamental and powerful framework for modeling complex optimization problems across diverse domains. Recently, learning-based methods have shown great promise in accelerating MILP solvers by predicting high-quality solutions. However, most existing approaches are developed and evaluated in single-domain settings, limiting their ability to generalize to unseen problem distributions. This limitation poses a major obstacle to building scalable and general-purpose learning-based solvers. To address this challenge, we introduce RoME, a domain-Robust Mixture-of-Experts framework for predicting MILP solutions across domains. RoME dynamically routes problem instances to specialized experts based on learned task embeddings. The model is trained using a two-level distributionally robust optimization strategy: inter-domain to mitigate global shifts across domains, and intra-domain to enhance local robustness by introducing perturbations on task embeddings. We reveal that cross-domain training not only enhances the model’s generalization capability to unseen domains but also improves performance within each individual domain by encouraging the model to capture more general intrinsic combinatorial patterns. Specifically, a single RoME model trained on three domains achieves an average improvement of 67.7% when evaluated on five diverse domains. We further test the pretrained model on MIPLIB in a zero-shot setting, demonstrating its ability to deliver measurable performance gains on challenging real-world instances where existing learning-based approaches often struggle to generalize.
[306] Learning A Universal Crime Predictor with Knowledge-guided Hypernetworks
Fidan Karimova, Tong Chen, Yu Yang, Shazia Sadiq
Main category: cs.LG
TL;DR: HYSTL is a hypernetwork-enhanced framework for cross-city crime prediction that handles different crime types across cities by using a knowledge graph to bridge semantic gaps and dynamically generate prediction parameters.
Details
Motivation: Existing crime prediction methods struggle with aligning knowledge across diverse cities that have varying data availability for different crime types, making unified prediction challenging.
Method: Uses a hypernetwork to dynamically generate parameters for prediction functions conditioned on crime types, with a structured crime knowledge graph to bridge semantic gaps between different crime types and guide predictions through crime associations.
Result: Extensive experiments on two cities with non-overlapping crime types show HYSTL outperforms state-of-the-art baselines.
Conclusion: HYSTL effectively trains a unified crime predictor that works across cities with different crime type records by leveraging hypernetworks and crime knowledge graphs.
Abstract: Predicting crimes in urban environments is crucial for public safety, yet existing prediction methods often struggle to align the knowledge across diverse cities that vary dramatically in data availability of specific crime types. We propose HYpernetwork-enhanced Spatial Temporal Learning (HYSTL), a framework that can effectively train a unified, stronger crime predictor without assuming identical crime types in different cities’ records. In HYSTL, instead of parameterising a dedicated predictor per crime type, a hypernetwork is designed to dynamically generate parameters for the prediction function conditioned on the crime type of interest. To bridge the semantic gap between different crime types, a structured crime knowledge graph is built, where the learned representations of crimes are used as the input to the hypernetwork to facilitate parameter generation. As such, when making predictions for each crime type, the predictor is additionally guided by its intricate association with other relevant crime types. Extensive experiments are performed on two cities with non-overlapping crime types, and the results demonstrate HYSTL outperforms state-of-the-art baselines.
[307] Reducing normalizing flow complexity for MCMC preconditioning
David Nabergoj, Erik Štrumbelj
Main category: cs.LG
TL;DR: Proposes a factorized preconditioning architecture combining linear preconditioning with conditional normalizing flows to improve MCMC sampling efficiency for complex target distributions.
Details
Motivation: Existing nonlinear preconditioners using overparameterized normalizing flows can degrade sampling efficiency and lack adaptability to target distribution geometry.
Method: Factorized architecture with linear preconditioner for approximately Gaussian dimensions and conditional normalizing flows for complex dimensions, using warmup samples to estimate which dimensions are Gaussian.
Result: Significantly better tail samples on synthetic distributions, consistent performance on sparse logistic regression, and higher effective sample sizes on hierarchical Bayesian models with weak likelihoods and strong funnel geometries.
Conclusion: The factorized approach improves sampling efficiency for complex distributions, particularly relevant for hierarchical Bayesian models with limited data, and informs neural MCMC design.
Abstract: Preconditioning is a key component of MCMC algorithms that improves sampling efficiency by facilitating exploration of geometrically complex target distributions through an invertible map. While linear preconditioners are often sufficient for moderately complex target distributions, recent work has explored nonlinear preconditioning with invertible neural networks as components of normalizing flows (NFs). However, empirical and theoretical studies show that overparameterized NF preconditioners can degrade sampling efficiency and fit quality. Moreover, existing NF-based approaches do not adapt their architectures to the target distribution. Related work outside of MCMC similarly finds that suitably parameterized NFs can achieve comparable or superior performance with substantially less training time or data. We propose a factorized preconditioning architecture that reduces NF complexity by combining a linear component with a conditional NF, improving adaptability to target geometry. The linear preconditioner is applied to dimensions that are approximately Gaussian, as estimated from warmup samples, while the conditional NF models more complex dimensions. Our method yields significantly better tail samples on two complex synthetic distributions and consistently better performance on a sparse logistic regression posterior across varying likelihood and prior strengths. It also achieves higher effective sample sizes on hierarchical Bayesian model posteriors with weak likelihoods and strong funnel geometries. This approach is particularly relevant for hierarchical Bayesian model analyses with limited data and could inform current theoretical and software strides in neural MCMC design.
[308] Evolving Graph Learning for Out-of-Distribution Generalization in Non-stationary Environments
Qingyun Sun, Jiayi Luo, Haonan Yuan, Xingcheng Fu, Hao Peng, Jianxin Li, Philip S. Yu
Main category: cs.LG
TL;DR: Proposes EvoOOD, a framework for out-of-distribution generalization on dynamic graphs using environment-aware invariant pattern recognition to handle evolving non-stationary environments.
Details
Motivation: Existing GNNs show poor generalization under distribution shifts in dynamic graphs, which is inevitable in evolving non-stationary environments.
Method: Uses environment sequential variational auto-encoder to model environment evolution, environment-aware invariant pattern recognition, and fine-grained causal interventions on nodes.
Result: Superior performance on real-world and synthetic dynamic datasets under distribution shifts compared to existing methods.
Conclusion: First work to address dynamic graph OOD generalization from environment evolution perspective, demonstrating effectiveness through invariant pattern recognition in non-stationary environments.
Abstract: Graph neural networks have shown remarkable success in exploiting the spatial and temporal patterns on dynamic graphs. However, existing GNNs exhibit poor generalization ability under distribution shifts, which are inevitable in dynamic scenarios. As dynamic graph generation progresses amid evolving latent non-stationary environments, it is imperative to explore their effects on out-of-distribution (OOD) generalization. This paper proposes a novel Evolving Graph Learning framework for OOD generalization (EvoOOD) by environment-aware invariant pattern recognition. Specifically, we first design an environment sequential variational auto-encoder to model environment evolution and infer the underlying environment distribution. Then, we introduce a mechanism for environment-aware invariant pattern recognition, tailored to address environmental diversification through inferred distributions. Finally, we conduct fine-grained causal interventions on individual nodes using a mixture of instantiated environment samples. This approach helps to distinguish spatio-temporal invariant patterns for OOD prediction, especially in non-stationary environments. Experimental results demonstrate the superiority of EvoOOD on both real-world and synthetic dynamic datasets under distribution shifts. To the best of our knowledge, it is the first attempt to study the dynamic graph OOD generalization problem from the environment evolution perspective.
[309] LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment
Rohan Wandre, Yash Gajewar, Namrata Patel, Vivek Dhalkari
Main category: cs.LG
TL;DR: LUMA-RAG is a lifelong multimodal agent architecture that addresses challenges in maintaining index freshness and cross-modal semantic consistency for modern AI agents using streaming memory systems, alignment bridges, and stability-aware retrieval.
Details
Motivation: As AI agents transition to continuous multimodal streams, there are critical challenges in maintaining index freshness without prohibitive re-indexing costs and preserving cross-modal semantic consistency across heterogeneous embedding spaces.
Method: Three key innovations: (i) streaming multi-tier memory system with dynamic embedding spilling, (ii) streaming CLAP->CLIP alignment bridge with incremental orthogonal Procrustes updates, (iii) stability-aware retrieval telemetry with Safe@k guarantees.
Result: Robust text-to-image retrieval (Recall@10 = 0.94), graceful performance degradation under product quantization offloading, and provably stable audio-to-image rankings (Safe@1 = 1.0).
Conclusion: LUMA-RAG establishes itself as a practical framework for production multimodal RAG systems, addressing the challenges of continuous multimodal data streams.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as the dominant paradigm for grounding large language model outputs in verifiable evidence. However, as modern AI agents transition from static knowledge bases to continuous multimodal streams encompassing text, images, video, and audio, two critical challenges arise: maintaining index freshness without prohibitive re-indexing costs, and preserving cross-modal semantic consistency across heterogeneous embedding spaces. We present LUMA-RAG, a lifelong multimodal agent architecture featuring three key innovations: (i) a streaming, multi-tier memory system that dynamically spills embeddings from a hot HNSW tier to a compressed IVFPQ tier under strict memory budgets; (ii) a streaming CLAP->CLIP alignment bridge that maintains cross-modal consistency through incremental orthogonal Procrustes updates; and (iii) stability-aware retrieval telemetry providing Safe@k guarantees by jointly bounding alignment drift and quantization error. Experiments demonstrate robust text-to-image retrieval (Recall@10 = 0.94), graceful performance degradation under product quantization offloading, and provably stable audio-to-image rankings (Safe@1 = 1.0), establishing LUMA-RAG as a practical framework for production multimodal RAG systems.
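The alignment bridge in (ii) can be illustrated with a toy version. The sketch below is not the LUMA-RAG implementation: it restricts the problem to 2-D rotations, where the least-squares (Procrustes) rotation has a closed form, and keeps two running sums so that each new (source, target) pair updates the map in constant time. The class name `StreamingRotationAligner`, its methods, and the helper `rotate` are hypothetical; a real CLAP->CLIP bridge would presumably solve the d-dimensional problem with an SVD of an accumulated cross-covariance matrix.

```python
import math


def rotate(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


class StreamingRotationAligner:
    """Toy 2-D stand-in for an incremental orthogonal Procrustes bridge."""

    def __init__(self):
        self.cos_sum = 0.0  # running sum of dot products a . b
        self.sin_sum = 0.0  # running sum of 2-D cross products a x b

    def update(self, a, b):
        """Fold one (source, target) embedding pair into the running sums."""
        self.cos_sum += a[0] * b[0] + a[1] * b[1]
        self.sin_sum += a[0] * b[1] - a[1] * b[0]

    def angle(self):
        """Angle of the rotation minimizing sum ||R a - b||^2 over pairs seen so far."""
        return math.atan2(self.sin_sum, self.cos_sum)

    def apply(self, a):
        """Map a source-space vector into the target space."""
        return rotate(a, self.angle())


# Synthetic stream: targets are sources rotated by a fixed, known angle.
true_angle = 0.5
aligner = StreamingRotationAligner()
for source in [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5)]:
    aligner.update(source, rotate(source, true_angle))

recovered_angle = aligner.angle()
```

Because only the two sums are stored, the map can be refreshed after every pair without revisiting earlier data, which is the property a streaming bridge needs.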
[310] A Spatially Informed Gaussian Process UCB Method for Decentralized Coverage Control
Gennaro Guidone, Luca Monegaglia, Elia Raimondi, Han Wang, Mattia Bianchi, Florian Dörfler
Main category: cs.LG
TL;DR: A decentralized coverage control algorithm using Gaussian Processes that balances exploration and exploitation through local cost minimization with GP-UCB-inspired acquisition functions.
Details
Motivation: To develop a fully decentralized coverage control method for unknown environments that doesn't require global information, enabling scalable and autonomous multi-agent systems.
Method: Each agent autonomously minimizes a local cost function combining expected locational cost and variance-based exploration, using GP-UCB-inspired acquisition functions and periodic inducing point updates with greedy selection for scalable online GP updates.
Result: The algorithm operates effectively in simulation, demonstrating successful decentralized coverage control in unknown spatial environments with only local observations and neighbor communication.
Conclusion: The proposed decentralized approach provides an effective solution for coverage control in unknown environments, balancing exploration-exploitation trade-offs while maintaining scalability through local computations and communications.
Abstract: We present a novel decentralized algorithm for coverage control in unknown spatial environments modeled by Gaussian Processes (GPs). To trade off between exploration and exploitation, each agent autonomously determines its trajectory by minimizing a local cost function. Inspired by the GP-UCB (Upper Confidence Bound for GPs) acquisition function, the proposed cost combines the expected locational cost with a variance-based exploration term, guiding agents toward regions that are both high in predicted density and model uncertainty. Compared to previous work, our algorithm operates in a fully decentralized fashion, relying only on local observations and communication with neighboring agents. In particular, agents periodically update their inducing points using a greedy selection strategy, enabling scalable online GP updates. We demonstrate the effectiveness of our algorithm in simulation.
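The exploration-exploitation trade-off described above follows the general upper-confidence-bound idea: score each candidate by its predicted mean plus a multiple of its predictive standard deviation. The sketch below shows only that generic rule, not the paper's local cost function. The posterior means and standard deviations are made-up inputs (no GP is fitted), and `ucb_score`, `pick_next_location`, and `kappa` are hypothetical names.

```python
def ucb_score(mean, std, kappa):
    """Generic UCB-style score: predicted value plus an uncertainty bonus."""
    return mean + kappa * std


def pick_next_location(candidates, kappa):
    """candidates: list of (label, posterior_mean, posterior_std) tuples."""
    return max(candidates, key=lambda c: ucb_score(c[1], c[2], kappa))[0]


# Made-up posterior summaries for three candidate regions.
candidates = [
    ("A", 0.9, 0.05),  # high predicted density, well explored
    ("B", 0.6, 0.30),  # moderate density, very uncertain
    ("C", 0.2, 0.10),  # low density, fairly certain
]

greedy_choice = pick_next_location(candidates, kappa=0.0)       # pure exploitation
exploring_choice = pick_next_location(candidates, kappa=2.0)    # uncertainty bonus on
```

With `kappa=0.0` the rule is purely greedy and stays at the well-explored region; raising `kappa` shifts the choice toward the uncertain one.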
[311] Improving Unlearning with Model Updates Probably Aligned with Gradients
Virgile Dine, Teddy Furon, Charly Faure
Main category: cs.LG
TL;DR: The paper formulates machine unlearning as a constrained optimization problem and introduces feasible updates based on parameter masking to enable unlearning while preserving model utility.
Details
Motivation: To develop a unified framework for machine unlearning that can be applied to existing first-order methods while ensuring statistical guarantees and maintaining model performance.
Method: Proposes feasible updates using parameter masking to selectively update model parameters, accounting for gradient estimation noise to provide statistical guarantees for local feasibility.
Result: Experiments with computer vision classifiers validate that the approach effectively enables unlearning without degrading the initial model’s utility.
Conclusion: The feasible update technique provides a plug-and-play solution that can enhance any first-order approximate unlearning method with statistical guarantees.
Abstract: We formulate the machine unlearning problem as a general constrained optimization problem. It unifies the first-order methods from the approximate machine unlearning literature. This paper then introduces the concept of feasible updates as the model’s parameter update directions that help with unlearning while not degrading the utility of the initial model. Our design of feasible updates is based on masking, i.e., a careful selection of the model’s parameters worth updating. It also takes into account the estimation noise of the gradients when processing each batch of data to offer a statistical guarantee to derive locally feasible updates. The technique can be plugged in, as an add-on, to any first-order approximate unlearning method. Experiments with computer vision classifiers validate this approach.
[312] SKGE: Spherical Knowledge Graph Embedding with Geometric Regularization
Xuan-Truong Quan, Xuan-Son Quan, Duc Do Minh, Vinh Nguyen Van
Main category: cs.LG
TL;DR: SKGE proposes spherical knowledge graph embedding by constraining entities to a hypersphere, outperforming Euclidean models like TransE through geometric regularization and inherent hard negative sampling.
Details
Motivation: Euclidean KGE models have inherent limitations in modeling complex relations and inefficient training due to unbounded space. Spherical geometry offers better regularization and representation learning.
Method: SKGE uses a learnable Spherization Layer to map entities onto a hypersphere and interprets relations as translate-then-project transformations on the spherical manifold.
Result: SKGE consistently outperforms TransE on FB15k-237, CoDEx-S, and CoDEx-M benchmarks, showing significant gains on large-scale datasets and across all relation types.
Conclusion: Spherical geometry provides powerful regularization and inherent hard negative sampling, making manifold choice a fundamental design principle for next-generation KGE models.
Abstract: Knowledge graph embedding (KGE) has become a fundamental technique for representation learning on multi-relational data. Many seminal models, such as TransE, operate in an unbounded Euclidean space, which presents inherent limitations in modeling complex relations and can lead to inefficient training. In this paper, we propose Spherical Knowledge Graph Embedding (SKGE), a model that challenges this paradigm by constraining entity representations to a compact manifold: a hypersphere. SKGE employs a learnable, non-linear Spherization Layer to map entities onto the sphere and interprets relations as a hybrid translate-then-project transformation. Through extensive experiments on three benchmark datasets, FB15k-237, CoDEx-S, and CoDEx-M, we demonstrate that SKGE consistently and significantly outperforms its strong Euclidean counterpart, TransE, particularly on large-scale benchmarks such as FB15k-237 and CoDEx-M, demonstrating the efficacy of the spherical geometric prior. We provide an in-depth analysis to reveal the sources of this advantage, showing that this geometric constraint acts as a powerful regularizer, leading to comprehensive performance gains across all relation types. More fundamentally, we prove that the spherical geometry creates an “inherently hard negative sampling” environment, naturally eliminating trivial negatives and forcing the model to learn more robust and semantically coherent representations. Our findings compellingly demonstrate that the choice of manifold is not merely an implementation detail but a fundamental design principle, advocating for geometric priors as a cornerstone for designing the next generation of powerful and stable KGE models.
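The translate-then-project idea can be sketched in a few lines. This is an illustration under stated assumptions, not SKGE itself: plain L2 normalization stands in for the learnable, non-linear Spherization Layer, and the score is taken to be the negative Euclidean distance on the unit sphere. The function names `to_sphere` and `translate_then_project_score` are hypothetical.

```python
import math


def to_sphere(v):
    """Project a vector onto the unit hypersphere (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def translate_then_project_score(head, relation, tail):
    """Translate the spherical head by the relation, re-project, compare to the tail.

    Higher (closer to zero) means a more plausible triple. Assumption: negative
    Euclidean distance between points on the unit sphere is used as the score.
    """
    h = to_sphere(head)
    t = to_sphere(tail)
    moved = to_sphere([hi + ri for hi, ri in zip(h, relation)])
    return -math.sqrt(sum((m - ti) ** 2 for m, ti in zip(moved, t)))


# Toy 2-D triple: the relation moves the head direction onto the tail direction.
score_true = translate_then_project_score([1.0, 0.0], [-1.0, 1.0], [0.0, 3.0])
score_false = translate_then_project_score([1.0, 0.0], [-1.0, 1.0], [1.0, 0.0])
```

Because every point lies on the unit sphere, the distance between any two entities is at most 2, so scores are bounded. This is one concrete sense in which the space is compact, in contrast to unbounded Euclidean embeddings.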
[313] NOWS: Neural Operator Warm Starts for Accelerating Iterative Solvers
Mohammad Sadegh Eshaghi, Cosmin Anitescu, Navid Valizadeh, Yizheng Wang, Xiaoying Zhuang, Timon Rabczuk
Main category: cs.LG
TL;DR: Neural Operator Warm Starts (NOWS) uses learned solution operators to accelerate classical iterative PDE solvers by providing high-quality initial guesses, reducing computational time by up to 90% while maintaining solver stability.
Details
Motivation: High-fidelity PDE simulations are computationally expensive, and while data-driven surrogates are fast, they are often unreliable outside their training distribution. There's a need for approaches that combine neural network speed with traditional solver reliability.
Method: NOWS integrates learned solution operators with classical iterative solvers (like conjugate gradient and GMRES) by using neural operators to generate high-quality initial guesses, leaving existing discretizations and solver infrastructures intact.
Result: The method consistently reduces iteration counts and end-to-end runtime across benchmarks, achieving up to 90% reduction in computational time while preserving the stability and convergence guarantees of traditional numerical algorithms.
Conclusion: NOWS provides a practical and trustworthy hybrid approach that combines the rapid inference of neural operators with the rigor of traditional solvers to accelerate high-fidelity PDE simulations.
Abstract: Partial differential equations (PDEs) underpin quantitative descriptions across the physical sciences and engineering, yet high-fidelity simulation remains a major computational bottleneck for many-query, real-time, and design tasks. Data-driven surrogates can be strikingly fast but are often unreliable when applied outside their training distribution. Here we introduce Neural Operator Warm Starts (NOWS), a hybrid strategy that harnesses learned solution operators to accelerate classical iterative solvers by producing high-quality initial guesses for Krylov methods such as conjugate gradient and GMRES. NOWS leaves existing discretizations and solver infrastructures intact, integrating seamlessly with finite-difference, finite-element, isogeometric-analysis, and finite-volume methods, among others. Across our benchmarks, the learned initialization consistently reduces iteration counts and end-to-end runtime, reducing computational time by up to 90%, while preserving the stability and convergence guarantees of the underlying numerical algorithms. By combining the rapid inference of neural operators with the rigor of traditional solvers, NOWS provides a practical and trustworthy approach to accelerate high-fidelity PDE simulations.
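The warm-start mechanism itself is simple to demonstrate. The sketch below is an illustration, not the NOWS code: it runs an unmodified conjugate gradient solver on a small 1-D Poisson system twice, once from zero and once from an initial guess close to the solution. No neural operator is trained here. The "surrogate" guess is fabricated as the exact solution plus a small smooth error, and all names (`poisson_matvec`, `conjugate_gradient`, `guess`) are hypothetical.

```python
import math


def poisson_matvec(x):
    """Apply the 1-D Poisson stencil tridiag(-1, 2, -1) to x."""
    n = len(x)
    out = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * x[i] - left - right)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, x0, tol=1e-8, max_iter=1000):
    """Standard CG; only the initial guess x0 differs between cold and warm runs."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(x))]
    p = list(r)
    rs = dot(r, r)
    threshold = (tol ** 2) * dot(b, b)  # stop when ||r|| <= tol * ||b||
    iters = 0
    while rs > threshold and iters < max_iter:
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        iters += 1
    return x, iters


n = 40
x_true = [math.sin(0.3 * i) + 0.5 * math.cos(1.7 * i) for i in range(n)]
b = poisson_matvec(x_true)

# Stand-in for a learned operator's prediction: exact solution plus a smooth error.
guess = [x_true[i] + 0.01 * math.sin(math.pi * (i + 1) / (n + 1)) for i in range(n)]

x_cold, cold_iters = conjugate_gradient(poisson_matvec, b, [0.0] * n)
x_warm, warm_iters = conjugate_gradient(poisson_matvec, b, guess)
```

The solver and its stopping criterion are untouched in both runs, so the accuracy of the final answer is still controlled by `tol`. Only the amount of work needed to get there changes.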
[314] BRAINS: A Retrieval-Augmented System for Alzheimer’s Detection and Monitoring
Rajan Das Gupta, Md Kishor Morol, Nafiz Fahad, Md Tanzib Hosain, Sumaya Binte Zilani Choya, Md Jakir Hossen
Main category: cs.LG
TL;DR: BRAINS is a novel AI system that uses Large Language Models for early Alzheimer’s detection, featuring dual modules for cognitive diagnosis and case retrieval to enhance diagnostic accuracy.
Details
Motivation: Address the growing global burden of Alzheimer’s disease and the need for early, accurate detection in regions with limited access to advanced diagnostic tools.
Method: Dual-module architecture: Cognitive Diagnostic Module uses LLMs fine-tuned on cognitive/neuroimaging data; Case Retrieval Module encodes patient profiles and retrieves similar cases for enhanced contextual understanding through case fusion.
Result: Evaluations on real-world datasets demonstrate effectiveness in classifying disease severity and identifying early signs of cognitive decline.
Conclusion: BRAINS shows strong potential as a scalable, explainable assistive tool for early-stage Alzheimer’s detection and offers promise for future applications in neurodegeneration screening.
Abstract: As the global burden of Alzheimer’s disease (AD) continues to grow, early and accurate detection has become increasingly critical, especially in regions with limited access to advanced diagnostic tools. We propose BRAINS (Biomedical Retrieval-Augmented Intelligence for Neurodegeneration Screening) to address this challenge. This novel system harnesses the powerful reasoning capabilities of Large Language Models (LLMs) for Alzheimer’s detection and monitoring. BRAINS features a dual-module architecture: a cognitive diagnostic module and a case-retrieval module. The Diagnostic Module utilizes LLMs fine-tuned on cognitive and neuroimaging datasets – including MMSE, CDR scores, and brain volume metrics – to perform structured assessments of Alzheimer’s risk. Meanwhile, the Case Retrieval Module encodes patient profiles into latent representations and retrieves similar cases from a curated knowledge base. These auxiliary cases are fused with the input profile via a Case Fusion Layer to enhance contextual understanding. The combined representation is then processed with clinical prompts for inference. Evaluations on real-world datasets demonstrate BRAINS’s effectiveness in classifying disease severity and identifying early signs of cognitive decline. This system not only shows strong potential as an assistive tool for scalable, explainable, and early-stage Alzheimer’s disease detection, but also offers hope for future applications in the field.
[315] Variational Geometric Information Bottleneck: Learning the Shape of Understanding
Ronald Katende
Main category: cs.LG
TL;DR: A unified information-geometric framework that formalizes learning as a trade-off between informativeness and geometric simplicity, with curvature and intrinsic dimensionality controlling generalization and sample efficiency.
Details
Motivation: To bridge information theory and geometry in learning by formalizing understanding as a balance between information capture and geometric simplicity, linking representation geometry to sample efficiency.
Method: Proposed Variational Geometric Information Bottleneck (V-GIB) that unifies mutual-information compression with curvature regularization using tractable geometric proxies like Hutchinson trace, Jacobian norms, and local PCA.
Result: Experiments show robust information-geometry Pareto frontier, stable estimators, and substantial gains in interpretive efficiency. Curvature-aware encoders maintain predictive power under data scarcity, validating the efficiency-curvature law.
Conclusion: V-GIB provides a principled route to geometrically coherent, data-efficient representations aligned with human-understandable structure, directly linking geometry to sample efficiency.
Abstract: We propose a unified information-geometric framework that formalizes understanding in learning as a trade-off between informativeness and geometric simplicity. An encoder phi is evaluated by U(phi) = I(phi(X); Y) - beta * C(phi), where C(phi) penalizes curvature and intrinsic dimensionality, enforcing smooth, low-complexity manifolds. Under mild manifold and regularity assumptions, we derive non-asymptotic bounds showing that generalization error scales with intrinsic dimension while curvature controls approximation stability, directly linking geometry to sample efficiency. To operationalize this theory, we introduce the Variational Geometric Information Bottleneck (V-GIB), a variational estimator that unifies mutual-information compression and curvature regularization through tractable geometric proxies such as the Hutchinson trace, Jacobian norms, and local PCA. Experiments across synthetic manifolds, few-shot settings, and real-world datasets (Fashion-MNIST, CIFAR-10) reveal a robust information-geometry Pareto frontier, stable estimators, and substantial gains in interpretive efficiency. Fractional-data experiments on CIFAR-10 confirm that curvature-aware encoders maintain predictive power under data scarcity, validating the predicted efficiency-curvature law. Overall, V-GIB provides a principled and measurable route to representations that are geometrically coherent, data-efficient, and aligned with human-understandable structure.
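Of the geometric proxies named above, the Hutchinson trace estimator is the easiest to show in isolation. It estimates tr(A) as the average of z^T A z over random sign vectors z, using only matrix-vector products. The sketch below is a generic version of that estimator, not V-GIB's code. How the paper applies it (for example, to a Jacobian-based matrix of the encoder) is not specified in the abstract, and `hutchinson_trace` and its parameters are hypothetical names.

```python
import random


def hutchinson_trace(matvec, dim, num_probes=64, seed=0):
    """Estimate tr(A) from matrix-vector products with Rademacher probes.

    For each probe z with independent +/-1 entries, E[z^T A z] = tr(A).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        Az = matvec(z)
        total += sum(zi * ai for zi, ai in zip(z, Az))
    return total / num_probes


def diagonal_matvec(z):
    """Implicit 3x3 diagonal matrix diag(1, 2, 3), exposed only via products."""
    return [1.0 * z[0], 2.0 * z[1], 3.0 * z[2]]


def dense_matvec(z):
    """Implicit symmetric matrix [[2, 1], [1, 3]] with trace 5."""
    return [2.0 * z[0] + 1.0 * z[1], 1.0 * z[0] + 3.0 * z[1]]


diag_estimate = hutchinson_trace(diagonal_matvec, dim=3)
dense_estimate = hutchinson_trace(dense_matvec, dim=2, num_probes=2000)
```

For a diagonal matrix every probe returns the exact trace, since each squared sign equals 1. Off-diagonal entries add zero-mean noise that shrinks as the number of probes grows.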
[316] An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems
Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen
Main category: cs.LG
TL;DR: Proposes DRL with heterogeneous query (DRLHQ) as the first end-to-end learning approach for capacitated location-routing problems (CLRPs) and open CLRPs, using a novel attention mechanism that adapts to different decision stages.
Details
Motivation: CLRPs are challenging combinatorial optimization problems with complex constraints and intricate relationships between location and routing decisions. While DRL has been applied to vehicle routing problems, research on CLRPs remains unexplored.
Method: Uses encoder-decoder structure with Markov decision process formulation. Introduces heterogeneous querying attention mechanism to handle interdependency between location and routing decisions, dynamically adapting to various decision-making stages.
Result: Experimental results on synthetic and benchmark datasets show superior solution quality and better generalization performance compared to traditional and DRL-based baselines for both CLRP and OCLRP.
Conclusion: The proposed DRLHQ approach effectively solves CLRPs as the first end-to-end learning method, demonstrating strong performance and generalization capabilities through its novel attention mechanism.
Abstract: Capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization that require making location and routing decisions simultaneously. In CLRPs, the complex constraints and the intricate relationships between the various decisions make the problem challenging to solve. Deep reinforcement learning (DRL) has been extensively applied to the vehicle routing problem and its variants, while research on CLRPs remains to be explored. In this paper, we propose DRL with heterogeneous query (DRLHQ) to solve both CLRP and open CLRP (OCLRP). We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate the CLRPs as a Markov decision process tailored to various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency across location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to various decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.
[317] Causal Graph Neural Networks for Healthcare
Munib Mesinovic, Max Buhlan, Tingting Zhu
Main category: cs.LG
TL;DR: Causal graph neural networks address healthcare AI failures by learning invariant causal mechanisms instead of spurious correlations, enabling more robust and fair clinical applications.
Details
Motivation: Healthcare AI systems fail when deployed across institutions due to learning statistical associations rather than causal mechanisms, leading to performance drops and perpetuation of discriminatory patterns.
Method: Combines graph-based representations of biomedical data with causal inference principles, including structural causal models, disentangled causal representation learning, interventional prediction, and counterfactual reasoning on graphs.
Result: Establishes foundations for patient-specific Causal Digital Twins enabling in silico clinical experimentation, with applications in psychiatric diagnosis, cancer subtyping, physiological monitoring, and drug recommendation.
Conclusion: Substantial barriers remain including computational requirements, validation challenges, and risks of causal-washing. Proposes tiered frameworks to distinguish causally-inspired architectures from causally-validated discoveries and identifies critical research priorities.
Abstract: Healthcare artificial intelligence systems routinely fail when deployed across institutions, with documented performance drops and perpetuation of discriminatory patterns embedded in historical data. This brittleness stems, in part, from learning statistical associations rather than causal mechanisms. Causal graph neural networks address this triple crisis of distribution shift, discrimination, and inscrutability by combining graph-based representations of biomedical data with causal inference principles to learn invariant mechanisms rather than spurious correlations. This Review examines methodological foundations spanning structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs. We analyse applications demonstrating clinical value across psychiatric diagnosis through brain network analysis, cancer subtyping via multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias. These advances establish foundations for patient-specific Causal Digital Twins, enabling in silico clinical experimentation, with integration of large language models for hypothesis generation and causal graph neural networks for mechanistic validation. Substantial barriers remain, including computational requirements precluding real-time deployment, validation challenges demanding multi-modal evidence triangulation beyond cross-validation, and risks of causal-washing where methods employ causal terminology without rigorous evidentiary support. We propose tiered frameworks distinguishing causally-inspired architectures from causally-validated discoveries and identify critical research priorities making causal rather than purely associational claims.
[318] Rawlsian many-to-one matching with non-linear utility
Hortence Nana, Andreas Athanasopoulos, Christos Dimitrakakis
Main category: cs.LG
TL;DR: The paper addresses many-to-one matching with non-linear utility functions for diversity, where classical stable matchings may not exist, and proposes Rawlsian fairness-based solution concepts with deterministic and stochastic algorithms.
Details
Motivation: Classical stable matchings fail to exist when colleges evaluate student sets through non-linear utility functions that capture diversity, necessitating alternative solution concepts.
Method: Proposed Rawlsian fairness-based solution concepts and designed both deterministic and stochastic algorithms that iteratively improve the outcome of the worst-off college.
Result: The algorithms provide a practical approach to fair allocation in many-to-one matching problems where stability cannot be guaranteed.
Conclusion: Rawlsian fairness offers a viable alternative to stability in diverse many-to-one matching problems, with proposed algorithms enabling practical implementation.
Abstract: We study a many-to-one matching problem, such as the college admission problem, where each college can admit multiple students. Unlike classical models, colleges evaluate sets of students through non-linear utility functions that capture diversity between them. In this setting, we show that classical stable matchings may fail to exist. To address this, we propose alternative solution concepts based on Rawlsian fairness, aiming to maximize the minimum utility across colleges. We design both deterministic and stochastic algorithms that iteratively improve the outcome of the worst-off college, offering a practical approach to fair allocation when stability cannot be guaranteed.
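The maximin objective and the "improve the worst-off college" loop can be sketched as a simple local search. This is an illustration only and should not be read as either of the paper's algorithms. The utility (number of distinct student types admitted) is a made-up non-linear diversity measure, capacities and student preferences are ignored, and the names `utility` and `rawlsian_local_search` are hypothetical.

```python
def utility(students, types):
    """Hypothetical non-linear utility: number of distinct types in the admitted set."""
    return len({types[s] for s in students})


def min_utility(assignment, types):
    return min(utility(group, types) for group in assignment.values())


def rawlsian_local_search(assignment, types):
    """Move one student at a time to the worst-off college.

    A move is accepted only if it strictly raises the minimum utility, so the
    loop terminates (the minimum is bounded by the number of types).
    """
    assignment = {college: set(group) for college, group in assignment.items()}
    improved = True
    while improved:
        improved = False
        current = min_utility(assignment, types)
        worst = min(assignment, key=lambda c: utility(assignment[c], types))
        for donor in assignment:
            if donor == worst:
                continue
            for student in sorted(assignment[donor]):
                trial = {c: set(g) for c, g in assignment.items()}
                trial[donor].remove(student)
                trial[worst].add(student)
                if min_utility(trial, types) > current:
                    assignment = trial
                    improved = True
                    break
            if improved:
                break
    return assignment


types = {"a": "x", "b": "y", "c": "x", "d": "y", "e": "z"}
start = {"A": {"a", "b", "d", "e"}, "B": {"c"}}
final = rawlsian_local_search(start, types)
```

In this toy instance college B starts with one type and ends with two, while college A keeps enough diversity that the minimum across colleges rises from 1 to 2.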
[319] Theoretical Guarantees for Causal Discovery on Large Random Graphs
Mathieu Chevalley, Arash Mehrjou, Patrick Schwab
Main category: cs.LG
TL;DR: The paper provides theoretical guarantees for false-negative rates in causal discovery under random interventions, showing concentration of FNR around its mean in sparse Erdős-Rényi and scale-free graphs, with stronger concentration in scale-free networks.
Details
Motivation: To establish theoretical guarantees for causal structure recovery under realistic conditions, addressing the challenge of high dimensionality and network heterogeneity in causal discovery.
Method: Theoretical analysis of false-negative rate (FNR) concentration under single-variable random interventions and ε-interventional faithfulness assumption, applied to sparse Erdős-Rényi DAGs and generalized Barabási-Albert graphs.
Result: FNR concentrates around its mean at rate O(log d/√d) for Erdős-Rényi graphs, and exhibits even stronger concentration (vanishing deviation) in scale-free graphs when degree exponent γ > 3.
Conclusion: Realistic scale-free topologies intrinsically regularize causal discovery, reducing variability in orientation error, and high dimensionality does not necessarily hinder accurate causal structure recovery.
Abstract: We investigate theoretical guarantees for the false-negative rate (FNR), the fraction of true causal edges whose orientation is not recovered, under single-variable random interventions and an $\epsilon$-interventional faithfulness assumption that accommodates latent confounding. For sparse Erdős–Rényi directed acyclic graphs, where the edge probability scales as $p_e = \Theta(1/d)$, we show that the FNR concentrates around its mean at rate $O(\frac{\log d}{\sqrt d})$, implying that large deviations above the expected error become exponentially unlikely as dimensionality increases. This concentration ensures that derived upper bounds hold with high probability in large-scale settings. Extending the analysis to generalized Barabási–Albert graphs reveals an even stronger phenomenon: when the degree exponent satisfies $\gamma > 3$, the deviation width scales as $O(d^{\beta - \frac{1}{2}})$ with $\beta = 1/(\gamma - 1) < \frac{1}{2}$, and hence vanishes in the limit. This demonstrates that realistic scale-free topologies intrinsically regularize causal discovery, reducing variability in orientation error. These finite-dimension results provide the first dimension-adaptive, faithfulness-robust guarantees for causal structure recovery, and challenge the intuition that high dimensionality and network heterogeneity necessarily hinder accurate discovery. Our simulation results corroborate these theoretical predictions, showing that the FNR indeed concentrates and often vanishes in practice as dimensionality grows.
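The step from the degree-exponent condition to a vanishing deviation width is short but worth spelling out. The chain below uses only the quantities stated in the abstract; the numerical value of $\gamma$ in the comment is an illustrative choice, not one taken from the paper.

```latex
\[
\gamma > 3
\;\Longrightarrow\; \gamma - 1 > 2
\;\Longrightarrow\; \beta = \frac{1}{\gamma - 1} < \frac{1}{2}
\;\Longrightarrow\; \beta - \frac{1}{2} < 0
\;\Longrightarrow\; d^{\,\beta - \frac{1}{2}} \to 0 \quad (d \to \infty).
\]
% Illustrative value: \gamma = 4 gives \beta = 1/3, so the width is O(d^{-1/6}).
```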
[320] Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning
Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji
Main category: cs.LG
TL;DR: Proposes a new neighborhood constraint for offline RL that restricts actions to neighborhoods of dataset actions, addressing limitations of existing constraints while avoiding behavior policy modeling.
Details
Motivation: Existing offline RL constraints (density, support, sample) have limitations: density/sample constraints are overly conservative, while support constraint struggles with accurate behavior policy modeling.
Method: Develops Adaptive Neighborhood-constrained Q learning (ANQ) using a bilevel optimization framework to implement neighborhood constraints that adapt radius based on data quality.
Result: ANQ achieves state-of-the-art performance on standard offline RL benchmarks and shows strong robustness with noisy or limited data.
Conclusion: The neighborhood constraint effectively bounds extrapolation errors and distribution shift while providing substantial flexibility and pointwise conservatism, outperforming existing constraint approaches.
Abstract: Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
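The core of the neighborhood constraint, restricting the maximization in the Bellman target to actions near some dataset action, can be shown on a 1-D action space. The sketch below is illustrative and is not the ANQ algorithm: it uses a fixed radius rather than a per-sample adaptive one, replaces the bilevel optimization with a brute-force search over a candidate grid, and uses a made-up Q-function. The name `neighborhood_constrained_target` is hypothetical.

```python
def neighborhood_constrained_target(q_value, dataset_actions, candidate_actions, radius):
    """Maximize q_value over candidates within `radius` of any dataset action.

    Assumes at least one candidate falls inside the union of neighborhoods
    (for example, because the dataset actions themselves are candidates).
    """
    allowed = [
        a for a in candidate_actions
        if any(abs(a - d) <= radius for d in dataset_actions)
    ]
    best = max(allowed, key=q_value)
    return best, q_value(best)


def toy_q(action):
    """Made-up Q-function that (wrongly) keeps rising for unseen large actions."""
    return action


dataset_actions = [0.0, 0.2]                 # actions observed at this state
grid = [i / 10 for i in range(11)]           # candidate actions 0.0, 0.1, ..., 1.0

unconstrained_action = max(grid, key=toy_q)
constrained_action, constrained_value = neighborhood_constrained_target(
    toy_q, dataset_actions, grid, radius=0.15
)
```

The unconstrained maximizer chases the extrapolated value at the far end of the grid. The constrained one may step slightly beyond the observed actions but no further than the radius allows. Shrinking the radius toward zero recovers a sample-style constraint.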
[321] Dynamic Priors in Bayesian Optimization for Hyperparameter Optimization
Lukas Fehring, Marcel Wever, Maximilian Spliethöver, Leona Hennig, Henning Wachsmuth, Marius Lindauer
Main category: cs.LG
TL;DR: A novel method for interactive hyperparameter optimization that allows users to steer Bayesian optimization with expert knowledge during runtime via prior distributions, with protection against misleading inputs.
Details
Motivation: HPO lacks user control and acceptance due to its black-box nature; existing approaches don't allow online steering during optimization.
Method: Generalizes πBO to enable repeated interventions via user-specified prior distributions, with misleading prior detection scheme.
Result: Effectively incorporates multiple priors, leveraging informative ones while rejecting/overcoming misleading priors, achieving competitiveness with unperturbed BO.
Conclusion: The method successfully enables user control in HPO while maintaining performance through protection against harmful inputs.
Abstract: Hyperparameter optimization (HPO), for example, based on Bayesian optimization (BO), supports users in designing models well-suited for a given dataset. HPO has proven its effectiveness on several applications, ranging from classical machine learning for tabular data to deep neural networks for computer vision and transformers for natural language processing. However, HPO still sometimes lacks acceptance by machine learning experts due to its black-box nature and limited user control. Addressing this, first approaches have been proposed to initialize BO methods with expert knowledge. However, these approaches do not allow for online steering during the optimization process. In this paper, we introduce a novel method that enables repeated interventions to steer BO via user input, specifying expert knowledge and user preferences at runtime of the HPO process in the form of prior distributions. To this end, we generalize an existing method, $\pi$BO, preserving theoretical guarantees. We also introduce a misleading prior detection scheme, which allows protection against harmful user inputs. In our experimental evaluation, we demonstrate that our method can effectively incorporate multiple priors, leveraging informative priors, whereas misleading priors are reliably rejected or overcome. Thereby, we achieve performance competitive with unperturbed BO.
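As background for the generalization described above, the original πBO rule (as published separately, not stated in this abstract) multiplies the acquisition value by the user prior raised to a power that decays with the iteration count, so the prior dominates early and fades later. The sketch below shows only that base weighting with made-up numbers. It does not implement the paper's repeated interventions or its misleading-prior detection, and `prior_weighted_acquisition` is a hypothetical name.

```python
def prior_weighted_acquisition(acq_value, prior_density, beta, n):
    """πBO-style weighting: acquisition times prior^(beta / n), for iteration n >= 1."""
    return acq_value * prior_density ** (beta / n)


def best_candidate(candidates, beta, n):
    """candidates: dict name -> (acquisition value, user prior density)."""
    return max(
        candidates,
        key=lambda name: prior_weighted_acquisition(*candidates[name], beta, n),
    )


# Made-up values: the model slightly prefers one region, the user prior another.
candidates = {
    "model_favored": (1.0, 0.1),
    "prior_favored": (0.8, 0.9),
}

early_pick = best_candidate(candidates, beta=10.0, n=1)
late_pick = best_candidate(candidates, beta=10.0, n=1000)
```

Early on the exponent is large and the user prior decides the pick. After many iterations the exponent approaches zero, the weight approaches 1, and the unmodified acquisition function takes over. This decay is what lets a misleading prior be overcome eventually.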
[322] Directional-Clamp PPO
Gilad Karpel, Ruida Zhou, Shoham Sabach, Mohammad Ghavamzadeh
Main category: cs.LG
TL;DR: Proposes DClamp-PPO, a PPO variant that penalizes “wrong” direction updates where importance ratios move opposite to advantage signals, improving performance over standard PPO across MuJoCo environments.
Details
Motivation: Standard PPO focuses on clipping “right” direction updates but overlooks that importance ratios frequently move in “wrong” directions during optimization due to randomness, which hinders performance improvement.
Method: Introduces directional clamping that penalizes actions going to strict “wrong” direction regions (where advantage is positive but ratio falls below 1-β, or advantage negative but ratio above 1+β) by enforcing steeper loss slopes.
Result: DClamp-PPO consistently outperforms PPO and its variants across various MuJoCo environments with different random seeds, better avoiding “wrong” direction updates while keeping importance ratios closer to 1.
Conclusion: Addressing “wrong” direction updates is crucial for PPO improvement, and DClamp-PPO provides an effective solution through directional clamping that complements existing “right” direction clipping approaches.
Abstract: Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move to the “right” direction – starting from importance sampling ratios equal to 1, increasing the ratios for actions with positive advantages and decreasing those with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these “right” direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective’s behavior by altering the clipping in the “right” direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move to the “wrong” direction during the PPO optimization. This is a key factor hindering the improvement of PPO, but it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes the actions going to the strict “wrong” direction regions, where the advantage is positive (negative) and importance ratio falls below (above) $1 - \beta$ ($1+\beta$), for a tunable parameter $\beta \in (0, 1)$. The penalty is by enforcing a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as its variants, by focusing on modifying the objective’s behavior in the “right” direction, across various MuJoCo environments, using different random seeds. The proposed method is shown, both theoretically and empirically, to better avoid “wrong” direction updates while keeping the importance ratio closer to 1.
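The directional clamp described above can be sketched per sample as follows. The abstract specifies only the regions and that the slope is steeper there; the piecewise-linear form, the slope multiplier `alpha`, and the function names are assumptions made for illustration, not the paper's exact objective.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def dclamp_objective(ratio, advantage, eps=0.2, beta=0.1, alpha=2.0):
    """PPO surrogate with a steeper slope in the strict "wrong" direction regions.

    alpha > 1 is a hypothetical slope multiplier. The objective stays
    continuous at the region boundaries 1 - beta and 1 + beta.
    """
    if advantage > 0 and ratio < 1.0 - beta:
        edge = 1.0 - beta  # positive advantage, ratio moved down
        return edge * advantage + alpha * (ratio - edge) * advantage
    if advantage < 0 and ratio > 1.0 + beta:
        edge = 1.0 + beta  # negative advantage, ratio moved up
        return edge * advantage + alpha * (ratio - edge) * advantage
    return ppo_clip_objective(ratio, advantage, eps)
```

Inside the “wrong” direction regions the sketch returns a lower objective than plain PPO, so gradient ascent pushes the ratio back toward 1 more strongly; elsewhere it is identical to the clipped surrogate.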
[323] Natural-gas storage modelling by deep reinforcement learning
Tiziano Balaconi, Aldo Glielmo, Marco Taboga
Main category: cs.LG
TL;DR: GasRL is a simulator combining natural gas market modeling with deep RL-trained storage policies; policies trained with Soft Actor Critic (SAC) induce price dynamics whose volatility and seasonality match real-world prices without explicit calibration to price data.
Details
Motivation: To analyze how optimal stockpile management affects equilibrium prices and market dynamics in natural gas markets, and to assess policy impacts like EU storage thresholds.
Method: Developed GasRL simulator coupling natural gas market representation with deep RL-trained storage policies, tested various RL algorithms including Soft Actor Critic (SAC).
Result: SAC achieved multiple objectives: profitability, robust market clearing, price stabilization. SAC-derived policies produced price dynamics matching real-world volatility and seasonality without explicit calibration. EU storage thresholds improved market resilience against supply shocks.
Conclusion: RL-based storage management can effectively replicate real market behaviors, and policy interventions like storage thresholds enhance market stability against unexpected supply disruptions.
Abstract: We introduce GasRL, a simulator that couples a calibrated representation of the natural gas market with a model of storage-operator policies trained with deep reinforcement learning (RL). We use it to analyse how optimal stockpile management affects equilibrium prices and the dynamics of demand and supply. We test various RL algorithms and find that Soft Actor Critic (SAC) exhibits superior performance in the GasRL environment: multiple objectives of storage operators - including profitability, robust market clearing and price stabilisation - are successfully achieved. Moreover, the equilibrium price dynamics induced by SAC-derived optimal policies have characteristics, such as volatility and seasonality, that closely match those of real-world prices. Remarkably, this adherence to the historical distribution of prices is obtained without explicitly calibrating the model to price data. We show how the simulator can be used to assess the effects of EU-mandated minimum storage thresholds. We find that such thresholds have a positive effect on market resilience against unanticipated shifts in the distribution of supply shocks. For example, with unusually large shocks, market disruptions are averted more often if a threshold is in place.
[324] A Large Language Model for Corporate Credit Scoring
Chitro Majumdar, Sergio Scandizzo, Ratanlal Mahanta, Avradip Mandal, Swarnendu Bhattacharjee
Main category: cs.LG
TL;DR: Omega^2 is an LLM-driven framework for corporate credit scoring that combines structured financial data with machine learning to improve predictive reliability and interpretability across multiple rating agencies.
Details
Motivation: To create a more reliable and interpretable corporate credit scoring system that can generalize across different rating agencies and maintain temporal consistency.
Method: Integrates CatBoost, LightGBM, and XGBoost models optimized through Bayesian search under temporal validation, using structured financial data (leverage, profitability, liquidity ratios) from 7,800 corporate credit ratings across multiple agencies.
Result: Achieved mean test AUC above 0.93 across agencies, demonstrating strong generalization across rating systems and temporal consistency.
Conclusion: Combining language-based reasoning with quantitative learning creates a transparent and institution-grade foundation for reliable corporate credit-risk assessment.
Abstract: We introduce Omega^2, a Large Language Model-driven framework for corporate credit scoring that combines structured financial data with advanced machine learning to improve predictive reliability and interpretability. Our study evaluates Omega^2 on a multi-agency dataset of 7,800 corporate credit ratings drawn from Moody’s, Standard & Poor’s, Fitch, and Egan-Jones, each containing detailed firm-level financial indicators such as leverage, profitability, and liquidity ratios. The system integrates CatBoost, LightGBM, and XGBoost models optimized through Bayesian search under temporal validation to ensure forward-looking and reproducible results. Omega^2 achieved a mean test AUC above 0.93 across agencies, confirming its ability to generalize across rating systems and maintain temporal consistency. These results show that combining language-based reasoning with quantitative learning creates a transparent and institution-grade foundation for reliable corporate credit-risk assessment.
[325] Neural Network Interoperability Across Platforms
Nadia Daoudi, Ivan Alfonso, Jordi Cabot
Main category: cs.LG
TL;DR: Proposes an automated approach to migrate neural network code across deep learning frameworks using a pivot model abstraction, validated on PyTorch and TensorFlow with successful functional equivalence.
Details
Motivation: Manual migration of NN implementations across libraries is challenging and time-consuming due to lack of specialized migration approaches, leading to outdated implementations and compatibility issues.
Method: Uses a pivot neural network model to create an abstraction of the NN prior to migration, enabling automated code migration between frameworks.
Result: Successfully migrated five neural networks between PyTorch and TensorFlow, producing functionally equivalent implementations.
Conclusion: The approach effectively automates NN code migration across frameworks, addressing the challenge of framework switching while maintaining functional equivalence.
Abstract: The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate neural network code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Artefacts from our work are available online.
[326] Apriel-H1: Towards Efficient Enterprise Reasoning Models
Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier, Shruthan Radhakrishna, Soham Parikh, Shambhavi Mishra, Sebastien Paquet, Srinivas Sunkara, Valérie Bécaert, Sathwik Tejaswi Madhusudhan, Torsten Scholak
Main category: cs.LG
TL;DR: Hybrid LLMs combining transformer attention with SSM sequence mixers achieve over 2x higher inference throughput with minimal reasoning performance degradation.
Details
Motivation: Transformers suffer from quadratic complexity and high memory usage during inference, limiting throughput for agentic tasks and long-context reasoning. SSMs offer linear complexity and constant memory footprint.
Method: Incremental distillation from pretrained transformer, progressively replacing less critical attention layers with linear Mamba blocks to create hybrid SSM-Transformer models.
Result: Apriel-H1-15B models achieve over 2x higher inference throughput in production with minimal reasoning degradation. Performance degrades gradually as more attention layers are replaced with Mamba blocks.
Conclusion: Distilled hybrid SSM-Transformer architectures deliver substantial efficiency gains over pure transformers without significantly compromising reasoning quality.
Abstract: Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2x higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without substantially compromising the reasoning quality.
[327] A Non-Adversarial Approach to Idempotent Generative Modelling
Mohammed Al-Jaff, Giovanni Luca Marchetti, Michael C Welle, Jens Lundell, Mats G. Gustafsson, Gustav Eje Henter, Hossein Azizpour, Danica Kragic
Main category: cs.LG
TL;DR: NAIGNs improve IGNs by replacing adversarial training with IMLE, addressing mode collapse and instability while learning manifold distance fields and energy-based models.
Details
Motivation: IGNs suffer from mode collapse, mode dropping, and training instability due to adversarial components, limiting their ability to fully cover the data manifold.
Method: Combine reconstruction loss with non-adversarial Implicit Maximum Likelihood Estimation (IMLE) objective to create NAIGNs.
Result: NAIGNs improve data restoration and generation quality, implicitly learn distance fields to data manifold, and function as energy-based models.
Conclusion: Non-adversarial approach in NAIGNs successfully addresses IGN limitations while providing additional benefits like manifold distance learning.
Abstract: Idempotent Generative Networks (IGNs) are deep generative models that also function as local data manifold projectors, mapping arbitrary inputs back onto the manifold. They are trained to act as identity operators on the data and as idempotent operators off the data manifold. However, IGNs suffer from mode collapse, mode dropping, and training instability due to their objectives, which contain adversarial components and can cause the model to cover the data manifold only partially – an issue shared with generative adversarial networks. We introduce Non-Adversarial Idempotent Generative Networks (NAIGNs) to address these issues. Our loss function combines reconstruction with the non-adversarial generative objective of Implicit Maximum Likelihood Estimation (IMLE). This improves on IGN’s ability to restore corrupted data and generate new samples that closely match the data distribution. We moreover demonstrate that NAIGNs implicitly learn the distance field to the data manifold, as well as an energy-based model.
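The combination of a reconstruction term with an IMLE term can be sketched as below. IMLE pulls, for each data point, the nearest generated sample toward it, which encourages coverage of every mode without a discriminator. The loss weighting, the function names, and the toy projector used as a stand-in model are assumptions for illustration; the actual NAIGN objective and architecture are not given in the abstract.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def imle_term(data, generated):
    """Mean over data points of the squared distance to the nearest generated sample."""
    return sum(min(sq_dist(x, g) for g in generated) for x in data) / len(data)

def reconstruction_term(model, data):
    """Mean squared error of the model acting as identity on the data."""
    return sum(sq_dist(model(x), x) for x in data) / len(data)

def combined_loss(model, data, latents, weight=1.0):
    """Reconstruction plus weighted IMLE term (weight is a hypothetical knob)."""
    generated = [model(z) for z in latents]
    return reconstruction_term(model, data) + weight * imle_term(data, generated)

# Toy stand-in for a trained model: projection onto the unit circle,
# which is idempotent and acts as the identity on data lying on the circle.
def project_to_circle(v):
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

data = [(1.0, 0.0), (0.0, 1.0)]
latents = [(2.0, 0.0), (0.0, 3.0), (-1.0, 0.0)]
loss = combined_loss(project_to_circle, data, latents)
```

In this toy setting every data point is matched exactly by some generated sample, so the loss is zero; dropping the latents that map near a data point raises the IMLE term, which is the signal that counteracts mode dropping.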
[328] In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization
Cooper Simpson, Stephen Becker, Alireza Doostan
Main category: cs.LG
TL;DR: Novel in situ training protocol using implicit neural representations with memory buffers of full and sketched data to prevent catastrophic forgetting, achieving high compression rates for complex simulation data.
Details
Motivation: To enable in situ neural compression using implicit neural representation-based hypernetworks while preventing catastrophic forgetting during continual learning.
Method: Uses limited memory buffers containing both full and sketched data samples, with sketching serving as a regularizer based on Johnson-Lindenstrauss theory to prevent forgetting.
Result: Achieves strong reconstruction performance at high compression rates on complex 2D/3D simulation data over long time horizons and unstructured grids, matching offline method performance.
Conclusion: Sketching enables effective in situ training that approximates offline method performance while preventing catastrophic forgetting in continual learning scenarios.
Abstract: Focusing on implicit neural representations, we present a novel in situ training protocol that employs limited memory buffers of full and sketched data samples, where the sketched data are leveraged to prevent catastrophic forgetting. The theoretical motivation for our use of sketching as a regularizer is presented via a simple Johnson-Lindenstrauss-informed result. While our methods may be of wider interest in the field of continual learning, we specifically target in situ neural compression using implicit neural representation-based hypernetworks. We evaluate our method on a variety of complex simulation data in two and three dimensions, over long time horizons, and across unstructured grids and non-Cartesian geometries. On these tasks, we show strong reconstruction performance at high compression rates. Most importantly, we demonstrate that sketching enables the presented in situ scheme to approximately match the performance of the equivalent offline method.
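The Johnson-Lindenstrauss intuition behind sketching as a regularizer can be illustrated numerically: a random linear sketch approximately preserves squared distances, so a reconstruction error measured against a small stored sketch of past data approximates the error against the full data. This is a minimal sketch assuming a dense Gaussian sketching matrix; the paper's actual sketching operator and loss are not specified in the abstract, and all names and sizes below are hypothetical.

```python
import random

def make_sketch_matrix(k, d, seed=0):
    """k x d matrix with i.i.d. N(0, 1/k) entries (a Gaussian JL sketch)."""
    rng = random.Random(seed)
    scale = (1.0 / k) ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def apply_sketch(S, v):
    """Matrix-vector product S @ v in pure Python."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy data: a past snapshot y and an imperfect reconstruction y_hat.
rng = random.Random(1)
d, k = 2000, 200
y = [rng.gauss(0.0, 1.0) for _ in range(d)]
y_hat = [v + 0.1 * rng.gauss(0.0, 1.0) for v in y]

S = make_sketch_matrix(k, d, seed=0)
full_error = sq_dist(y, y_hat)                                   # needs all d values
sketched_error = sq_dist(apply_sketch(S, y), apply_sketch(S, y_hat))  # needs only k values of y
ratio = sketched_error / full_error
```

Only the k-dimensional sketch of `y` has to be kept in the memory buffer, which is the source of the storage saving; the ratio of sketched to full error concentrates around 1 as k grows.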
[329] Recursively Enumerably Representable Classes and Computable Versions of the Fundamental Theorem of Statistical Learning
David Kattermann, Lothar Sebastian Krapp
Main category: cs.LG
TL;DR: The paper investigates computable PAC (CPAC) learning, showing that the traditional VC-dimension characterization fails in the computable setting, but effective VC-dimension and recursively enumerable representable (RER) classes provide alternative characterizations.
Details
Motivation: To understand how computability constraints affect PAC learning theory, particularly why the Fundamental Theorem of Statistical Learning breaks down when learners must be computable functions, and to find alternative characterizations.
Method: Analyze the relationship between CPAC learning and RER classes, study effective VC-dimensions, and examine various notions of CPAC learning including nonuniform CPAC learning.
Result: Effective VC-dimensions can be arbitrarily larger than traditional VC-dimension even for RER classes; CPAC learnability can be characterized via RER class containment; agnostic learnability is achievable for RER classes through nonuniform CPAC learning.
Conclusion: Computable PAC learning requires new theoretical frameworks beyond traditional VC-dimension, with RER classes and effective VC-dimensions providing key insights and characterizations for learnability in the computable setting.
Abstract: We study computable probably approximately correct (CPAC) learning, where learners are required to be computable functions. It had been previously observed that the Fundamental Theorem of Statistical Learning, which characterizes PAC learnability by finiteness of the Vapnik-Chervonenkis (VC-)dimension, no longer holds in this framework. Recent works recovered analogs of the Fundamental Theorem in the computable setting, for instance by introducing an effective VC-dimension. Guided by this, we investigate the connection between CPAC learning and recursively enumerable representable (RER) classes, whose members can be algorithmically listed. Our results show that the effective VC-dimensions can take arbitrary values above the traditional one, even for RER classes, which creates a whole family of (non-)examples for various notions of CPAC learning. Yet the two dimensions coincide for classes satisfying sufficiently strong notions of CPAC learning. We then observe that CPAC learnability can also be characterized via containment of RER classes that realize the same samples. Furthermore, it is shown that CPAC learnable classes satisfying a unique identification property are necessarily RER. Finally, we establish that agnostic learnability can be guaranteed for RER classes, by considering the relaxed notion of nonuniform CPAC learning.
[330] Scalable Evaluation and Neural Models for Compositional Generalization
Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi
Main category: cs.LG
TL;DR: This paper addresses compositional generalization in machine learning by introducing a rigorous evaluation framework, extensive benchmarking of vision backbones, and Attribute Invariant Networks that achieve significant accuracy improvements with minimal parameter overhead.
Details
Motivation: Current evaluation protocols for compositional generalization lack standardization and rigor, while general-purpose vision architectures lack necessary inductive biases and existing approaches compromise scalability.
Method: 1) Developed a rigorous evaluation framework that unifies previous approaches with constant computational requirements; 2) Conducted extensive evaluation training over 5000 models; 3) Proposed Attribute Invariant Networks as a new model class.
Result: Attribute Invariant Networks established a new Pareto frontier with 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled models.
Conclusion: The paper provides a comprehensive solution to compositional generalization challenges through improved evaluation protocols and efficient model architectures that balance performance and scalability.
Abstract: Compositional generalization, a key open challenge in modern machine learning, requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them with such biases compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation on the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts.
[331] Nesterov-Accelerated Robust Federated Learning Over Byzantine Adversaries
Lihan Xu, Yanjie Dong, Gang Wang, Runhao Zeng, Xiaoyi Fan, Xiping Hu
Main category: cs.LG
TL;DR: Proposes Byrd-NAFL, a Byzantine-resilient federated learning algorithm that combines Nesterov’s momentum with robust aggregation to achieve fast and secure convergence against malicious workers.
Details
Motivation: To address the dual challenges of communication efficiency and robustness in federated learning systems where Byzantine adversaries can exhibit arbitrary malicious behaviors.
Method: Integrates Nesterov’s momentum into federated learning with Byzantine-resilient aggregation rules to accelerate convergence while safeguarding against gradient corruption from malicious workers.
Result: Establishes finite-time convergence guarantees under non-convex smooth loss functions and demonstrates superior performance over benchmarks in convergence speed, accuracy, and resilience to various Byzantine attacks.
Conclusion: Byrd-NAFL effectively combines acceleration techniques with Byzantine resilience, providing a practical solution for robust and efficient federated learning in adversarial environments.
Abstract: We investigate robust federated learning, where a group of workers collaboratively train a shared model under the orchestration of a central server in the presence of Byzantine adversaries capable of arbitrary and potentially malicious behaviors. To simultaneously enhance communication efficiency and robustness against such adversaries, we propose a Byzantine-resilient Nesterov-Accelerated Federated Learning (Byrd-NAFL) algorithm. Byrd-NAFL seamlessly integrates Nesterov’s momentum into the federated learning process alongside Byzantine-resilient aggregation rules to achieve fast and safeguarding convergence against gradient corruption. We establish a finite-time convergence guarantee for Byrd-NAFL under non-convex and smooth loss functions with relaxed assumption on the aggregated gradients. Extensive numerical experiments validate the effectiveness of Byrd-NAFL and demonstrate the superiority over existing benchmarks in terms of convergence speed, accuracy, and resilience to diverse Byzantine attack strategies.
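One way to picture the combination of Nesterov momentum with Byzantine-resilient aggregation is the sketch below. The abstract does not name the aggregation rule, so coordinate-wise median is swapped in here as a common robust choice; the update form, function names, step sizes, and the toy quadratic objective are likewise assumptions, not the Byrd-NAFL algorithm itself.

```python
from statistics import median

def coordinate_median(grads):
    """Coordinate-wise median of a list of gradient vectors."""
    return [median(coords) for coords in zip(*grads)]

def nesterov_robust_step(x, v, worker_grad_fns, lr=0.1, momentum=0.9):
    """One server step: workers evaluate gradients at the look-ahead point,
    the server aggregates them robustly and applies a momentum update."""
    lookahead = [xi + momentum * vi for xi, vi in zip(x, v)]
    grads = [grad_fn(lookahead) for grad_fn in worker_grad_fns]
    agg = coordinate_median(grads)
    v_new = [momentum * vi - lr * gi for vi, gi in zip(v, agg)]
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new

# Toy problem: minimize 0.5 * ||x||^2, whose gradient is x itself.
def honest(point):
    return list(point)

def byzantine(point):
    # Malicious worker: reports a huge, meaningless gradient.
    return [1e6 for _ in point]

workers = [honest, honest, honest, byzantine, byzantine]  # honest majority
x, v = [5.0, -3.0], [0.0, 0.0]
for _ in range(200):
    x, v = nesterov_robust_step(x, v, workers)
final_norm = sum(xi * xi for xi in x) ** 0.5
```

With three honest workers out of five, the median ignores the corrupted reports in every coordinate and the iterate converges to the minimizer at the origin; plain averaging of the same reports would diverge.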
[332] STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation
Bum Chul Kwon, Ben Shapira, Moshiko Raboh, Shreyans Sethi, Shruti Murarka, Joseph A Morrone, Jianying Hu, Parthasarathy Suryanarayanan
Main category: cs.LG
TL;DR: STAR-VAE is a Transformer-based variational autoencoder for molecular generation that uses SELFIES representations to ensure validity, enables property-guided conditional generation, and supports efficient finetuning with LoRA adapters.
Details
Motivation: The chemical space is vast, requiring generative models that can learn broad distributions, enable conditional generation based on properties, and provide fast molecular generation.
Method: Transformer-based VAE with SELFIES encoding, trained on 79M PubChem molecules, using latent-variable formulation for conditional generation and LoRA for efficient finetuning.
Result: Matches or exceeds baselines on GuacaMol and MOSES benchmarks, shows smooth latent representations, and shifts docking-score distributions toward stronger binding on Tartarus benchmarks.
Conclusion: A modernized VAE with principled conditioning and parameter-efficient finetuning remains competitive for molecular generation tasks.
Abstract: The chemical space of drug-like molecules is vast, motivating the development of generative models that must learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and provide fast molecular generation. Meeting the objectives depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address the challenges, we present STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation enables conditional generation: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient finetuning with low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with principled conditioning and parameter-efficient finetuning.
[333] Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs
Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla
Main category: cs.LG
TL;DR: Curriculum learning strategy that gradually tightens constraints during training to help agents master deployment requirements more efficiently.
Details
Motivation: Training agents to operate under strict deployment constraints (resource budgets, safety requirements) is challenging, especially when constraints make tasks complex.
Method: Curriculum learning inspired by self-paced learning, starting with simplified constraints and progressively introducing full deployment conditions. Applied to both RL and LLM agents.
Result: Theoretical analysis shows accelerated training; empirical validation demonstrates effectiveness across RL agents in binary-tree MDP and navigation tasks, and LLMs in math reasoning with output compression achieving inference speedup.
Conclusion: Curriculum design enhances efficiency and performance of agents under complex trajectory constraints, with practical benefits for resource-constrained deployment including LLM inference speedup.
Abstract: Training agents to operate under strict constraints during deployment, such as limited resource budgets or stringent safety requirements, presents significant challenges, especially when these constraints render the task complex. In this work, we propose a curriculum learning strategy that gradually tightens constraints during training, enabling the agent to incrementally master the deployment requirements. Inspired by self-paced learning techniques in unconstrained reinforcement learning (RL), our approach facilitates a smoother transition to challenging environments by initially training on simplified versions of the constraints and progressively introducing the full deployment conditions. We provide a theoretical analysis using an RL agent in a binary-tree Markov Decision Process (MDP) to demonstrate that our curriculum strategy can accelerate training relative to a baseline approach that imposes the trajectory constraints from the outset. Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary-tree MDP, a multi-task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment. Moreover, when applied to LLMs, our strategy enables compression of output chain-of-thought tokens, achieving a substantial inference speedup on consumer hardware, demonstrating its effectiveness for resource-constrained deployment.
[334] Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning
Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao
Main category: cs.LG
TL;DR: Twilight is a framework that adaptively prunes redundant tokens in attention mechanisms using top-p sampling, achieving up to 98% token pruning and significant speedups in long-context LLM decoding.
Details
Motivation: Current sparse attention and KV cache compression methods use fixed budgets, which fail to adapt to dynamic real-world scenarios where optimal accuracy-efficiency tradeoffs vary.
Method: Applies top-p sampling (nucleus sampling) to sparse attention to achieve adaptive budgeting, yielding the Twilight framework, which adds adaptive sparsity to existing sparse attention algorithms.
Result: Twilight adaptively prunes up to 98% of redundant tokens, achieving 15.4× acceleration in self-attention operations and 3.9× acceleration in end-to-end per token latency.
Conclusion: Twilight successfully brings adaptive sparsity to sparse attention algorithms without sacrificing accuracy, providing significant performance improvements for long-context LLM decoding.
Abstract: Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.
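The adaptive-budget idea in the abstract above can be illustrated with a minimal sketch: instead of keeping a fixed number of tokens, keep the smallest set whose softmax attention mass reaches a threshold p. The function name `top_p_token_selection` and the plain-list interface are assumptions made for illustration; this is not the Twilight implementation, and it omits the hierarchical pruning named in the title.

```python
import math


def top_p_token_selection(scores, p=0.9):
    """Return indices of the smallest token set whose attention mass reaches p."""
    # Numerically stable softmax over the raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Visit tokens from the highest to the lowest attention weight.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return sorted(kept)
```

With a peaked score vector the budget shrinks to a single token, while a flat one keeps every token, which is the kind of input-dependent budget a fixed top-k cannot provide.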
[335] Does Interpretability of Knowledge Tracing Models Support Teacher Decision Making?
Adia Khalid, Alina Deriyeva, Benjamin Paassen
Main category: cs.LG
TL;DR: Study investigates whether interpretable knowledge tracing models actually help teachers make better pedagogical decisions, finding that while teachers prefer interpretable models, they don’t necessarily lead to faster mastery in practice.
Details
Motivation: To examine if the interpretability of knowledge tracing models actually benefits human teachers in making pedagogical decisions, given that KT models are typically required to be interpretable for high-stakes teaching decisions.
Method: Conducted a simulation study comparing decisions based on interpretable vs non-interpretable KT models, then repeated with N=12 human teachers making decisions based on model information.
Result: Simulation showed interpretable models achieve mastery faster, but human teachers’ decisions showed no significant difference in tasks needed for mastery between model types, despite rating interpretable models higher in usability and trustworthiness.
Conclusion: The relationship between model interpretability and teacher decisions is complex - teachers don’t solely rely on KT models, and further research is needed to understand how learners and teachers actually use these models.
Abstract: Knowledge tracing (KT) models are a crucial basis for pedagogical decision-making, namely which task to select next for a learner and when to stop teaching a particular skill. Given the high stakes of pedagogical decisions, KT models are typically required to be interpretable, in the sense that they should implement an explicit model of human learning and provide explicit estimates of learners’ abilities. However, to our knowledge, no study to date has investigated whether the interpretability of KT models actually helps human teachers to make teaching decisions. We address this gap. First, we perform a simulation study to show that, indeed, decisions based on interpretable KT models achieve mastery faster compared to decisions based on a non-interpretable model. Second, we repeat the study but ask $N=12$ human teachers to make the teaching decisions based on the information provided by KT models. As expected, teachers rate interpretable KT models higher in terms of usability and trustworthiness. However, the number of tasks needed until mastery hardly differs between KT models. This suggests that the relationship between model interpretability and teacher decisions is not straightforward: teachers do not solely rely on KT models to make decisions and further research is needed to investigate how learners and teachers actually understand and use KT models.
[336] Training Language Models to Reason Efficiently
Daman Arora, Andrea Zanette
Main category: cs.LG
TL;DR: Using reinforcement learning to train large reasoning models to dynamically allocate inference-time compute based on task complexity, reducing computational costs while maintaining accuracy.
Details
Motivation: Large reasoning models with long chain-of-thoughts provide strong problem-solving capabilities but have high deployment costs due to longer generations. Reducing inference costs is crucial for economic feasibility, user experience, and environmental sustainability.
Method: Propose using reinforcement learning (RL) to train reasoning models to dynamically allocate inference-time compute based on task complexity, incentivizing models to minimize unnecessary computational overhead while maintaining accuracy.
Result: Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy. The method enables deriving a family of reasoning models with varying efficiency levels controlled via a single hyperparameter.
Conclusion: Training large reasoning models to reason efficiently through RL-based dynamic compute allocation achieves substantial efficiency gains while maintaining performance, addressing the high deployment costs of reasoning models.
Abstract: Scaling model size and training data has led to great advances in the performance of Large Language Models (LLMs). However, the diminishing returns of this approach necessitate alternative methods to improve model capabilities, particularly in tasks requiring advanced reasoning. Large reasoning models, which leverage long chain-of-thoughts, bring unprecedented breakthroughs in problem-solving capabilities but at a substantial deployment cost associated with longer generations. Reducing inference costs is crucial for the economic feasibility, user experience, and environmental sustainability of these models. In this work, we propose to train large reasoning models to reason efficiently. More precisely, we use reinforcement learning (RL) to train reasoning models to dynamically allocate inference-time compute based on task complexity. Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy, thereby achieving substantial efficiency gains. It enables the derivation of a family of reasoning models with varying efficiency levels, controlled via a single hyperparameter. Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
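The abstract above does not state the reward used for RL training. The sketch below is a purely hypothetical reward shape showing how a single hyperparameter (here called `alpha`) could trade accuracy against generation length; the function name, the linear length penalty, and the `max_tokens` cap are all assumptions, not the paper's formulation.

```python
def efficiency_reward(correct, num_tokens, max_tokens, alpha):
    """Hypothetical length-aware reward: correct answers earn less when longer."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    if not correct:
        # Wrong answers earn nothing, so brevity alone is never rewarded.
        return 0.0
    # Fraction of the token budget consumed, capped at 1.
    length_fraction = min(num_tokens, max_tokens) / max_tokens
    return 1.0 - alpha * length_fraction
```

Setting `alpha = 0` recovers a plain correctness reward, and larger values push the policy toward shorter correct generations, which is one way a family of models with different efficiency levels could be obtained.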
[337] Calibration improves detection of mislabeled examples
Ilies Chibane, Thomas George, Pierre Nodet, Vincent Lemaire
Main category: cs.LG
TL;DR: Calibrating base machine learning models improves accuracy and robustness in detecting mislabeled data instances.
Details
Motivation: Mislabeled data undermines ML system performance in real-world applications, creating a need for effective detection methods.
Method: Investigate the impact of calibrating base ML models used for mislabeling detection, which rely on trust scores from model probing.
Result: Empirical results show calibration methods improve accuracy and robustness of mislabeled instance detection.
Conclusion: Model calibration provides a practical and effective solution for industrial applications dealing with mislabeled data.
Abstract: Mislabeled data is a pervasive issue that undermines the performance of machine learning systems in real-world applications. An effective approach to mitigate this problem is to detect mislabeled instances and subject them to special treatment, such as filtering or relabeling. Automatic mislabeling detection methods typically rely on training a base machine learning model and then probing it for each instance to obtain a trust score that each provided label is genuine or incorrect. The properties of this base model are thus of paramount importance. In this paper, we investigate the impact of calibrating this model. Our empirical results show that using calibration methods improves the accuracy and robustness of mislabeled instance detection, providing a practical and effective solution for industrial applications.
[338] TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models
Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu
Main category: cs.LG
TL;DR: TabTune is a unified library that standardizes the workflow for tabular foundation models, addressing challenges like heterogeneous preprocessing and inconsistent evaluation.
Details
Motivation: To overcome limitations in tabular foundation model adoption caused by fragmented APIs, inconsistent fine-tuning procedures, and lack of standardized evaluation for deployment metrics.
Method: Provides a single interface with consistent access to 7 state-of-the-art models, supports multiple adaptation strategies (zero-shot, meta-learning, SFT, PEFT), automates model-aware preprocessing, and manages architectural heterogeneity.
Result: A unified framework that enables consistent benchmarking of adaptation strategies and integrates evaluation modules for performance, calibration, and fairness.
Conclusion: TabTune facilitates the adoption of tabular foundation models through standardization, extensibility, and reproducibility, with the library being open source and publicly available.
Abstract: Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models. The library is open source and available at https://github.com/Lexsi-Labs/TabTune .
[339] ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models
Lejs Deen Behric, Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil
Main category: cs.LG
TL;DR: ConMeZO is a zeroth-order optimizer that accelerates LLM finetuning by using adaptive directional sampling with momentum, achieving 2x faster convergence than MeZO while maintaining low memory usage.
Details
Motivation: Zeroth-order optimization (MeZO) is memory-efficient for LLM finetuning but suffers from slow convergence due to high-dimensional parameter spaces. The need to improve convergence speed while preserving memory efficiency motivates this work.
Method: ConMeZO restricts directional sampling to a cone centered around a momentum estimate instead of uniform random sampling, concentrating search in directions where the true gradient is more likely to be found.
Result: ConMeZO achieves the same worst-case convergence rate as MeZO theoretically, and empirically shows up to 2x faster convergence when finetuning LLMs on natural language tasks while maintaining low memory footprint.
Conclusion: ConMeZO successfully addresses the slow convergence problem of zeroth-order methods for LLM finetuning through adaptive directional sampling, providing significant speed improvements without sacrificing memory efficiency.
Abstract: Zeroth-order or derivative-free optimization (MeZO) is an attractive strategy for finetuning large language models (LLMs) because it eliminates the memory overhead of backpropagation. However, it converges slowly due to the inherent curse of dimensionality when searching for descent directions in the high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a novel zeroth-order optimizer that accelerates convergence by adaptive directional sampling. Instead of drawing the direction uniformly at random, ConMeZO restricts the sampling to a cone centered around a momentum estimate. This concentrates the search in directions where the true gradient is more likely to lie and thus reduces the effect of high dimensions. We prove that ConMeZO achieves the same worst-case convergence rate as MeZO. Empirically, when finetuning LLMs on natural language tasks, ConMeZO is up to 2X faster than MeZO while retaining the low-memory footprint of zeroth-order methods.
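A minimal sketch of cone-restricted direction sampling combined with a two-point zeroth-order update, under stated assumptions: the direction is a unit vector at a fixed angle `theta` from the momentum, the momentum is an exponential moving average of past gradient estimates, and the names `cone_direction` and `zo_step` and all default values are hypothetical. The paper's exact sampling scheme and scaling may differ.

```python
import math
import random


def cone_direction(momentum, theta, rng):
    """Sample a unit vector at angle theta (radians) from the momentum.

    Assumes len(momentum) >= 2 so that an orthogonal component exists.
    """
    u = [rng.gauss(0.0, 1.0) for _ in momentum]
    norm_m = math.sqrt(sum(m * m for m in momentum))
    if norm_m == 0.0:
        # No momentum yet: fall back to a uniformly random unit direction.
        norm_u = math.sqrt(sum(x * x for x in u))
        return [x / norm_u for x in u]
    m_hat = [m / norm_m for m in momentum]
    # Gram-Schmidt: keep only the part of u orthogonal to the momentum.
    dot = sum(a * b for a, b in zip(u, m_hat))
    perp = [a - dot * b for a, b in zip(u, m_hat)]
    norm_p = math.sqrt(sum(x * x for x in perp))
    return [math.cos(theta) * b + math.sin(theta) * a / norm_p
            for a, b in zip(perp, m_hat)]


def zo_step(loss, params, momentum, lr=0.1, eps=1e-3, theta=1.0, beta=0.9,
            rng=random):
    """One zeroth-order step: two loss evaluations, no backpropagation."""
    z = cone_direction(momentum, theta, rng)
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Central finite difference approximates the directional derivative.
    slope = (loss(plus) - loss(minus)) / (2.0 * eps)
    grad_est = [slope * zi for zi in z]
    new_momentum = [beta * m + (1.0 - beta) * g
                    for m, g in zip(momentum, grad_est)]
    new_params = [p - lr * g for p, g in zip(params, grad_est)]
    return new_params, new_momentum
```

Only forward evaluations of `loss` are needed, which is why zeroth-order methods avoid the memory overhead of backpropagation; the cone simply biases each probe toward directions that recent estimates found useful.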
[340] Assessing win strength in MLB win prediction models
Morgan Allen, Paul Savala
Main category: cs.LG
TL;DR: This paper analyzes MLB game predictions using machine learning models, showing a relationship between predicted win probability and actual win strength, and demonstrates profitable betting strategies using these predictions.
Details
Motivation: To extend previous MLB prediction work by training comprehensive ML models and relate win probabilities to actual win strength, then apply these to betting strategies.
Method: Trained multiple machine learning models on a common dataset to predict MLB game outcomes, then analyzed the relationship between predicted win probabilities and actual score differentials.
Result: Found that common ML models show a relationship between predicted win probability and win strength, and demonstrated positive returns when using appropriate betting strategies based on these predictions.
Conclusion: Machine learning models can effectively predict MLB game outcomes and win strength, but naive application to betting leads to losses while strategic use can generate positive returns.
Abstract: In Major League Baseball, strategy and planning are major factors in determining the outcome of a game. Previous studies have aided this by building machine learning models for predicting the winning team of any given game. We extend this work by training a comprehensive set of machine learning models using a common dataset. In addition, we relate the win probabilities produced by these models to win strength as measured by score differential. In doing so we show that the most common machine learning models do indeed demonstrate a relationship between predicted win probability and the strength of the win. Finally, we analyze the results of using predicted win probabilities as a decision-making mechanism on run-line betting. We demonstrate positive returns when utilizing appropriate betting strategies, and show that naive use of machine learning models for betting leads to significant losses.
[341] VecComp: Vector Computing via MIMO Digital Over-the-Air Computation
Saeed Razavikia, José Mairton Barros Da Silva Junior, Carlo Fischione
Main category: cs.LG
TL;DR: VecComp extends ChannelComp framework to enable vector function computation using multiple-antenna technology, maintaining linear computational complexity with vector dimension and robustness against channel fading.
Details
Motivation: ChannelComp was limited to scalar functions and susceptible to channel fading, while many applications require vector-based computations. VecComp addresses these limitations to support high-dimensional data-centric applications.
Method: Integrates ChannelComp with multiple-antenna technology to enable vector function computation, ensuring computational complexity scales linearly with vector dimension.
Result: Establishes non-asymptotic upper bound on mean squared error, confirming computation efficiency under fading channels. Numerical experiments demonstrate improved vector function computation and fading compensation.
Conclusion: VecComp successfully generalizes ChannelComp for vector computations while maintaining computational efficiency and robustness against channel impairments, making it suitable for high-dimensional applications.
Abstract: Recently, the ChannelComp framework has proposed digital over-the-air computation by designing digital modulations that enable the computation of arbitrary functions. Unlike traditional analog over-the-air computation, which is restricted to nomographic functions, ChannelComp enables a broader range of computational tasks while maintaining compatibility with digital communication systems. This framework is intended for applications that favor local information processing over the mere acquisition of data. However, ChannelComp is currently designed for scalar function computation, while numerous data-centric applications necessitate vector-based computations, and it is susceptible to channel fading. In this work, we introduce a generalization of the ChannelComp framework, called VecComp, by integrating ChannelComp with multiple-antenna technology. This generalization not only enables vector function computation but also ensures scalability in the computational complexity, which increases only linearly with the vector dimension. As such, VecComp remains computationally efficient and robust against channel impairments, making it suitable for high-dimensional, data-centric applications. We establish a non-asymptotic upper bound on the mean squared error of VecComp, affirming its computation efficiency under fading channel conditions. Numerical experiments show the effectiveness of VecComp in improving the computation of vector functions and fading compensation over noisy and fading multiple-access channels.
[342] Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
Xinghan Li, Haodong Wen, Kaifeng Lyu
Main category: cs.LG
TL;DR: Adam optimizer implicitly reduces a unique sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD that achieve better sparsity and generalization in certain settings.
Details
Motivation: Despite Adam's popularity in practice, most theoretical analyses focus on SGD as a proxy, leaving little understanding of how Adam's solutions actually differ from SGD's.
Method: The authors use continuous-time approximation with stochastic differential equations to rigorously characterize Adam’s behavior, and analyze its performance in overparameterized models with label noise and sparse linear regression with diagonal linear networks.
Result: When training with label noise, SGD minimizes tr(H) while Adam minimizes tr(Diag(H)^{1/2}). In sparse linear regression, this distinction enables Adam to achieve better sparsity and generalization than SGD.
Conclusion: The analysis framework extends to various adaptive gradient methods (RMSProp, Adam-mini, Adalayer, Shampoo) and provides a unified perspective on how adaptive optimizers reduce sharpness, offering insights for future optimizer design.
Abstract: Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, $\operatorname{tr}(H)$, whereas we prove that Adam minimizes $\operatorname{tr}(\operatorname{Diag}(H)^{1/2})$ instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.
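To make the two sharpness measures in the abstract above concrete, the following sketch evaluates the trace of the Hessian and the sum of square roots of its diagonal on small hand-written matrices. The function names and example matrices are illustrative assumptions; the Hessian is assumed to have a non-negative diagonal, as it does at a minimizer.

```python
import math


def trace_sharpness(hessian):
    """tr(H): the measure the abstract associates with SGD under label noise."""
    return sum(hessian[i][i] for i in range(len(hessian)))


def adam_style_sharpness(hessian):
    """tr(Diag(H)^{1/2}): sum of square roots of the diagonal entries."""
    return sum(math.sqrt(hessian[i][i]) for i in range(len(hessian)))


# Two Hessians with the same trace but differently distributed curvature.
H_concentrated = [[8.0, 0.0], [0.0, 0.0]]
H_spread = [[4.0, 0.0], [0.0, 4.0]]
```

Both matrices have trace 8, so the trace measure cannot tell them apart, while the square-root measure is smaller for the concentrated one (about 2.83 versus 4). Because the square root is concave, this measure favors curvature concentrated in few coordinates, which is consistent with the sparsity effect the abstract reports.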
[343] Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
A. Bochkov
Main category: cs.LG
TL;DR: The paper proposes a constructive scaling paradigm for LLMs using frozen embeddings and layer-wise growth with LoRA fine-tuning, achieving comparable performance to monolithic training while being more resource-efficient.
Details
Motivation: To overcome the resource-intensive and inflexible nature of monolithic LLM training by enabling incremental model growth through constructive scaling.
Method: Layer-wise constructive methodology with frozen embeddings and early layers, combined with LoRA fine-tuning for efficient holistic optimization as model depth increases.
Result: Constructively grown models achieve performance comparable to monolithically trained baselines while demonstrating stable convergence and emergence of complex reasoning abilities with increased depth.
Conclusion: The approach enables more resource-efficient scaling, continual learning, and modular AI development, suggesting a paradigm shift from monolithic optimization to constructive growth.
Abstract: The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive scaling paradigm, enabled by the principle of emergent semantics in Transformers with frozen, non-semantic input embeddings. We posit that because high-level meaning is a compositional property of a Transformer’s deep layers, not its input vectors, the embedding layer and trained lower layers can serve as a fixed foundation. This liberates backpropagation to focus solely on newly added components, making incremental growth viable. We operationalize this with a layer-wise constructive methodology that combines strict layer freezing in early stages with efficient, holistic fine-tuning of the entire model stack via low-rank adaptation (LoRA) as complexity increases. This method not only demonstrates stable convergence but also reveals a direct correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD, which are absent in shallower models. In a controlled study, our constructively grown model rivals the performance of a monolithically trained baseline of the same size, validating the efficiency and efficacy of the approach. Our findings suggest a path towards a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development. This opens a path for more resource-efficient scaling, continual learning, and a more modular approach to building powerful AI systems. We release all code and models to facilitate further research.
[344] Enhancing Federated Learning Privacy with QUBO
Andras Ferenczi, Sutapa Samanta, Dagen Wang, Todd Hodges
Main category: cs.LG
TL;DR: A quantum-inspired QUBO formulation for federated learning that selects minimal client updates per round, reducing privacy exposure by up to 95.2% while maintaining model accuracy.
Details
Motivation: To address cumulative privacy risks in federated learning where repeated client participation increases exposure to membership inference, property inference, and model inversion attacks.
Method: Uses quadratic unconstrained binary optimization (QUBO) to select a small subset of most relevant client updates for each training round, assuming a trusted server with access to validation data.
Result: On MNIST with 300 clients: 95.2% per-round and 49% cumulative privacy exposure reduction, 147 clients never used while maintaining full-aggregation accuracy. On CINIC-10: 82% per-round and 33% cumulative privacy improvement.
Conclusion: The QUBO-based client selection method effectively reduces privacy exposure in federated learning without compromising model performance, demonstrating scalability across different datasets and model complexities.
Abstract: Federated learning (FL) is a widely used method for training machine learning (ML) models in a scalable way while preserving privacy (i.e., without centralizing raw data). Prior research shows that the risk of exposing sensitive data increases cumulatively as the number of iterations where a client’s updates are included in the aggregated model increases. Attackers can launch membership inference attacks (MIA; deciding whether a sample or client participated), property inference attacks (PIA; inferring attributes of a client’s data), and model inversion attacks (MI; reconstructing inputs), thereby inferring client-specific attributes and, in some cases, reconstructing inputs. In this paper, we mitigate risk by substantially reducing per-client exposure using a quantum computing-inspired quadratic unconstrained binary optimization (QUBO) formulation that selects a small subset of client updates most relevant for each training round. In this work, we focus on two threat vectors: (i) information leakage by clients during training and (ii) adversaries who can query or obtain the global model. We assume a trusted central server and do not model server compromise. This method also assumes that the server has access to a validation/test set with the global data distribution. Experiments on the MNIST dataset with 300 clients in 20 rounds showed a 95.2% per-round and 49% cumulative privacy exposure reduction, with 147 clients’ updates never being used during training, while generally matching or exceeding full-aggregation accuracy. The method also proved efficient at a smaller scale with a more complex model: a CINIC-10 experiment with 30 clients resulted in an 82% per-round and 33% cumulative privacy improvement.
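The abstract above does not give the QUBO objective, so the following is a hypothetical formulation for illustration only: reward each client's relevance, penalize pairwise redundancy, and add a quadratic penalty that pushes the number of selected clients toward `k`. The function names, the `relevance` and `redundancy` inputs, and the brute-force solver (practical only for a handful of clients) are all assumptions.

```python
from itertools import product


def build_qubo(relevance, k, penalty, redundancy=None):
    """Upper-triangular Q for: -sum(r_i x_i) + penalty * (sum(x_i) - k)^2 + overlap."""
    n = len(relevance)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Since x_i^2 = x_i, the expanded penalty adds penalty * (1 - 2k) per client.
        Q[i][i] = -relevance[i] + penalty * (1 - 2 * k)
        for j in range(i + 1, n):
            overlap = redundancy[i][j] if redundancy else 0.0
            Q[i][j] = 2 * penalty + overlap
    return Q


def energy(Q, x):
    """Evaluate x^T Q x for a binary vector x (upper triangle only)."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def solve_brute_force(Q):
    """Enumerate all 2^n selections and return the indices of the best one."""
    n = len(Q)
    best = min(product((0, 1), repeat=n), key=lambda x: energy(Q, x))
    return [i for i, bit in enumerate(best) if bit]
```

The constant term `penalty * k^2` is dropped because it does not change which selection is optimal. For realistic client counts the enumeration would be replaced by an annealer or a heuristic QUBO solver.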
[345] Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation
Ziniu Zhang, Zhenshuo Zhang, Dongyue Li, Lu Wang, Jennifer Dy, Hongyang R. Zhang
Main category: cs.LG
TL;DR: Proposes a gradient-based algorithm for selecting optimal demonstration examples in in-context learning, achieving 37.7x speedup and 11% better performance than embedding-based methods.
Details
Motivation: To efficiently select the best k examples from n candidates for in-context learning, as previous methods based on token embedding similarity have limitations.
Method: Uses gradient-based influence estimation through first-order approximation, aggregates scores from multiple random subsets, and selects top-k examples.
Result: Achieves <1% approximation error, 37.7x speedup on 34B parameter models, and 11% average improvement over embedding-based methods across various models and datasets.
Conclusion: Gradient-based demonstration selection is highly efficient and effective for in-context learning, enabling scalable subset selection without full inference.
Abstract: This paper introduces an algorithm to select demonstration examples for in-context learning of a query set. Given a set of $n$ examples, how can we quickly select $k$ out of $n$ to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. Since model weights remain fixed during in-context learning, previous work has sought to design methods based on the similarity of token embeddings. This work proposes a new approach based on gradients of the output taken in the input embedding space. Our approach estimates model outputs through a first-order approximation using the gradients. Then, we apply this estimation to multiple randomly sampled subsets. Finally, we aggregate the sampled subset outcomes to form an influence score for each demonstration, and select $k$ most relevant examples. This procedure only requires pre-computing model outputs and gradients once, resulting in a linear-time algorithm relative to model and training set sizes. Extensive experiments across various models and datasets validate the efficiency of our approach. We show that the gradient estimation procedure yields approximations of full inference with less than $1\%$ error across six datasets. This allows us to scale up subset selection that would otherwise run full inference by up to $37.7\times$ on models with up to $34$ billion parameters, and outperform existing selection methods based on input embeddings by $11\%$ on average.
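The three steps in the abstract above (first-order estimate, random subsets, aggregated influence scores) can be sketched as follows. Several pieces are assumptions made for illustration: a prompt is represented by the mean of its demonstrations' embeddings, a higher estimated output is treated as better, and the names `estimate_output` and `select_demonstrations` are hypothetical. The output `f0` and gradient `grad` at the reference point `x0` are assumed to have been pre-computed once.

```python
import random


def estimate_output(f0, grad, x0, x):
    """First-order estimate: f(x) ~ f(x0) + grad . (x - x0)."""
    return f0 + sum(g * (a - b) for g, a, b in zip(grad, x, x0))


def select_demonstrations(embeddings, f0, grad, x0, k, subset_size,
                          n_subsets, seed=0):
    """Score demonstrations by the mean estimated output of subsets containing them."""
    rng = random.Random(seed)
    n, dim = len(embeddings), len(x0)
    totals, counts = [0.0] * n, [0] * n
    for _ in range(n_subsets):
        subset = rng.sample(range(n), subset_size)
        # Hypothetical prompt representation: mean of the subset's embeddings.
        x = [sum(embeddings[i][d] for i in subset) / subset_size
             for d in range(dim)]
        score = estimate_output(f0, grad, x0, x)  # no model inference needed
        for i in subset:
            totals[i] += score
            counts[i] += 1
    # Demonstrations never sampled get the lowest possible score.
    influence = [t / c if c else float("-inf") for t, c in zip(totals, counts)]
    ranked = sorted(range(n), key=lambda i: influence[i], reverse=True)
    return sorted(ranked[:k])
```

Each subset costs one dot product instead of a forward pass, which is the source of the speedup the abstract describes once outputs and gradients have been computed.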
[346] Fast, Private, and Protected: Safeguarding Data Privacy and Defending Against Model Poisoning Attacks in Federated Learning
Nicolas Riccieri Gardin Assumpcao, Leandro Villas
Main category: cs.LG
TL;DR: FPP is a novel federated learning approach that provides security against model poisoning attacks while maintaining privacy through secure aggregation and enabling training recovery.
Details
Motivation: Federated Learning faces security challenges where privacy-preserving mechanisms make it difficult to protect against attackers who want to compromise the training outcome through model poisoning attacks.
Method: FPP uses participants’ assessments to evaluate rounds, enables training recovery after attacks, and employs a reputation-based mechanism to mitigate attacker participation. It was validated in a dockerized environment against FedAvg, Power-of-Choice, and aggregation via Trimmed Mean and Median.
Result: FPP achieves rapid convergence rate and can converge even in the presence of malicious participants performing model poisoning attacks.
Conclusion: FPP successfully safeguards federated training while preserving data privacy through secure aggregation and provides resilience against model poisoning attacks.
Abstract: Federated Learning (FL) is a distributed training paradigm wherein participants collaborate to build a global model while ensuring the privacy of the involved data, which remains stored on participant devices. However, proposals aiming to ensure such privacy also make it challenging to protect against potential attackers seeking to compromise the training outcome. In this context, we present Fast, Private, and Protected (FPP), a novel approach that aims to safeguard federated training while enabling secure aggregation to preserve data privacy. This is accomplished by evaluating rounds using participants’ assessments and enabling training recovery after an attack. FPP also employs a reputation-based mechanism to mitigate the participation of attackers. We created a dockerized environment to validate the performance of FPP compared to other approaches in the literature (FedAvg, Power-of-Choice, and aggregation via Trimmed Mean and Median). Our experiments demonstrate that FPP achieves a rapid convergence rate and can converge even in the presence of malicious participants performing model poisoning attacks.
[347] FlowRL: Matching Reward Distributions for LLM Reasoning
Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin
Main category: cs.LG
TL;DR: FlowRL is a reinforcement learning method for LLMs that matches full reward distributions instead of maximizing rewards, improving diversity and performance on reasoning tasks.
Details
Motivation: Existing reward-maximizing methods like PPO and GRPO tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, reducing diversity in LLM reasoning.
Method: Transform scalar rewards into a normalized target distribution using a learnable partition function, then minimize the reverse KL divergence between the policy and the target distribution via flow-balanced optimization.
Result: Achieves 10.0% average improvement over GRPO and 5.1% over PPO on math benchmarks, with consistent better performance on code reasoning tasks.
Conclusion: Reward distribution-matching is a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
Abstract: We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
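To make the distribution-matching idea concrete, here is a toy Python sketch on a small discrete set of candidate responses. It is not FlowRL's training objective. The temperature `beta`, the use of an exact softmax normaliser in place of the learnable partition function, and the example rewards are all assumptions.

```python
import math


def target_distribution(rewards, beta=1.0):
    """Turn scalar rewards into a normalized distribution proportional to exp(beta * r)."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(weights)  # exact normaliser; a learnable partition function would estimate this
    return [w / z for w in weights]


def reverse_kl(policy, target):
    """KL(policy || target), the divergence the abstract says is minimized."""
    return sum(p * math.log(p / q) for p, q in zip(policy, target) if p > 0)


# Three candidate reasoning paths; two are nearly equally good (made-up rewards).
rewards = [1.0, 0.9, 0.1]
target = target_distribution(rewards, beta=2.0)

collapsed = [0.98, 0.01, 0.01]  # a policy that puts almost all mass on the top reward
kl_collapsed = reverse_kl(collapsed, target)
kl_matched = reverse_kl(target, target)
```

A policy that collapses onto the single best path has a positive divergence from the target, whereas a policy that spreads mass in proportion to the exponentiated rewards reaches zero, which is the sense in which matching the distribution preserves the second valid path.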
[348] GeoCrossBench: Cross-Band Generalization for Remote Sensing
Hakob Tamazyan, Ani Vanyan, Alvard Barseghyan, Anna Khosrovyan, Evan Shelhamer, Hrant Khachatrian
Main category: cs.LG
TL;DR: GeoCrossBench extends GeoBench with cross-satellite evaluation, showing current models struggle with new satellites lacking band overlap or having additional bands. ChiViT outperforms others in no-overlap scenarios, but all models suffer performance drops. Fine-tuning with oracle labels shows potential for consistent performance.
Details
Motivation: As remote sensing satellites proliferate with diverse sensors, foundation models need better cross-satellite generalization since retraining for each new satellite is costly. Most labeled data comes from older satellites.
Method: Developed GeoCrossBench benchmark with three evaluation protocols: in-distribution performance, generalization to satellites with no band overlap, and generalization to satellites with additional bands. Also created ChiViT, a self-supervised extension of ChannelViT.
Result: Best remote sensing foundation models (DOFA, TerraFM) don’t outperform general models like DINOv3 in-distribution. All models suffer 2-4x performance drop with no band overlap, where ChiViT significantly outperforms DINOv3. Models drop 5-25% with additional bands. Fine-tuning last layer with oracle labels achieves consistent performance.
Conclusion: Current remote sensing models lack strong cross-satellite generalization. The benchmark is far from saturated, highlighting need for more future-proof models. Code and datasets released publicly to encourage development.
Abstract: The number and diversity of remote sensing satellites grow over time, while the vast majority of labeled data comes from older satellites. As the foundation models for Earth observation scale up, the cost of (re-)training to support new satellites grows too, so the generalization capabilities of the models towards new satellites become increasingly important. In this work we introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a new evaluation protocol: it tests the in-distribution performance; generalization to satellites with no band overlap; and generalization to satellites with additional bands with respect to the training set. We also develop a self-supervised extension of ChannelViT, ChiViT, to improve its cross-satellite performance. First, we show that even the best foundation models for remote sensing (DOFA, TerraFM) do not outperform general purpose models like DINOv3 in the in-distribution setting. Second, when generalizing to new satellites with no band overlap, all models suffer a 2-4x drop in performance, and ChiViT significantly outperforms the runner-up DINOv3. Third, the performance of all tested models drops on average by 5-25% when given additional bands during test time. Finally, we show that fine-tuning just the last linear layer of these models using oracle labels from all bands can get relatively consistent performance across all satellites, highlighting that the benchmark is far from being saturated. We publicly release the code and the datasets to encourage the development of more future-proof remote sensing models with stronger cross-satellite generalization.
[349] Hybrid Quantum-Classical Recurrent Neural Networks
Wenduan Xu
Main category: cs.LG
TL;DR: A hybrid quantum-classical recurrent neural network with quantum recurrent core using parametrized quantum circuits as coherent memory in exponentially large Hilbert space, achieving competitive performance on various sequence-learning tasks.
Details
Motivation: To create a physically consistent quantum RNN that unifies unitary recurrence as high-capacity memory, partial observation via mid-circuit readouts, and nonlinear classical control for input-conditioned parametrization.
Method: Hybrid architecture with quantum recurrent core (parametrized quantum circuit) controlled by classical feedforward network. Hidden state is quantum state of n-qubit PQC, updated via unitary dynamics with mid-circuit Pauli expectation-value readouts combined with classical processing.
Result: Achieved competitive performance against strong classical baselines on sentiment analysis, MNIST, permuted MNIST, copying memory, language modeling, and machine translation with soft attention mechanism.
Conclusion: First quantum-grounded model to achieve competitive performance across broad class of sequence-learning tasks, demonstrating viability of quantum recurrent neural networks with physically consistent design.
Abstract: We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC in an exponentially large Hilbert space $\mathbb{C}^{2^n}$, which serves as a coherent recurrent quantum memory. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit Pauli expectation-value readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit readouts, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling. For sequence-to-sequence learning, we further devise a soft attention mechanism over the mid-circuit readouts and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
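The two properties the abstract leans on, norm-preserving unitary updates and Pauli expectation-value readouts, can be checked on a single simulated qubit. The Python sketch below is a deliberately reduced illustration, not the paper's architecture: it uses one qubit, only RY rotations, and fixed angles where the QRNN would use angles produced by the classical feedforward network. All names are hypothetical.

```python
import math


def ry(theta):
    """Matrix of a single-qubit RY rotation, which is unitary by construction."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


def apply(gate, state):
    """Apply a 2x2 gate to a two-amplitude state vector."""
    return [
        gate[0][0] * state[0] + gate[0][1] * state[1],
        gate[1][0] * state[0] + gate[1][1] * state[1],
    ]


def norm_squared(state):
    return sum(abs(a) ** 2 for a in state)


def expect_z(state):
    """Pauli-Z expectation value: P(0) - P(1)."""
    return abs(state[0]) ** 2 - abs(state[1]) ** 2


state = [1.0, 0.0]           # hidden state starts in |0>
angles = [0.3, 1.1, -0.4]    # stand-ins for network-produced parameters
readouts = []
for theta in angles:
    state = apply(ry(theta), state)   # unitary hidden-state update
    readouts.append(expect_z(state))  # readout computed from the simulated state
```

Because successive RY rotations add their angles, the final readout equals cos(0.3 + 1.1 - 0.4) = cos(1.0), and the state norm stays at 1 without any explicit normalisation step.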
[350] Explainable Graph Neural Architecture Search via Monte-Carlo Tree Search (Full version)
Yuya Sasaki
Main category: cs.LG
TL;DR: ExGNAS is an explainable Graph Neural Architecture Search method that uses a simple search space and Monte-Carlo tree search to find optimal GNN architectures while providing explainability for the selection process.
Details
Motivation: Existing Graph NAS methods lack explainability in why specific architectures are selected due to complex search spaces and neural models, making it difficult to understand the decision process.
Method: Proposes ExGNAS with (i) a simple search space adaptable to various graphs and (ii) a Monte-Carlo tree search algorithm that makes the decision process explainable.
Result: ExGNAS achieves high accuracy (up to 26.1% improvement) and efficiency (up to 88% runtime reduction) compared to four state-of-the-art Graph NAS methods across twelve graphs, with demonstrated effectiveness in explainability through user studies.
Conclusion: The combination of simple search space and explainable search algorithm enables finding accurate GNN models while providing insights into important functions within the search space, addressing the explainability gap in Graph NAS.
Abstract: The number of graph neural network (GNN) architectures has increased rapidly due to the growing adoption of graph analysis. Although GNNs are used in a wide range of application scenarios, it is laborious to design or select optimal GNN architectures for diverse graphs. To reduce human effort, graph neural architecture search (Graph NAS) has been used to search for a sub-optimal GNN architecture that combines existing components. However, existing Graph NAS methods lack the explainability needed to understand why a model architecture is selected, because they use complex search spaces and neural models to select architectures. Therefore, we propose an explainable Graph NAS method, called ExGNAS, which consists of (i) a simple search space that can adapt to various graphs and (ii) a search algorithm with Monte-Carlo tree search that makes the decision process explainable. The combination of our search space and algorithm finds both accurate GNN models and the important functions within the search space. We comprehensively evaluate ExGNAS against four state-of-the-art Graph NAS methods on twelve graphs. Our experimental results show that ExGNAS achieves high average accuracy and efficiency, improving accuracy by up to 26.1% and reducing run time by up to 88%. Furthermore, we show the effectiveness of explainability through a questionnaire-based user study and architecture analysis.
[351] Reset & Distill: A Recipe for Overcoming Negative Transfer in Continual Reinforcement Learning
Hongjoon Ahn, Jinu Hyeon, Youngmin Oh, Bosun Hwang, Taesup Moon
Main category: cs.LG
TL;DR: The paper addresses negative transfer in Continual Reinforcement Learning (CRL) and proposes Reset & Distill (R&D), a simple baseline method that resets actor/critic networks for new tasks and uses distillation to overcome negative transfer, outperforming recent approaches.
Details
Motivation: Negative transfer problem in CRL when learning new tasks is a significant issue that existing methods fail to address effectively, requiring dedicated solutions.
Method: Reset & Distill (R&D) method that resets agent’s online actor and critic networks for new tasks, combined with offline distillation from online actor and previous expert’s action probabilities.
Result: Extensive experiments on Meta World tasks show R&D consistently outperforms recent approaches with significantly higher success rates across tasks.
Conclusion: Negative transfer is a critical problem in CRL that requires robust strategies like R&D to mitigate detrimental effects, highlighting the importance of addressing this issue in CRL algorithm development.
Abstract: We argue that the negative transfer problem that occurs when a new task arrives is an important problem that should not be overlooked when developing effective Continual Reinforcement Learning (CRL) algorithms. Through comprehensive experimental validation, we demonstrate that such an issue frequently exists in CRL and cannot be effectively addressed by several recent works on either mitigating plasticity loss of RL agents or enhancing positive transfer in the CRL scenario. To that end, we develop Reset & Distill (R&D), a simple yet highly effective baseline method, to overcome the negative transfer problem in CRL. R&D combines a strategy of resetting the agent’s online actor and critic networks to learn a new task and an offline learning step for distilling the knowledge from the online actor and previous expert’s action probabilities. We carried out extensive experiments on long sequences of Meta World tasks and show that our simple baseline method consistently outperforms recent approaches, achieving significantly higher success rates across a range of tasks. Our findings highlight the importance of considering negative transfer in CRL and emphasize the need for robust strategies like R&D to mitigate its detrimental effects.
[352] Improving Uncertainty Estimation through Semantically Diverse Language Generation
Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter
Main category: cs.LG
TL;DR: SDLG introduces a method to quantify predictive uncertainty in LLMs by generating semantically diverse alternatives, helping detect hallucinations more effectively and efficiently.
Details
Motivation: LLMs suffer from hallucinations that make them untrustworthy, and predictive uncertainty is identified as a main cause of these hallucinations.
Method: Semantically Diverse Language Generation (SDLG) steers LLMs to generate semantically diverse yet likely alternatives for initially generated text to measure aleatoric semantic uncertainty.
Result: SDLG consistently outperforms existing methods on question-answering tasks and is the most computationally efficient approach.
Conclusion: SDLG sets a new standard for uncertainty estimation in LLMs by providing precise detection of hallucinations through semantic uncertainty measurement.
Abstract: Large language models (LLMs) can suffer from hallucinations when generating text. These hallucinations impede various applications in society and industry by making LLMs untrustworthy. Current LLMs generate text in an autoregressive fashion by predicting and appending text tokens. When an LLM is uncertain about the semantic meaning of the next tokens to generate, it is likely to start hallucinating. Thus, it has been suggested that predictive uncertainty is one of the main causes of hallucinations. We introduce Semantically Diverse Language Generation (SDLG) to quantify predictive uncertainty in LLMs. SDLG steers the LLM to generate semantically diverse yet likely alternatives for an initially generated text. This approach provides a precise measure of aleatoric semantic uncertainty, detecting whether the initial text is likely to be hallucinated. Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs.
[353] Link Prediction with Untrained Message Passing Layers
Lisi Qarkaxhija, Anatol E. Wegner, Ingo Scholtes
Main category: cs.LG
TL;DR: Untrained message passing layers in graph neural networks can achieve competitive or superior link prediction performance compared to fully trained MPNNs, especially with high-dimensional features.
Details
Motivation: Most MPNNs require large labeled datasets for training, which is costly and time-consuming. This work explores removing trainable parameters from message passing layers to create more efficient models.
Method: Use variants of popular message passing architectures where all trainable parameters for transforming node features are removed, creating untrained message passing layers.
Result: Untrained message passing layers lead to competitive and even superior performance compared to fully trained MPNNs for link prediction, particularly with high-dimensional features.
Conclusion: Untrained message passing provides a highly efficient and interpretable approach to link prediction, with theoretical connections to path-based topological node similarity measures.
Abstract: Message passing neural networks (MPNNs) operate on graphs by exchanging information between neighbouring nodes. MPNNs have been successfully applied to various node-, edge-, and graph-level tasks in areas like molecular science, computer vision, natural language processing, and combinatorial optimization. However, most MPNNs require training on large amounts of labeled data, which can be costly and time-consuming. In this work, we explore the use of various untrained message passing layers in graph neural networks, i.e. variants of popular message passing architectures where we remove all trainable parameters that are used to transform node features in the message passing step. Focusing on link prediction, we find that untrained message passing layers can lead to competitive and even superior performance compared to fully trained MPNNs, especially in the presence of high-dimensional features. We provide a theoretical analysis of untrained message passing by relating the inner products of features implicitly produced by untrained message passing layers to path-based topological node similarity measures. As such, untrained message passing architectures can be viewed as a highly efficient and interpretable approach to link prediction.
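The link between untrained message passing and path-based similarity can be seen on a four-node toy graph. The Python sketch below makes several assumptions that are not taken from the paper: plain sum aggregation, one-hot node features, and a single propagation step. Under those assumptions the inner product of two propagated feature vectors is exactly the number of common neighbours, which is also the number of length-2 paths between the nodes.

```python
def propagate(adj, feats):
    """One parameter-free message passing step: each node sums its neighbours' features."""
    dim = len(feats[0])
    return [
        [sum(feats[v][j] for v in adj[u]) for j in range(dim)]
        for u in range(len(adj))
    ]


def link_score(h, u, v):
    """Score a candidate edge by the inner product of the two node embeddings."""
    return sum(a * b for a, b in zip(h[u], h[v]))


# Toy undirected graph as adjacency lists; nodes 0 and 3 are not connected.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
n = len(adj)
one_hot = [[1 if i == j else 0 for j in range(n)] for i in range(n)]

h = propagate(adj, one_hot)
score_0_3 = link_score(h, 0, 3)  # nodes 0 and 3 share neighbours 1 and 2
score_0_1 = link_score(h, 0, 1)  # nodes 0 and 1 share only neighbour 2
```

No weights are trained anywhere in this sketch, yet the missing edge (0, 3) receives the highest score because its endpoints share two neighbours.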
[354] Lower-dimensional projections of cellular expression improves cell type classification from single-cell RNA sequencing
Muhammad Umar, Andras Lakatos, Muhammad Asif, Arif Mahmood
Main category: cs.LG
TL;DR: EnProCell is a reference-based method for single-cell RNA sequencing cell type classification that uses ensemble projections and deep neural networks, achieving state-of-the-art performance with high accuracy and F1 scores.
Details
Motivation: Existing cell-type classification methods for scRNA-seq data primarily use unsupervised lower dimensional projections from large reference datasets, which may not optimally capture class separability needed for accurate classification.
Method: EnProCell first computes lower dimensional projections using an ensemble of PCA and multiple discriminant analysis to capture both high variance and class separability, then trains a deep neural network on these projections for cell type classification.
Result: EnProCell outperformed existing methods on four datasets from different sequencing technologies, achieving 98.91% accuracy and 98.64% F1 score for reference data prediction, and 99.52% accuracy and 99.07% F1 score for query data prediction.
Conclusion: EnProCell provides superior cell type classification performance while being computationally efficient and simple to implement, making it a valuable tool for single-cell RNA sequencing analysis.
Abstract: Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at the single-cell level. It provides a global view of cell-type specification during the onset of biological mechanisms such as developmental processes and human organogenesis. Various statistical, machine and deep learning-based methods have been proposed for cell-type classification. Most of these methods utilize unsupervised lower dimensional projections obtained from large reference data. In this work, we propose a reference-based method for cell type classification, called EnProCell. EnProCell first computes lower dimensional projections that capture both the high variance and class separability through an ensemble of principal component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower dimensional representation of data to classify cell types. The proposed method outperformed the existing state-of-the-art methods when tested on four different data sets produced from different single-cell sequencing technologies. EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods for predicting reference from reference datasets. Similarly, EnProCell also showed better performance than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy: 99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require more computational resources and time. EnProCell is available at https://github.com/umar1196/EnProCell.
[355] A Systematic Literature Review of Spatio-Temporal Graph Neural Network Models for Time Series Forecasting and Classification
Flavio Corradini, Flavio Gerosa, Marco Gori, Carlo Lucheroni, Marco Piangerelli, Martina Zannotti
Main category: cs.LG
TL;DR: A systematic literature review of 366 papers on spatio-temporal GNNs for time series classification and forecasting, providing comprehensive analysis of models, applications, datasets, and results across different domains.
Details
Motivation: To provide a comprehensive overview of spatio-temporal GNN modeling approaches and application domains for time series analysis, addressing the need for systematic comparison in this rapidly growing field.
Method: Conducted database search and detailed examination of 366 selected papers, analyzing models, source code links, datasets, benchmark models, and results across various application domains.
Result: First and broadest systematic literature review comparing results from current spatio-temporal GNN models across different domains, identifying current limitations and challenges in the field.
Conclusion: The review highlights key challenges including comparability, reproducibility, explainability, poor information capacity, and scalability in spatio-temporal GNN applications, and provides interactive tools for further exploration.
Abstract: In recent years, spatio-temporal graph neural networks (GNNs) have attracted considerable interest in the field of time series analysis, due to their ability to capture, at once, dependencies among variables and across time points. The objective of this systematic literature review is hence to provide a comprehensive overview of the various modeling approaches and application domains of GNNs for time series classification and forecasting. A database search was conducted, and 366 papers were selected for a detailed examination of the current state-of-the-art in the field. This examination is intended to offer the reader a comprehensive review of proposed models, links to related source code, available datasets, benchmark models, and fitting results. We hope this information will assist researchers in their studies. To the best of our knowledge, this is the first and broadest systematic literature review presenting a detailed comparison of results from current spatio-temporal GNN models applied to different domains. In its final part, this review discusses current limitations and challenges in the application of spatio-temporal GNNs, such as comparability, reproducibility, explainability, poor information capacity, and scalability. This paper is complemented by a GitHub repository at https://github.com/FlaGer99/SLR-Spatio-Temporal-GNN.git providing additional interactive tools to further explore the presented findings.
[356] LoLaFL: Low-Latency Federated Learning via Forward-only Propagation
Jierui Zhang, Jianhao Huang, Kaibin Huang
Main category: cs.LG
TL;DR: LoLaFL is a low-latency federated learning framework that uses forward-only propagation and nonlinear aggregation schemes to reduce communication rounds and latency by over 87-97% while maintaining accuracy.
Details
Motivation: Traditional federated learning with deep neural networks cannot meet 6G low-latency requirements due to high-dimensional parameter transmission and numerous communication rounds needed for convergence.
Method: Extends maximal coding rate reduction principle to learn linear discriminative features, implements forward-only propagation, and proposes two nonlinear aggregation schemes: harmonic-mean-like aggregation and low-rank covariance matrix transmission.
Result: Achieves latency reductions of over 87% and 97% respectively with the two aggregation schemes while maintaining comparable accuracies to traditional FL.
Conclusion: LoLaFL successfully addresses the latency challenges in federated learning for 6G networks through forward-only propagation and efficient nonlinear aggregation strategies.
Abstract: Federated learning (FL) has emerged as a widely adopted paradigm for enabling edge learning with distributed data while ensuring data privacy. However, the traditional FL with deep neural networks trained via backpropagation can hardly meet the low-latency learning requirements in the sixth generation (6G) mobile networks. This challenge mainly arises from the high-dimensional model parameters to be transmitted and the numerous rounds of communication required for convergence due to the inherent randomness of the training process. To address this issue, we adopt the state-of-the-art principle of maximal coding rate reduction to learn linear discriminative features and extend the resultant white-box neural network into FL, yielding the novel framework of Low-Latency Federated Learning (LoLaFL) via forward-only propagation. LoLaFL enables layer-wise transmissions and aggregation with significantly fewer communication rounds, thereby considerably reducing latency. Additionally, we propose two nonlinear aggregation schemes for LoLaFL. The first scheme is based on the proof that the optimal NN parameter aggregation in LoLaFL should be harmonic-mean-like. The second scheme further exploits the low-rank structures of the features and transmits the low-rank-approximated covariance matrices of features to achieve additional latency reduction. Theoretical analysis and experiments are conducted to evaluate the performance of LoLaFL. In comparison with traditional FL, the two nonlinear aggregation schemes for LoLaFL can achieve reductions in latency of over 87% and 97%, respectively, while maintaining comparable accuracies.
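The abstract does not spell out the aggregation rule, so the following Python sketch only illustrates what "harmonic-mean-like" means in the simplest scalar case, contrasted with the arithmetic averaging used in FedAvg-style aggregation. The function names, the per-client weights, and the numbers are hypothetical, and the actual LoLaFL scheme operates on network parameters rather than scalars.

```python
def weighted_arithmetic_mean(values, weights):
    """Linear aggregation: sum of w * v divided by the total weight."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_harmonic_mean(values, weights):
    """Nonlinear aggregation: total weight divided by the sum of w / v."""
    if any(v <= 0 for v in values):
        raise ValueError("this scalar harmonic mean needs positive values")
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


# Two clients reporting a positive scalar parameter (made-up values).
client_values = [1.0, 4.0]
client_weights = [1.0, 1.0]

arithmetic = weighted_arithmetic_mean(client_values, client_weights)  # 2.5
harmonic = weighted_harmonic_mean(client_values, client_weights)      # 1.6
```

The harmonic mean averages reciprocals and then inverts, so it is dominated by the smaller values; this is what makes the aggregation nonlinear in the clients' parameters.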
[357] LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency
Xiao-Yin Liu, Guotao Li, Xiao-Hu Zhou, Zeng-Guang Hou
Main category: cs.LG
TL;DR: LEASE is an offline preference-based RL algorithm that uses a learned transition model to generate unlabeled preference data and employs uncertainty-aware labeling to improve sample efficiency.
Details
Motivation: Overcome the challenge of acquiring sufficient preference labels in offline PbRL by leveraging model-generated data while ensuring reward model accuracy.
Method: Uses learned transition model to generate unlabeled preference data, implements uncertainty-aware mechanism to select high-confidence data for labeling, and provides theoretical generalization bounds.
Result: Achieves comparable performance to baselines with fewer preference data and no online interaction, with theoretical improvement guarantees.
Conclusion: LEASE effectively improves sample efficiency in offline PbRL through model-generated data and uncertainty-aware labeling, with proven theoretical guarantees.
Abstract: Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of designing rewards and the high costs of online interaction. However, since labeling preferences requires real-time human feedback, acquiring sufficient preference labels is challenging. To solve this, this paper proposes an offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering that the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of the reward model, where only high-confidence and low-variance data are selected. Moreover, we provide the generalization bound of the reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has a theoretical improvement guarantee. The developed theory is based on state-action pairs, which can be easily combined with other offline algorithms. The experimental results show that LEASE can achieve comparable performance to baselines with less preference data and without online interaction.
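The "high confidence and low variance" filter can be sketched as follows. This Python example assumes a small ensemble of reward models that each output the probability that the first trajectory segment in a pair is preferred. The ensemble, both thresholds, and the probabilities are hypothetical choices for illustration and are not taken from the paper.

```python
from statistics import mean, pvariance


def select_pseudo_labels(ensemble_probs, conf_threshold=0.9, var_threshold=0.01):
    """Keep only pairs whose predicted preference is confident and consistent.

    ensemble_probs[i] holds each ensemble member's probability that, for pair i,
    the first segment is preferred. Returns (pair_index, pseudo_label) tuples.
    """
    selected = []
    for idx, probs in enumerate(ensemble_probs):
        p = mean(probs)
        confidence = max(p, 1.0 - p)   # distance from an undecided 0.5
        disagreement = pvariance(probs)
        if confidence >= conf_threshold and disagreement <= var_threshold:
            selected.append((idx, 1 if p >= 0.5 else 0))
    return selected


ensemble_probs = [
    [0.95, 0.97, 0.93],  # confident and consistent: kept with label 1
    [0.55, 0.45, 0.50],  # undecided: dropped
    [0.99, 0.99, 0.75],  # confident on average but members disagree: dropped
    [0.04, 0.06, 0.05],  # confident and consistent: kept with label 0
]
pseudo_labels = select_pseudo_labels(ensemble_probs)
```

Only the first and last pairs pass both checks, so only they would be added to the preference dataset with model-generated labels.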
[358] Position: Bridge the Gaps between Machine Unlearning and AI Regulation
Bill Marino, Meghdad Kurmanji, Nicholas D. Lane
Main category: cs.LG
TL;DR: This position paper analyzes the gap between current machine unlearning capabilities and their potential applications for AI regulation compliance, using the EU’s AI Act as a case study.
Details
Motivation: The emergence of AI regulations like the EU's AI Act creates new use cases for machine unlearning beyond traditional data privacy applications, but technical gaps need to be addressed.
Method: The authors conduct a systematic analysis of the EU AI Act provisions, catalog potential machine unlearning applications for compliance, and identify technical gaps between current capabilities and regulatory requirements.
Result: The study reveals significant technical gaps that prevent machine unlearning from effectively supporting AI Act compliance, despite its theoretical potential.
Conclusion: Researchers need to proactively solve open technical questions to bridge the gap between machine unlearning’s current state and its potential for AI regulation compliance.
Abstract: The “right to be forgotten” and the data privacy laws that encode it have motivated machine unlearning since its earliest days. Now, some argue that an inbound wave of artificial intelligence regulations – like the European Union’s Artificial Intelligence Act (AIA) – may offer important new use cases for machine unlearning. However, this position paper argues, this opportunity will only be realized if researchers proactively bridge the (sometimes sizable) gaps between machine unlearning’s state of the art and its potential applications to AI regulation. To demonstrate this point, we use the AIA as our primary case study. Specifically, we deliver a “state of the union” as regards machine unlearning’s current potential (or, in many cases, lack thereof) for aiding compliance with various provisions of the AIA. This starts with a precise cataloging of the potential applications of machine unlearning to AIA compliance. For each, we flag the technical gaps that exist between the potential application and the state of the art of machine unlearning. Finally, we end with a call to action: for machine learning researchers to solve the open technical questions that could unlock machine unlearning’s potential to assist compliance with the AIA – and other AI regulations like it.
[359] Noise-based reward-modulated learning
Jesús García Fernández, Nasir Ahmad, Marcel van Gerven
Main category: cs.LG
TL;DR: NRL is a novel synaptic plasticity rule that unifies reinforcement learning and gradient-based optimization using noise-driven local updates, achieving comparable performance to backpropagation while being more suitable for neuromorphic systems.
Details
Motivation: To develop energy-efficient and adaptive AI for neuromorphic computing platforms that require local information processing and effective credit assignment, addressing the computational bottleneck of exact gradients.
Method: Uses noise-based reward-modulated learning that approximates gradients through stochastic neural activity, employs reward prediction errors as optimization targets, and uses eligibility traces for retrospective credit assignment.
Result: Achieves performance comparable to backpropagation baselines (though with slower convergence) and significantly outperforms reward-modulated Hebbian learning in multi-layer networks, demonstrating superior scalability.
Conclusion: NRL provides a theoretically grounded paradigm well-suited for neuromorphic AI, transforming inherent noise into a functional resource and showing promise for low-power adaptive systems with locality constraints.
Abstract: The pursuit of energy-efficient and adaptive artificial intelligence (AI) has positioned neuromorphic computing as a promising alternative to conventional computing. However, achieving learning on these platforms requires techniques that prioritize local information while enabling effective credit assignment. Here, we propose noise-based reward-modulated learning (NRL), a novel synaptic plasticity rule that mathematically unifies reinforcement learning and gradient-based optimization with biologically-inspired local updates. NRL addresses the computational bottleneck of exact gradients by approximating them through stochastic neural activity, transforming the inherent noise of biological and neuromorphic substrates into a functional resource. Drawing inspiration from biological learning, our method uses reward prediction errors as its optimization target to generate increasingly advantageous behavior, and eligibility traces to facilitate retrospective credit assignment. Experimental validation on reinforcement tasks, featuring immediate and delayed rewards, shows that NRL achieves performance comparable to baselines optimized using backpropagation, although with slower convergence, while showing significantly superior performance and scalability in multi-layer networks compared to reward-modulated Hebbian learning (RMHL), the most prominent similar approach. While tested on simple architectures, the results highlight the potential of noise-driven, brain-inspired learning for low-power adaptive systems, particularly in computing substrates with locality constraints. NRL offers a theoretically grounded paradigm well-suited for the event-driven characteristics of next-generation neuromorphic AI.
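The general idea of approximating a gradient from noise and a scalar reward, without backpropagation, can be illustrated with a one-parameter example. The Python sketch below is a generic perturbation-based estimator and not the NRL rule: it uses a single weight, a noise-free pass as the reward baseline instead of a learned reward prediction, and no eligibility traces. The reward function, `sigma`, and the sample count are assumptions.

```python
import random


def reward(w, target=2.0):
    """Toy scalar reward that peaks when the weight equals the target."""
    return -(w - target) ** 2


def noise_gradient_estimate(w, sigma=0.1, samples=20000, seed=0):
    """Estimate d(reward)/dw by correlating injected noise with reward changes."""
    rng = random.Random(seed)
    baseline = reward(w)  # reward of the unperturbed weight
    total = 0.0
    for _ in range(samples):
        xi = rng.gauss(0.0, sigma)  # noise injected into the weight
        # Reward difference times noise, scaled by the noise variance.
        total += (reward(w + xi) - baseline) * xi / sigma ** 2
    return total / samples


estimate = noise_gradient_estimate(0.0)
exact = -2.0 * (0.0 - 2.0)  # analytic gradient of the toy reward at w = 0
```

Each sample needs only the locally injected noise and a global scalar reward, which is the property that makes this family of rules attractive for hardware with locality constraints; the price is a noisy estimate that must be averaged over many samples.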
[360] AutoPDL: Automatic Prompt Optimization for LLM Agents
Claudio Spiess, Mandana Vaziri, Louis Mandel, Martin Hirzel
Main category: cs.LG
TL;DR: AutoPDL automates the discovery of optimal LLM agent configurations by treating prompt design as a structured AutoML problem, using successive halving to efficiently search through combinatorial spaces of prompting patterns and demonstrations.
Details
Motivation: Manual prompt tuning for LLMs is tedious, error-prone, and model/task-specific, creating a need for automated approaches to discover effective prompting strategies.
Method: Frames prompt optimization as structured AutoML over combinatorial spaces of agentic/non-agentic patterns and demonstrations, using successive halving for efficient search. Implements common patterns in PDL programming language for human-readable, editable solutions.
Result: Across 3 tasks and 7 LLMs (3B-70B parameters), achieved consistent accuracy gains of 9.21±15.46 percentage points (up to 67.5pp), with selected strategies varying across models and tasks.
Conclusion: AutoPDL successfully automates LLM prompt optimization, producing human-readable PDL programs that enable source-to-source optimization and human-in-the-loop refinement while achieving significant performance improvements.
Abstract: The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high-level prompting pattern (e.g., Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few-shot demonstrations). Manually tuning this combination is tedious, error-prone, and specific to a given LLM and task. Therefore, this paper proposes AutoPDL, an automated approach to discovering good LLM agent configurations. Our approach frames this as a structured AutoML problem over a combinatorial space of agentic and non-agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human-readable, editable, and executable PDL programs that use this library. This approach also enables source-to-source optimization, allowing human-in-the-loop refinement and reuse. Evaluations across three tasks and seven LLMs (ranging from 3B to 70B parameters) show consistent accuracy gains ($9.21\pm15.46$ percentage points), up to 67.5pp, and reveal that selected prompting strategies vary across models and tasks.
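The search procedure named in the abstract, successive halving, can be sketched as follows. This is a minimal illustration, not AutoPDL's code: the `evaluate(candidate, budget)` callback, the starting budget, and the halving factor `eta` are all assumptions made for the example.

```python
def successive_halving(candidates, evaluate, min_budget=4, eta=2):
    """Return the surviving candidate after repeatedly dropping the weakest ones.

    candidates: list of configuration labels (hypothetical, e.g. prompting patterns).
    evaluate(candidate, budget): hypothetical scorer; higher is better, and
        `budget` is the number of validation examples it may use.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    survivors = list(candidates)
    budget = min_budget
    while len(survivors) > 1:
        # Score every survivor on the current (small) budget.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        # Keep the top 1/eta fraction, then give the survivors a larger budget.
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]
```

Weak configurations are discarded after a cheap evaluation, so most of the evaluation budget goes to the few candidates that remain in later rounds.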
[361] Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Sambit Sahu, Tom Goldstein, Supriyo Chakraborty
Main category: cs.LG
TL;DR: Default MoE improves Mixture of Experts training by providing dense gradient updates to the router while maintaining sparse parameter activation, using exponential moving averages of expert outputs as substitutes for missing activations.
Details
Motivation: Standard MoE pretraining suffers from training instability and suboptimal performance due to sparse backward updates, where only activated experts receive gradients.
Method: Substitutes missing expert activations with default outputs using exponential moving averages of previous expert outputs, allowing the router to receive signals from all experts for each token.
Result: Default MoE outperforms standard TopK routing in various settings without significant computational overhead, leading to significant improvements in training performance.
Conclusion: The proposed Default MoE method effectively addresses training instability in MoE models by providing dense gradient updates to routers while maintaining computational efficiency.
Abstract: Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.
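The substitution described above can be illustrated with a forward-pass sketch in NumPy. It is a simplified reading of the abstract, not the authors' implementation: linear experts, a full softmax over all experts, a per-expert EMA updated from the batch mean, and the `decay` value are all assumptions.

```python
import numpy as np

def default_moe_forward(x, router_w, expert_ws, ema, k=1, decay=0.9):
    """x: (tokens, d); router_w: (d, n_experts); expert_ws: (n_experts, d, d);
    ema: (n_experts, d) running default output per expert (assumed layout)."""
    logits = x @ router_w
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    n_tokens, n_experts = probs.shape
    topk = np.argsort(-probs, axis=1)[:, :k]
    active = np.zeros((n_tokens, n_experts), dtype=bool)
    np.put_along_axis(active, topk, True, axis=1)

    # Every (token, expert) slot starts as the expert's default (EMA) output.
    outputs = np.broadcast_to(ema, (n_tokens, n_experts, x.shape[1])).copy()
    new_ema = ema.copy()
    for e in range(n_experts):
        rows = active[:, e]
        if rows.any():
            real = x[rows] @ expert_ws[e]  # only routed tokens run the expert
            outputs[rows, e] = real
            new_ema[e] = decay * ema[e] + (1 - decay) * real.mean(axis=0)

    # All experts contribute a term, so each router probability affects the output.
    y = (probs[:, :, None] * outputs).sum(axis=1)
    return y, new_ema
```

In an autograd framework, the final weighted sum is what would give the router a gradient with respect to every expert's probability, while only `k` experts are actually evaluated per token.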
[362] Surrogate modeling of Cellular-Potts Agent-Based Models as a segmentation task using the U-Net neural network architecture
Tien Comlekoglu, J. Quetzalcóatl Toledo-Marín, Tina Comlekoglu, Douglas W. DeSimone, Shayn M. Peirce, Geoffrey Fox, James A. Glazier
Main category: cs.LG
TL;DR: Developed a CNN-based surrogate model using U-Net architecture to accelerate Cellular-Potts model simulations by 590x while maintaining accuracy in predicting emergent vascular behaviors.
Details
Motivation: Cellular-Potts models are computationally expensive due to explicit modeling of interactions among large numbers of agents and diffusive fields, limiting their scalability.
Method: Used convolutional neural network with U-Net architecture that accounts for periodic boundary conditions, trained to predict 100 computational steps ahead (Monte-Carlo steps).
Result: Achieved 590x acceleration compared to original CPM code execution while effectively capturing emergent behaviors like vessel sprouting, extension, anastomosis, and vascular lacunae contraction.
Conclusion: Deep learning can serve as efficient surrogate models for CPM simulations, enabling faster evaluation of computationally expensive biological processes at greater spatial and temporal scales.
Abstract: The Cellular-Potts model is a powerful and ubiquitous framework for developing computational models for simulating complex multicellular biological systems. Cellular-Potts models (CPMs) are often computationally expensive due to the explicit modeling of interactions among large numbers of individual model agents and diffusive fields described by partial differential equations (PDEs). In this work, we develop a convolutional neural network (CNN) surrogate model using a U-Net architecture that accounts for periodic boundary conditions. We use this model to accelerate the evaluation of a mechanistic CPM previously used to investigate in vitro vasculogenesis. The surrogate model was trained to predict 100 computational steps ahead (Monte-Carlo steps, MCS), accelerating simulation evaluations by a factor of 590 compared to CPM code execution. Over multiple recursive evaluations, our model effectively captures the emergent behaviors demonstrated by the original Cellular-Potts model, such as vessel sprouting, extension and anastomosis, and contraction of vascular lacunae. This approach demonstrates the potential for deep learning to serve as efficient surrogate models for CPM simulations, enabling faster evaluation of computationally expensive CPMs of biological processes at greater spatial and temporal scales.
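One common way to make a convolution respect periodic boundary conditions is circular ("wrap") padding, sketched below in NumPy. The abstract does not say how the U-Net handles periodicity, so this is an assumed technique shown on a single 2D field, with `periodic_conv2d` a hypothetical helper name.

```python
import numpy as np

def periodic_conv2d(field, kernel):
    """Cross-correlate a 2D field with an odd-sized kernel on a periodic lattice."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Wrap padding copies the opposite edge, so the lattice behaves like a torus.
    padded = np.pad(field, ((ph, ph), (pw, pw)), mode="wrap")
    out = np.zeros(field.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + field.shape[0], j:j + field.shape[1]]
    return out
```

With this padding, shifting the input around the lattice shifts the output by the same amount, so cells crossing an edge are treated like cells in the interior.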
[363] Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning
Hongjoon Ahn, Heewoong Choi, Jisu Han, Taesup Moon
Main category: cs.LG
TL;DR: OTA improves offline goal-conditioned RL by incorporating temporal abstraction into value learning, enabling better advantage estimates for high-level policies in long-horizon tasks.
Details
Motivation: Offline GCRL struggles with long-horizon tasks due to high-level policy's inability to generate appropriate subgoals and incorrect advantage estimates during learning.
Method: Option-aware Temporally Abstracted value learning (OTA) modifies value updates to be option-aware, contracting the effective horizon length for better advantage estimates.
Result: OTA achieves strong performance on complex tasks from OGBench, including maze navigation and visual robotic manipulation environments.
Conclusion: Incorporating temporal abstraction into value learning enables effective high-level policy learning in offline GCRL for long-horizon tasks.
Abstract: Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm in which goal-reaching policies are trained from abundant state-action trajectory datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. Investigating the root cause of this challenge, we make two observations. First, performance bottlenecks mainly stem from the high-level policy’s inability to generate appropriate subgoals. Second, when learning the high-level policy in the long-horizon regime, the sign of the advantage estimate frequently becomes incorrect. Thus, we argue that improving the value function to produce a clear advantage estimate for learning the high-level policy is essential. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. By modifying the value update to be option-aware, our approach contracts the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We experimentally show that the high-level policy learned using the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments.
[364] Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning
Viet Anh Khoa Tran, Emre Neftci, Willem A. M. Wybo
Main category: cs.LG
TL;DR: TMCL uses predictive coding principles from neuroscience to enable continual learning without catastrophic forgetting, integrating sparse labeled data while maintaining generalization through task-modulated contrastive learning.
Details
Motivation: Biological brains learn continually from unlabeled data streams while integrating specialized information from sparse labels, but machine learning methods suffer from catastrophic forgetting when fine-tuning for new tasks.
Method: Task-modulated contrastive learning (TMCL) uses predictive coding principles to create view-invariant representations, learns affine modulations for new classes without affecting feedforward weights, and introduces modulation invariance to stabilize representations.
Result: TMCL shows improvements in class-incremental and transfer learning over state-of-the-art unsupervised and supervised approaches, achieving good performance with only 1% of available labels.
Conclusion: Top-down modulations play a crucial role in balancing stability and plasticity in continual learning, enabling effective integration of new information without forgetting previous knowledge.
Abstract: Biological brains learn continually from a stream of unlabeled data, while integrating specialized information from sparsely labeled examples without compromising their ability to generalize. Meanwhile, machine learning methods are susceptible to catastrophic forgetting in this natural learning setting, as supervised specialist fine-tuning degrades performance on the original task. We introduce task-modulated contrastive learning (TMCL), which takes inspiration from the biophysical machinery in the neocortex, using predictive coding principles to integrate top-down information continually and without supervision. We follow the idea that these principles build a view-invariant representation space, and that this can be implemented using a contrastive loss. Then, whenever labeled samples of a new class occur, new affine modulations are learned that improve separation of the new class from all others, without affecting feedforward weights. By co-opting the view-invariance learning mechanism, we then train feedforward weights to match the unmodulated representation of a data sample to its modulated counterparts. This introduces modulation invariance into the representation space, and, by also using past modulations, stabilizes it. Our experiments show improvements in both class-incremental and transfer learning over state-of-the-art unsupervised approaches, as well as over comparable supervised approaches, using as few as 1% of available labels. Taken together, our work suggests that top-down modulations play a crucial role in balancing stability and plasticity.
[365] Imagine Beyond! Distributionally Robust Auto-Encoding for State Space Coverage in Online Reinforcement Learning
Nicolas Castanet, Olivier Sigaud, Sylvain Lamprier
Main category: cs.LG
TL;DR: DRAG addresses the problem of limited state coverage in Goal-Conditioned Reinforcement Learning by combining β-VAE with Distributionally Robust Optimization to create more comprehensive latent representations.
Details
Motivation: In visual GCRL, auto-encoders often converge to latent spaces that over-represent frequently visited states, especially when using intrinsic motivation. This limits the diversity of skills learned and exploration capability.
Method: DRAG combines β-VAE framework with Distributionally Robust Optimization, using an adversarial neural weighter to enforce distributional shifts towards uniform state space coverage, allowing construction of semantically meaningful latent spaces beyond immediate experience.
Result: The approach improves state space coverage and downstream control performance on hard exploration environments like mazes and robotic control with obstacles, without requiring pre-training or prior environment knowledge.
Conclusion: DRAG successfully addresses the coverage limitation in GCRL by ensuring latent spaces represent the full state space, enabling better exploration and skill learning in challenging environments.
Abstract: Goal-Conditioned Reinforcement Learning (GCRL) enables agents to autonomously acquire diverse behaviors, but faces major challenges in visual environments due to high-dimensional, semantically sparse observations. In the online setting, where agents learn representations while exploring, the latent space evolves with the agent’s policy, to capture newly discovered areas of the environment. However, without incentivization to maximize state coverage in the representation, classical approaches based on auto-encoders may converge to latent spaces that over-represent a restricted set of states frequently visited by the agent. This is exacerbated in an intrinsic motivation setting, where the agent uses the distribution encoded in the latent space to sample the goals it learns to master. To address this issue, we propose to progressively enforce distributional shifts towards a uniform distribution over the full state space, to ensure a full coverage of skills that can be learned in the environment. We introduce DRAG (Distributionally Robust Auto-Encoding for GCRL), a method that combines the $\beta$-VAE framework with Distributionally Robust Optimization. DRAG leverages an adversarial neural weighter of training states of the VAE, to account for the mismatch between the current data distribution and unseen parts of the environment. This allows the agent to construct semantically meaningful latent spaces beyond its immediate experience. Our approach improves state space coverage and downstream control performance on hard exploration environments such as mazes and robotic control involving walls to bypass, without pre-training or prior environment knowledge.
[366] Continuous-time Riemannian SGD and SVRG Flows on Wasserstein Probabilistic Space
Mingyang Yi, Bohan Wang
Main category: cs.LG
TL;DR: Extends stochastic optimization methods (SGD and SVRG) from Euclidean space to Wasserstein space using SDE approximations and Fokker-Planck equations, achieving convergence rates comparable to Euclidean settings.
Details
Motivation: To enrich continuous optimization methods in Wasserstein space beyond standard Riemannian gradient flow, since optimization on Wasserstein space connects to practical sampling processes.
Method: Construct SDEs to approximate discrete Euclidean dynamics of Riemannian stochastic methods, then derive flows in Wasserstein space via Fokker-Planck equation.
Result: Successfully extended SGD flow and SVRG flow to Wasserstein space with established convergence rates.
Conclusion: The proposed framework effectively brings stochastic optimization methods to Wasserstein space while maintaining convergence properties similar to Euclidean settings.
Abstract: Recently, optimization on Riemannian manifolds has provided valuable insights to the optimization community. In this regard, extending these methods to the Wasserstein space is of particular interest, since optimization on Wasserstein space is closely connected to practical sampling processes. Generally, the standard (continuous) optimization method on Wasserstein space is Riemannian gradient flow (i.e., Langevin dynamics when minimizing KL divergence). In this paper, we aim to enrich the family of continuous optimization methods in the Wasserstein space, by extending the gradient flow on it into the stochastic gradient descent (SGD) flow and stochastic variance reduction gradient (SVRG) flow. By leveraging the properties of Wasserstein space, we construct stochastic differential equations (SDEs) to approximate the corresponding discrete Euclidean dynamics of the desired Riemannian stochastic methods. Then, we obtain the flows in Wasserstein space via the Fokker-Planck equation. Finally, we establish convergence rates of the proposed stochastic flows, which align with those known in the Euclidean setting.
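The parenthetical about Langevin dynamics can be spelled out with the standard textbook identity below. The notation is an assumption for illustration (target density $\pi \propto e^{-V}$, law $\rho_t$ of $X_t$, Brownian motion $B_t$) and is not taken from the paper. The Langevin SDE and the Fokker-Planck equation of its law are

$$
dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t,
\qquad
\partial_t \rho_t = \nabla\cdot(\rho_t \nabla V) + \Delta \rho_t
= \nabla\cdot\!\left(\rho_t \nabla \log\frac{\rho_t}{\pi}\right).
$$

The last form is the Wasserstein gradient flow of $\mathrm{KL}(\rho \,\|\, \pi)$, since $\rho_t \nabla\log(\rho_t/\pi) = \nabla\rho_t + \rho_t\nabla V$. This is the sense in which an SDE on Euclidean space and a flow on Wasserstein space describe the same dynamics, and it is the link the abstract uses to pass from SDEs to flows.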
[367] DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
Qi Cao, Ruiyi Wang, Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie
Main category: cs.LG
TL;DR: DreamPRM is a domain-reweighted training framework for multimodal Process Reward Models that uses bi-level optimization to address dataset quality imbalance and improve generalization in multimodal reasoning tasks.
Details
Motivation: Multimodal reasoning faces severe distribution shift and dataset quality imbalance issues when extending Process Reward Models from text-only to multimodal scenarios, requiring effective data selection strategies for reliable training.
Method: DreamPRM employs bi-level optimization: lower-level fine-tunes PRM on multiple datasets with domain weights to prioritize high-quality reasoning signals, while upper-level evaluates PRM on meta-learning dataset and updates domain weights via aggregation loss to improve generalization.
Result: Extensive experiments on multimodal reasoning benchmarks show DreamPRM consistently improves state-of-the-art MLLM performance, surpassing other data selection methods and yielding higher accuracy gains than existing test-time scaling approaches.
Conclusion: DreamPRM effectively addresses multimodal PRM training challenges through domain-reweighted optimization, demonstrating superior performance in handling dataset quality imbalance and improving generalization across mathematical and general reasoning tasks.
Abstract: Reasoning has substantially improved the performance of large language models (LLMs) on complicated tasks. Central to the current reasoning studies, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Since multimodal reasoning covers a wider range of tasks compared to text-only scenarios, the resulting distribution shift from the training to testing sets is more severe, leading to greater generalization difficulty. Training a reliable multimodal PRM, therefore, demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from a marked quality imbalance, which degrades PRM performance and highlights the need for an effective data selection strategy. To address the issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs which employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM’s domain-reweighting strategy surpasses other data selection methods and yields higher accuracy gains than existing test-time scaling approaches.
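The two levels described above follow the generic shape of a bi-level problem, which can be written schematically as below. All symbols are assumptions introduced for illustration: $\alpha_k$ is the weight of domain $k$, $\phi$ the PRM parameters, $\mathcal{L}_k$ the fine-tuning loss on dataset $k$, and $\mathcal{L}_{\text{meta}}$ the loss on the held-out meta-learning dataset. The paper's exact losses, including its aggregation loss, may differ.

$$
\min_{\alpha}\; \mathcal{L}_{\text{meta}}\bigl(\phi^{*}(\alpha)\bigr)
\quad\text{subject to}\quad
\phi^{*}(\alpha) = \arg\min_{\phi} \sum_{k=1}^{K} \alpha_k\, \mathcal{L}_k(\phi).
$$

Read this way, a domain whose data helps held-out performance tends to receive a larger $\alpha_k$, while a low-quality domain is down-weighted rather than removed outright.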
[368] Energy-Based Model for Accurate Estimation of Shapley Values in Feature Attribution
Cheng Lu, Jiusun Zeng, Yu Xia, Jinhui Cai, Shihua Luo
Main category: cs.LG
TL;DR: EmSHAP is a novel Shapley value estimation method using energy-based models to accurately capture conditional dependencies among features, with GRU networks and dynamic masking for improved accuracy.
Details
Motivation: Traditional Shapley value estimation struggles with capturing complex conditional dependencies among features in real-world data environments, requiring better methods for accurate attribution.
Method: Uses energy-based models to estimate conditional probabilities, incorporates GRU networks for capturing long-term dependencies and mitigating feature ordering effects, and employs dynamic masking mechanism for robustness.
Result: Theoretical analysis shows improved error bounds, and four case studies demonstrate higher accuracy and better scalability compared to competitive methods.
Conclusion: EmSHAP provides an effective and accurate solution for Shapley value estimation in complex data environments by leveraging energy-based modeling and neural network enhancements.
Abstract: Shapley value is a widely used tool in explainable artificial intelligence (XAI), as it provides a principled way to attribute contributions of input features to model outputs. However, estimation of Shapley value requires capturing conditional dependencies among all feature combinations, which poses significant challenges in complex data environments. In this article, EmSHAP (Energy-based model for Shapley value estimation), an accurate Shapley value estimation method, is proposed to estimate the expectation of Shapley contribution function under the arbitrary subset of features given the rest. By utilizing the ability of energy-based model (EBM) to model complex distributions, EmSHAP provides an effective solution for estimating the required conditional probabilities. To further improve estimation accuracy, a GRU (Gated Recurrent Unit)-coupled partition function estimation method is introduced. The GRU network captures long-term dependencies with a lightweight parameterization and maps input features into a latent space to mitigate the influence of feature ordering. Additionally, a dynamic masking mechanism is incorporated to further enhance the robustness and accuracy by progressively increasing the masking rate. Theoretical analysis on the error bound as well as application to four case studies verified the higher accuracy and better scalability of EmSHAP in contrast to competitive methods.
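For reference, the quantity being estimated can be computed exactly for small feature sets, as in the sketch below. It is the textbook Shapley formula, not EmSHAP: the `value` callback stands in for the contribution function (which EmSHAP approximates through conditional expectations) and is a hypothetical black box here.

```python
from itertools import combinations
from math import factorial

def shapley_values(n_features, value):
    """Exact Shapley values; value(frozenset_of_feature_indices) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(n_features):
            # Weight of a coalition of this size that excludes feature i.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

Each feature requires a pass over all $2^{n-1}$ coalitions of the other features, and every coalition needs its own value-function evaluation, which is why learned estimators are used once the number of features grows.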
[369] Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin
Main category: cs.LG
TL;DR: Latent Zoning Network (LZN) unifies generative modeling, representation learning, and classification through a shared Gaussian latent space with disjoint zones for different data types, enabling task composition via encoder-decoder pairs.
Details
Motivation: Current state-of-the-art solutions for generative modeling, representation learning, and classification remain largely disjoint, creating complex ML pipelines. The paper seeks to unify these three core ML problems under a single principle.
Method: LZN creates a shared Gaussian latent space where different data types (images, text, labels) are mapped to disjoint zones via encoders. ML tasks are expressed as compositions of these encoders and decoders - e.g., label-conditional generation uses label encoder and image decoder.
Result: LZN improves FID on CIFAR10 from 2.76 to 2.59 when combined with Rectified Flow; outperforms MoCo and SimCLR by 9.3% and 0.2% respectively on ImageNet linear classification; achieves state-of-the-art classification accuracy on CIFAR10 while improving FID in joint tasks.
Conclusion: LZN demonstrates that a unified approach can effectively address generative modeling, representation learning, and classification simultaneously, simplifying ML pipelines and fostering synergy across tasks.
Abstract: Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59, without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
[370] RASPNet: A Benchmark Dataset for Radar Adaptive Signal Processing Applications
Shyam Venkatasubramanian, Bosung Kang, Ali Pezeshki, Muralidhar Rangaswamy, Vahid Tarokh
Main category: cs.LG
TL;DR: RASPNet is a 16TB dataset for radar adaptive signal processing with 100 realistic scenarios across US topographies, providing 10,000 clutter realizations per scenario to benchmark radar algorithms and complex-valued neural networks.
Details
Motivation: To address the gap in large-scale, realistic datasets for standardizing evaluation of radar adaptive signal processing techniques and complex-valued neural networks in the adaptive radar community.
Method: Compiled 100 realistic scenarios over various US topographies and land types, generating 10,000 clutter realizations per scenario from airborne radar settings to create a standardized benchmark dataset.
Result: Successfully created RASPNet - a 16TB dataset that enables standardized evaluation of RASP techniques and supports development of data-driven models, including demonstrating transfer learning applications for real-world adaptive radar scenarios.
Conclusion: RASPNet fills a critical gap by providing a comprehensive, large-scale dataset that standardizes benchmarking for radar adaptive signal processing and complex-valued learning algorithms, supporting the development of more effective data-driven models in adaptive radar applications.
Abstract: We present a large-scale dataset called RASPNet for radar adaptive signal processing (RASP) applications to support the development of data-driven models within the adaptive radar community. RASPNet exceeds 16 TB in size and comprises 100 realistic scenarios compiled over a variety of topographies and land types across the contiguous United States. For each scenario, RASPNet comprises 10,000 clutter realizations from an airborne radar setting, which can be used to benchmark radar and complex-valued learning algorithms. RASPNet intends to fill a prominent gap in the availability of a large-scale, realistic dataset that standardizes the evaluation of RASP techniques and complex-valued neural networks. We outline its construction, organization, and several applications, including a transfer learning example to demonstrate how RASPNet can be used for real-world adaptive radar scenarios.
[371] Feature compression is the root cause of adversarial fragility in neural network classifiers
Jingchao Gao, Ziqing Lu, Raghu Mudumbai, Xiaodong Wu, Jirong Yi, Myung Cho, Catherine Xu, Hui Xie, Weiyu Xu
Main category: cs.LG
TL;DR: The paper provides a matrix-theoretic explanation for why deep neural networks have poor adversarial robustness compared to optimal classifiers, showing their robustness degrades with increasing input dimension.
Details
Motivation: To understand why deep neural networks are vulnerable to adversarial attacks and how their robustness compares to optimal classifiers.
Method: Used matrix-theoretic analysis to study the smallest magnitude of additive perturbations that can change classifier outputs, focusing on how robustness scales with input dimension.
Result: Theoretical analysis shows neural networks’ adversarial robustness can be only 1/√d of the best possible robustness of optimal classifiers, and this matches well with numerical experiments including ImageNet networks.
Conclusion: Neural networks’ adversarial fragility stems from fundamental limitations in high-dimensional spaces, with robustness degrading as input dimension increases, consistent with information-theoretic explanations.
Abstract: In this paper, we uniquely study the adversarial robustness of deep neural networks (NN) for classification tasks against that of optimal classifiers. We look at the smallest magnitude of possible additive perturbations that can change a classifier’s output. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural networks for classification. In particular, our theoretical results show that a neural network’s adversarial robustness can degrade as the input dimension $d$ increases. Analytically, we show that neural networks’ adversarial robustness can be only $1/\sqrt{d}$ of the best possible adversarial robustness of optimal classifiers. Our theories match remarkably well with numerical experiments of practically trained NN, including NN for ImageNet images. The matrix-theoretic explanation is consistent with an earlier information-theoretic feature-compression-based explanation for the adversarial fragility of neural networks.
[372] Network Anomaly Traffic Detection via Multi-view Feature Fusion
Song Hao, Wentao Fu, Xuanze Chen, Chengxiang Jin, Jiajun Zhou, Shanqing Yu, Qi Xuan
Main category: cs.LG
TL;DR: Proposes MuFF, a multi-view feature fusion method for network anomaly detection that combines temporal and interactive perspectives to overcome limitations of single-view approaches.
Details
Motivation: Traditional single-view anomaly detection methods have limitations in handling complex attacks and encrypted communications, requiring a more comprehensive approach.
Method: MuFF models temporal and interactive relationships of packets from different viewpoints, learns corresponding features, and fuses them for anomaly detection.
Result: Extensive experiments on six real traffic datasets demonstrate excellent performance in network anomalous traffic detection.
Conclusion: MuFF effectively compensates for the shortcomings of single-perspective detection by leveraging multi-view feature fusion.
Abstract: Traditional anomalous traffic detection methods are based on single-view analysis, which has obvious limitations in dealing with complex attacks and encrypted communications. In this regard, we propose a Multi-view Feature Fusion (MuFF) method for network anomaly traffic detection. MuFF models the temporal and interactive relationships of packets in network traffic based on the temporal and interactive viewpoints respectively. It learns temporal and interactive features. These features are then fused from different perspectives for anomaly traffic detection. Extensive experiments on six real traffic datasets show that MuFF has excellent performance in network anomalous traffic detection, which makes up for the shortcomings of detection under a single perspective.
[373] End-to-End Probabilistic Framework for Learning with Hard Constraints
Utkarsh Utkarsh, Danielle C. Maddix, Ruijun Ma, Michael W. Mahoney, Yuyang Wang
Main category: cs.LG
TL;DR: ProbHardE2E is a probabilistic forecasting framework that incorporates hard constraints and uncertainty quantification through a differentiable probabilistic projection layer, enabling end-to-end learning without distributional assumptions.
Details
Motivation: Existing approaches either satisfy constraints through post-processing or at inference, and often rely on biased likelihood-based objectives with restrictive distributional assumptions, limiting their robustness and flexibility.
Method: Uses a novel differentiable probabilistic projection layer (DPPL) that can be combined with various neural network architectures, optimizes strictly proper scoring rules without distributional assumptions, and handles non-linear constraints.
Result: Successfully applied to learning partial differential equations with uncertainty estimates and probabilistic time-series forecasting, demonstrating broad applicability across different domains.
Conclusion: ProbHardE2E provides a general framework for probabilistic forecasting that handles hard constraints end-to-end, offers robust uncertainty quantification, and connects disparate domains through its flexible methodology.
Abstract: We present ProbHardE2E, a probabilistic forecasting framework that incorporates hard operational/physical constraints, and provides uncertainty quantification. Our methodology uses a novel differentiable probabilistic projection layer (DPPL) that can be combined with a wide range of neural network architectures. DPPL allows the model to learn the system in an end-to-end manner, compared to other approaches where constraints are satisfied either through a post-processing step or at inference. ProbHardE2E optimizes a strictly proper scoring rule, without making any distributional assumptions on the target, which enables it to obtain robust distributional estimates (in contrast to existing approaches that generally optimize likelihood-based objectives, which are heavily biased by their distributional assumptions and model choices); and it can incorporate a range of non-linear constraints (increasing the power of modeling and flexibility). We apply ProbHardE2E in learning partial differential equations with uncertainty estimates and to probabilistic time-series forecasting, showcasing it as a broadly applicable general framework that connects these seemingly disparate domains.
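To give a feel for what projecting a predictive distribution onto a hard constraint can mean, here is a minimal sketch. It assumes a linear equality constraint A x = b, a mean/covariance summary of the prediction, and a plain Euclidean projection; the paper's DPPL handles non-linear constraints and may be defined differently. The function name `project_gaussian` and all numbers are hypothetical.

```python
import numpy as np


def project_gaussian(mu, Sigma, A, b):
    """Project a mean and covariance onto the set {x : A x = b}.

    Uses the Euclidean projection x -> x - G (A x - b) with
    G = A^T (A A^T)^{-1}. The map is affine, so the projected mean
    and covariance follow from standard linear-transformation rules.
    """
    G = A.T @ np.linalg.inv(A @ A.T)   # right pseudo-inverse of A
    P = np.eye(len(mu)) - G @ A        # projector onto the null space of A
    mu_proj = mu - G @ (A @ mu - b)
    Sigma_proj = P @ Sigma @ P.T
    return mu_proj, Sigma_proj


# Hypothetical example: three forecast components that must sum to 1.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
mu = np.array([0.2, 0.5, 0.9])
Sigma = np.diag([0.04, 0.01, 0.09])

mu_proj, Sigma_proj = project_gaussian(mu, Sigma, A, b)
```

Because the projection consists only of matrix products, it is differentiable with respect to `mu` and `Sigma`, so it could be placed inside an autodiff framework and trained through, which is the end-to-end property the abstract emphasizes.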
[374] Constrained Optimal Fuel Consumption of HEVs under Observational Noise
Shuchang Yan, Haoran Sun
Main category: cs.LG
TL;DR: This study extends prior work on constrained optimal fuel consumption in hybrid electric vehicles by incorporating observational noise in SOC and reference speed measurements, using robust constrained reinforcement learning to maintain performance under real-world sensor imperfections.
Details
Motivation: Real-world HEV operations face sensor noise in SOC measurements and deviations in reference speeds from actual driving conditions, which were not considered in previous perfect-measurement assumptions.
Method: Reformulated the COFC problem with explicit observational noise modeling (uniform distribution), employed robust CRL approach with structured training for stability, and evaluated on Toyota Prius hybrid system using NEDC and WLTC cycles.
Result: Fuel consumption and SOC constraint satisfaction remained robust across varying noise levels, with analysis showing different impacts of SOC vs. speed observational noise on fuel consumption.
Conclusion: This is the first study to explicitly examine how observational noise affects constrained optimal fuel consumption in HEVs, providing insights relevant to dynamometer testing and predictive energy control applications.
Abstract: In our prior work, we investigated the minimum fuel consumption of a hybrid electric vehicle (HEV) under a state-of-charge (SOC) balance constraint, assuming perfect SOC measurements and accurate reference speed profiles. The constrained optimal fuel consumption (COFC) problem was addressed using a constrained reinforcement learning (CRL) framework. However, in real-world scenarios, SOC readings are often corrupted by sensor noise, and reference speeds may deviate from actual driving conditions. To account for these imperfections, this study reformulates the COFC problem by explicitly incorporating observational noise in both SOC and reference speed. We adopt a robust CRL approach, where the noise is modeled as a uniform distribution, and employ a structured training procedure to ensure stability. The proposed method is evaluated through simulations on the Toyota Prius hybrid system (THS), using both the New European Driving Cycle (NEDC) and the Worldwide Harmonized Light Vehicles Test Cycle (WLTC). Results show that fuel consumption and SOC constraint satisfaction remain robust across varying noise levels. Furthermore, the analysis reveals that observational noise in SOC and speed can impact fuel consumption to different extents. To the best of our knowledge, this is the first study to explicitly examine how observational noise – commonly encountered in dynamometer testing and predictive energy control (PEC) applications – affects constrained optimal fuel consumption in HEVs.
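The noise model described above can be sketched as a small observation wrapper. This is an illustration only: it assumes additive, zero-centered uniform noise applied independently to the SOC and the reference speed, and the function name and noise half-widths are hypothetical rather than taken from the paper.

```python
import numpy as np


def noisy_observation(soc, ref_speed, soc_eps, speed_eps, rng):
    """Return (SOC, reference speed) corrupted by uniform noise.

    soc_eps and speed_eps are hypothetical half-widths of the uniform
    noise intervals; the true state is left untouched.
    """
    soc_obs = soc + rng.uniform(-soc_eps, soc_eps)
    speed_obs = ref_speed + rng.uniform(-speed_eps, speed_eps)
    return soc_obs, speed_obs


rng = np.random.default_rng(0)
# The agent would see these values instead of the true SOC and speed.
soc_obs, speed_obs = noisy_observation(0.60, 50.0, 0.02, 1.5, rng)
```

In a robust training setup the policy would receive only the noisy pair, while fuel consumption and the SOC balance constraint are still evaluated on the true state.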
[375] Multiscale spatiotemporal heterogeneity analysis of bike-sharing system’s self-loop phenomenon: Evidence from Shanghai
Yichen Wang, Qing Yu, Yancun Song
Main category: cs.LG
TL;DR: Study analyzes bike-sharing self-loop phenomenon using spatial autoregressive model and double machine learning, finding spatial lag effects and key socioeconomic factors influencing bike redistribution patterns.
Details
Motivation: Bike-sharing self-loop phenomenon (bikes returned to same station) impacts service equity, requiring analysis of socioeconomic and geospatial factors affecting this pattern.
Method: Multiscale analysis using spatial autoregressive model and double machine learning framework to assess impacts at metro stations and street scales.
Result: Self-loop intensity shows spatial lag effect at street scale, positively associated with residential land use. Higher effects in areas with middle-aged residents, high employment, low car ownership. Multimodal transit shows positive effects at both scales.
Conclusion: Recommend increasing bike availability in high metro usage/low bus coverage areas and implementing adaptable redistribution strategies to enhance bike-sharing cooperation.
Abstract: Bike-sharing is an environmentally friendly shared mobility mode, but its self-loop phenomenon, where bikes are returned to the same station after several uses, significantly impacts equity in access to its services. Therefore, this study conducts a multiscale analysis with a spatial autoregressive model and a double machine learning framework to assess the impact of socioeconomic features and geospatial location on the self-loop phenomenon at the metro-station and street scales. The results reveal that bike-sharing self-loop intensity exhibits a significant spatial lag effect at the street scale and is positively associated with residential land use. The marginal treatment effects of residential land use are higher on streets with middle-aged residents, high fixed employment, and low car ownership. The multimodal public transit condition shows significant positive marginal treatment effects at both scales. To enhance bike-sharing cooperation, we advocate augmenting bicycle availability in areas with high metro usage and low bus coverage, alongside implementing adaptable redistribution strategies.
[376] ABS: Enforcing Constraint Satisfaction On Generated Sequences Via Automata-Guided Beam Search
Vincenzo Collura, Karim Tit, Laura Bussi, Eleonora Giunchiglia, Maxime Cordy
Main category: cs.LG
TL;DR: ABS is a model-agnostic inference-time algorithm that guarantees constraint compliance using Deterministic Finite Automata (DFA) to guide beam search, achieving perfect constraint satisfaction while maintaining or improving output quality.
Details
Motivation: Current autoregressive models like LLMs excel at capturing statistical patterns but cannot guarantee compliance with formal constraints, limiting their reliability in applications requiring strict constraint satisfaction.
Method: ABS uses DFA-compiled constraints to guide beam search by masking transitions that lead to violations and dynamically re-ranking paths based on model probabilities and automaton acceptance structure, without requiring model retraining.
Result: ABS achieves perfect constraint satisfaction across three tasks (constrained image-stream classification, controlled text generation, text infilling) while outperforming or matching state-of-the-art baselines on quality metrics and efficiency.
Conclusion: ABS provides a general, model-agnostic solution for guaranteed constraint compliance in sequence generation tasks, bridging the gap between statistical learning and formal verification.
Abstract: Sequence generation and prediction form a cornerstone of modern machine learning, with applications spanning natural language processing, program synthesis, and time-series forecasting. These tasks are typically modeled in an autoregressive fashion, where each token is generated conditional on the preceding ones, and beam search is commonly used to balance exploration and fluency during decoding. While deep learning models and Large Language Models (LLMs) excel at capturing statistical patterns in this setting, they remain ill-equipped to guarantee compliance with formal constraints. In this paper, we introduce ABS: a general and model-agnostic inference-time algorithm that guarantees compliance with any constraint that can be compiled into a Deterministic Finite Automaton (DFA), without requiring retraining. ABS leverages the DFA to guide a constrained variant of beam search: at each decoding step, transitions leading to violations are masked, while remaining paths are dynamically re-ranked according to both the model’s probabilities and the automaton’s acceptance structure. We formally prove that the resulting sequences are guaranteed to satisfy the given constraints, and we empirically demonstrate that ABS also improves output quality. We validate our approach on three distinct tasks: constrained image-stream classification, controlled text generation, and text infilling. In all settings, ABS achieves perfect constraint satisfaction, while outperforming or matching state-of-the-art baselines on standard quality metrics and efficiency.
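The decoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy DFA, the `toy_log_probs` stand-in model, and the `dfa_beam_search` helper are all hypothetical, and the re-ranking by acceptance structure is simplified to keeping only hypotheses that end in an accepting state.

```python
import math

# Hypothetical toy DFA over the vocabulary {"a", "b"}: it accepts sequences
# that contain at least one "b" and never two "a" tokens in a row.
# A missing (state, token) entry means the transition is a violation.
DFA = {
    ("start", "a"): "seen_a",
    ("start", "b"): "ok",
    ("seen_a", "b"): "ok",
    ("ok", "a"): "ok_a",
    ("ok", "b"): "ok",
    ("ok_a", "b"): "ok",
}
ACCEPTING = {"ok", "ok_a"}


def toy_log_probs(prefix):
    """Stand-in for an autoregressive model: always prefers "a"."""
    return {"a": math.log(0.7), "b": math.log(0.3)}


def dfa_beam_search(length, beam_width=2):
    # Each hypothesis is (log-probability, tokens, DFA state).
    beams = [(0.0, [], "start")]
    for _ in range(length):
        candidates = []
        for score, tokens, state in beams:
            for tok, lp in toy_log_probs(tokens).items():
                nxt = DFA.get((state, tok))
                if nxt is None:
                    continue  # mask transitions that lead to a violation
                candidates.append((score + lp, tokens + [tok], nxt))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    # Keep only hypotheses whose final DFA state is accepting.
    accepted = [beam for beam in beams if beam[2] in ACCEPTING]
    return accepted[0][1] if accepted else None


result = dfa_beam_search(3)
```

Without the mask, the toy model would decode "a a a", which violates the constraint; with it, the search returns the most probable sequence among those the automaton permits.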
[377] MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai
Main category: cs.LG
TL;DR: MoE-CAP is a benchmark for Mixture-of-Experts systems that reveals the inherent trade-off between Cost, Accuracy, and Performance (CAP), showing current hardware can only optimize two of three dimensions.
Details
Motivation: Existing benchmarks fail to accurately capture the trade-offs between cost, accuracy, and performance in sparse MoE systems, complicating practical deployment decisions.
Method: Introduces MoE-CAP benchmark with sparsity-aware metrics (S-MBU and S-MFU) and CAP Radar Diagram to visualize trade-offs across diverse hardware platforms.
Result: Analysis shows MoE systems typically optimize two CAP dimensions at the expense of the third, demonstrating the MoE-CAP trade-off phenomenon.
Conclusion: MoE-CAP enables accurate performance benchmarking and helps understand the fundamental CAP trade-offs in MoE system deployments.
Abstract: The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.
[378] Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
Minhyuk Seo, Taeheon Kim, Hankook Lee, Jonghyun Choi, Tinne Tuytelaars
Main category: cs.LG
TL;DR: FedMosaic addresses data and model heterogeneity in personalized federated learning through task-relevance-aware model aggregation and dimension-invariant modules, achieving superior performance in realistic multi-modal scenarios.
Details
Motivation: Existing PFL methods are limited to simplified scenarios with homogeneous data and models, while real-world applications require handling diverse tasks with distribution shifts over time.
Method: FedMosaic uses task-relevance-aware model aggregation to reduce parameter interference and dimension-invariant modules for knowledge sharing across heterogeneous architectures without high computational costs.
Result: Empirical evaluation on a multi-modal benchmark with 40 distinct tasks shows FedMosaic outperforms state-of-the-art PFL methods in both personalization and generalization capabilities.
Conclusion: FedMosaic successfully addresses realistic PFL scenarios with data and model heterogeneity, demonstrating strong performance in challenging multi-task environments with distribution shifts.
Abstract: As AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients’ knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we propose FedMosaic, a method that jointly addresses data and model heterogeneity with a task-relevance-aware model aggregation strategy to reduce parameter interference, and a dimension-invariant module that enables knowledge sharing across heterogeneous architectures without huge computational cost. To mimic the real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. The empirical study shows that FedMosaic outperforms the state-of-the-art PFL methods, excelling in both personalization and generalization capabilities under challenging, realistic scenarios.
[379] UFGraphFR: Graph Federation Recommendation System based on User Text description features
Xudong Wang, Qingbo Hao, Xu Cheng, Yingyuan Xiao
Main category: cs.LG
TL;DR: UFGraphFR is a federated learning framework for recommendation systems that uses semantic vectors and secure user relationship graphs to improve recommendation accuracy while preserving privacy.
Details
Motivation: Traditional federated recommendation approaches treat users as isolated entities, failing to capture collaborative signals through global user relationship graphs, which limits recommendation accuracy.
Method: The framework has three components: 1) Client-side transformation of private data to text descriptions and semantic encoding, 2) Server-side secure reconstruction of user relationship graphs using aggregated weights and graph neural networks, 3) Client-side personalization of user behavior sequences using Transformers.
Result: Extensive experiments on four benchmark datasets show UFGraphFR significantly outperforms state-of-the-art baselines in both recommendation accuracy and personalization, with consistent performance across different pre-trained models.
Conclusion: UFGraphFR provides a practical method for efficient federated recommendations with strict privacy protection using semantic vectors, secure user relationship graphs, and personalized behavior sequences.
Abstract: Federated learning offers a privacy-preserving framework for recommendation systems by enabling local data processing; however, data localization introduces substantial obstacles. Traditional federated recommendation approaches treat each user as an isolated entity, failing to construct global user relationship graphs that capture collaborative signals, which limits the accuracy of recommendations. To address this limitation, we draw on the insight that semantic similarity reflects preference similarity, which can be used to improve the construction of user relationship graphs. This paper proposes UFGraphFR, a novel framework with three key components: 1) On the client side, private structured data is first transformed into text descriptions. These descriptions are then encoded into semantic vectors using pre-trained models; 2) On the server side, user relationship graphs are securely reconstructed using aggregated model weights without accessing raw data, followed by information propagation through lightweight graph neural networks; 3) On the client side, user behavior sequences are personalized using Transformer architectures. Extensive experiments conducted on four benchmark datasets demonstrate that UFGraphFR significantly outperforms state-of-the-art baselines in both recommendation accuracy and personalization. The framework also maintains robustness across different pre-trained models, as evidenced by the consistent performance metrics obtained. This work provides a practical method for efficient federated recommendations with strict privacy by using semantic vectors, secure user relationship graphs, and personalized behavior sequences. The code is available at: https://github.com/trueWangSyutung/UFGraphFR
[380] Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models
Michael Plainer, Hao Wu, Leon Klein, Stephan Günnemann, Frank Noé
Main category: cs.LG
TL;DR: The paper identifies inconsistencies in diffusion models for molecular sampling and proposes a Fokker-Planck-derived regularization to enforce consistency between learned scores and equilibrium distributions.
Details
Motivation: Diffusion models for molecular sampling show inconsistencies where the energy-based interpretation of learned scores doesn't match the training distribution, even in simple systems.
Method: Proposed an energy-based diffusion model with Fokker-Planck-derived regularization to enforce consistency between learned scores and data distribution evolution.
Result: Demonstrated improved consistency and efficient sampling across biomolecular systems including fast-folding proteins, and developed a state-of-the-art transferable Boltzmann emulator for dipeptides.
Conclusion: Fokker-Planck regularization addresses score inaccuracies at small diffusion timesteps, enabling more consistent and efficient molecular sampling and simulation.
Abstract: In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker–Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker–Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling. Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.
[381] Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders
Gongxu Luo, Haoyue Dai, Loka Li, Chengqian Gao, Boyang Sun, Kun Zhang
Main category: cs.LG
TL;DR: GISL is a new method for gene regulatory network inference that addresses both latent confounders and selection bias using perturbation data, distinguishing regulatory dependencies from spurious ones caused by selection.
Details
Motivation: Current GRNI methods often overlook selection bias, where only cells meeting certain criteria are observed, leading to spurious dependencies and flawed causal interpretations. There's a need to distinguish between regulation, confounding, and selection-induced dependencies.
Method: GISL leverages gene perturbation data, exploiting that selection-induced dependencies are symmetric under perturbation while regulatory and confounding dependencies are not. It infers gene regulatory relations and non-regulatory mechanisms up to equivalence class.
Result: Experiments on synthetic and real-world gene expression data demonstrate GISL’s effectiveness in uncovering true regulatory relations while accounting for selection bias and latent confounders.
Conclusion: Gene perturbations provide a simple yet effective way to distinguish selection-induced dependencies from true regulatory ones. GISL successfully addresses both selection bias and latent confounding in gene regulatory network inference.
Abstract: Gene regulatory network inference (GRNI) aims to discover how genes causally regulate each other from gene expression data. It is well-known that statistical dependencies in observed data do not necessarily imply causation, as spurious dependencies may arise from latent confounders, such as non-coding RNAs. Numerous GRNI methods have thus been proposed to address this confounding issue. However, dependencies may also result from selection, where only cells satisfying certain survival or inclusion criteria are observed, and these selection-induced spurious dependencies are frequently overlooked in gene expression data analyses. In this work, we show that such selection is ubiquitous and, when ignored or conflated with true regulations, can lead to flawed causal interpretation and misguided intervention recommendations. To address this challenge, a fundamental question arises: can we distinguish dependencies due to regulation, confounding, and crucially, selection? We show that gene perturbations offer a simple yet effective answer: selection-induced dependencies are symmetric under perturbation, while those from regulation or confounding are not. Building on this motivation, we propose GISL (Gene regulatory network Inference in the presence of Selection bias and Latent confounders), a principled algorithm that leverages perturbation data to uncover both true gene regulatory relations and non-regulatory mechanisms of selection and confounding up to the equivalence class. Experiments on synthetic and real-world gene expression data demonstrate the effectiveness of our method.
[382] Personalized Interpolation: Achieving Efficient Conversion Estimation with Flexible Optimization Windows
Xin Zhang, Weiliang Li, Rui Li, Zihang Fu, Tongyi Tang, Zhengyu Zhang, Wen-Yen Chen, Nima Noorshams, Nirav Jasapara, Xiaowen Ding, Ellie Wen, Xue Feng
Main category: cs.LG
TL;DR: Proposes Personalized Interpolation method for flexible advertiser-specific conversion windows in online advertising, improving prediction accuracy and efficiency.
Details
Motivation: Accurate conversion prediction is challenging due to variable time delays between user interactions and conversions, requiring flexible optimization windows tailored to specific advertiser behaviors.
Method: Personalized Interpolation method that extends existing fixed-window models to support flexible advertiser-specific optimization windows without increasing system complexity.
Result: Achieves high prediction accuracy and improved efficiency compared to existing solutions in real-world ads conversion model experiments.
Conclusion: Demonstrates potential to improve conversion optimization and support wider advertising strategies in large-scale online advertising systems.
Abstract: Optimizing conversions is crucial in modern online advertising systems, enabling advertisers to deliver relevant products to users and drive business outcomes. However, accurately predicting conversion events remains challenging due to variable time delays between user interactions (e.g., impressions or clicks) and the actual conversions. These delays vary substantially across advertisers and products, necessitating flexible optimization windows tailored to specific conversion behaviors. To address this, we propose a novel Personalized Interpolation method that extends existing models based on fixed conversion windows to support flexible advertiser-specific optimization windows. Our method enables accurate conversion estimation across diverse delay distributions without increasing system complexity. We evaluate the effectiveness of the proposed approach through extensive experiments using a real-world ads conversion model. Our results show that this method achieves both high prediction accuracy and improved efficiency compared to existing solutions. This study demonstrates the potential of our Personalized Interpolation method to improve conversion optimization and support a wider range of advertising strategies in large-scale online advertising systems.
[383] Relational Causal Discovery with Latent Confounders
Matteo Negro, Andrea Piras, Ragib Ahsan, David Arbour, Elena Zheleva
Main category: cs.LG
TL;DR: Proposes RelFCI, a sound and complete causal discovery algorithm for relational data with latent confounders, addressing limitations of existing methods.
Details
Motivation: Existing causal discovery methods either assume i.i.d. data (unsuitable for relational data) or causal sufficiency (unrealistic for real-world datasets with latent confounders).
Method: Builds on Fast Causal Inference (FCI) and Relational Causal Discovery (RCD) algorithms, defines new graphical models for relational domains, and establishes relational d-separation with latent confounders.
Result: Experimental results demonstrate RelFCI’s effectiveness in identifying correct causal structure in relational causal models with latent confounders.
Conclusion: RelFCI successfully bridges the gap for causal discovery in relational data with latent confounders, providing sound and complete guarantees.
Abstract: Estimating causal effects from real-world relational data can be challenging when the underlying causal model and potential confounders are unknown. While several causal discovery algorithms exist for learning causal models with latent confounders from data, they assume that the data is independent and identically distributed (i.i.d.) and are not well-suited for learning from relational data. Similarly, existing relational causal discovery algorithms assume causal sufficiency, which is unrealistic for many real-world datasets. To address this gap, we propose RelFCI, a sound and complete causal discovery algorithm for relational data with latent confounders. Our work builds upon the Fast Causal Inference (FCI) and Relational Causal Discovery (RCD) algorithms and it defines new graphical models, necessary to support causal discovery in relational domains. We also establish soundness and completeness guarantees for relational d-separation with latent confounders. We present experimental results demonstrating the effectiveness of RelFCI in identifying the correct causal structure in relational causal models with latent confounders.
[384] Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention
Arya Honarpisheh, Mustafa Bozdag, Octavia Camps, Mario Sznaier
Main category: cs.LG
TL;DR: Theoretical generalization analysis of selective state-space models (SSMs) with novel covering number-based bounds, examining how state matrix spectral properties affect training stability and sequence length generalization.
Details
Motivation: SSMs have emerged as alternatives to Transformers for sequence modeling, but lack theoretical understanding of their generalization properties, particularly for selective SSMs used in models like Mamba.
Method: Derived novel covering number-based generalization bounds for selective SSMs, analyzed spectral abscissa of continuous-time state matrix, and empirically validated on synthetic majority task, IMDb sentiment classification, and ListOps task.
Result: Developed theoretical framework showing how state matrix spectral properties influence training stability and generalization across sequence lengths, with empirical validation supporting theoretical insights.
Conclusion: Provides theoretical foundation for understanding selective SSM generalization, connecting spectral properties to practical model behavior, advancing theoretical analysis of modern sequence modeling architectures.
Abstract: State-space models (SSMs) have recently emerged as a compelling alternative to Transformers for sequence modeling tasks. This paper presents a theoretical generalization analysis of selective SSMs, the core architectural component behind the Mamba model. We derive a novel covering number-based generalization bound for selective SSMs, building upon recent theoretical advances in the analysis of Transformer models. Using this result, we analyze how the spectral abscissa of the continuous-time state matrix influences the model’s stability during training and its ability to generalize across sequence lengths. We empirically validate our findings on a synthetic majority task, the IMDb sentiment classification benchmark, and the ListOps task, demonstrating how our theoretical insights translate into practical model behavior.
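For readers unfamiliar with the term, the spectral abscissa of a matrix is the largest real part among its eigenvalues; a continuous-time linear system x'(t) = A x(t) is asymptotically stable exactly when this value is negative. The short sketch below computes it with NumPy. The example matrices are hypothetical, and the paper's generalization bound itself is not reproduced here.

```python
import numpy as np


def spectral_abscissa(A):
    """Largest real part among the eigenvalues of a square matrix."""
    return float(np.max(np.linalg.eigvals(A).real))


# Hypothetical continuous-time state matrices.
A_stable = np.array([[-1.0, 5.0],
                     [0.0, -0.1]])    # eigenvalues -1 and -0.1
A_marginal = np.array([[0.0, 1.0],
                       [-1.0, 0.0]])  # eigenvalues +i and -i (pure rotation)

alpha_stable = spectral_abscissa(A_stable)      # negative: state decays
alpha_marginal = spectral_abscissa(A_marginal)  # zero: state neither decays nor grows
```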
[385] Universal Sequence Preconditioning
Annie Marsden, Elad Hazan
Main category: cs.LG
TL;DR: The paper proposes a universal preconditioning method using orthogonal polynomial convolutions to improve sequential prediction performance in linear dynamical systems, achieving sublinear regret bounds.
Details
Motivation: To address the problem of preconditioning in sequential prediction, particularly for systems with marginally stable and asymmetric transition matrices where existing methods struggle.
Method: Convolving target sequences with coefficients from orthogonal polynomials (Chebyshev or Legendre) to apply polynomial transformations to hidden transition matrices in linear dynamical systems.
Result: The method reduces regret for prediction algorithms, achieves first-ever sublinear and hidden-dimension-independent regret bounds, and improves performance across diverse algorithms including RNNs in both synthetic and real-world experiments.
Conclusion: Orthogonal polynomial-based preconditioning is a simple yet effective universal strategy that enhances sequential prediction performance and generalizes beyond linear dynamical systems.
Abstract: We study the problem of preconditioning in sequential prediction. From the theoretical lens of linear dynamical systems, we show that convolving the target sequence corresponds to applying a polynomial to the hidden transition matrix. Building on this insight, we propose a universal preconditioning method that convolves the target with coefficients from orthogonal polynomials such as Chebyshev or Legendre. We prove that this approach reduces regret for two distinct prediction algorithms and yields the first ever sublinear and hidden-dimension-independent regret bounds (up to logarithmic factors) that hold for systems with marginally stable and asymmetric transition matrices. Finally, extensive synthetic and real-world experiments show that this simple preconditioning strategy improves the performance of a diverse range of algorithms, including recurrent neural networks, and generalizes to signals beyond linear dynamical systems.
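The identity stated in this abstract can be checked numerically on a toy system. The sketch below assumes a noiseless linear dynamical system with no inputs, y_t = C A^t x_0, and uses the monic Chebyshev polynomial as the preconditioner; the matrices, the degree, and the helper names are hypothetical choices for illustration rather than the authors' setup.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb


def monic_chebyshev_coeffs(n: int) -> np.ndarray:
    """Power-basis coefficients of T_n / 2^(n-1), highest degree first (n >= 1)."""
    low_to_high = cheb.cheb2poly([0.0] * n + [1.0])
    return (low_to_high / low_to_high[-1])[::-1]


def precondition(y: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Return out[t] = sum_i coeffs[i] * y[t - i] for every t >= len(coeffs) - 1."""
    return np.convolve(y, coeffs, mode="valid")


def poly_of_matrix(coeffs: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Evaluate p(A) by Horner's rule; coeffs are given highest degree first."""
    result = np.zeros_like(A)
    for c in coeffs:
        result = result @ A + c * np.eye(A.shape[0])
    return result


# Hypothetical asymmetric system with an eigenvalue close to 1.
A = np.array([[0.99, 0.5], [0.0, 0.9]])
C = np.array([1.0, 1.0])
x0 = np.array([1.0, -1.0])
T, n = 40, 4

y = np.array([C @ np.linalg.matrix_power(A, t) @ x0 for t in range(T)])
coeffs = monic_chebyshev_coeffs(n)
preconditioned = precondition(y, coeffs)

# The same sequence, obtained by applying p(A) inside the system instead.
pA = poly_of_matrix(coeffs, A)
expected = np.array(
    [C @ pA @ np.linalg.matrix_power(A, t) @ x0 for t in range(T - n)]
)
print(np.allclose(preconditioned, expected))
```

The coefficient that multiplies the most recent observation is the leading coefficient of the polynomial, which is why the power-basis coefficients are reversed before the convolution.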
[386] Electrical Load Forecasting over Multihop Smart Metering Networks with Federated Learning
Ratun Rahman, Pablo Moriano, Samee U. Khan, Dinh C. Nguyen
Main category: cs.LG
TL;DR: A personalized federated learning method using meta-learning to address data heterogeneity in smart meter load forecasting, with latency optimization and convergence analysis.
Details
Motivation: Traditional ML methods for electric load forecasting require data sharing which raises privacy concerns, and current FL approaches struggle with imbalanced data distribution across heterogeneous smart meters.
Method: Developed a meta-learning-based personalized federated learning strategy to handle data heterogeneity, studied latency optimization through optimal resource allocation, and conducted theoretical convergence analysis.
Result: Extensive simulations on real-world datasets show the method outperforms existing approaches in both load forecasting accuracy and reduced operational latency costs.
Conclusion: The proposed PFL method effectively addresses data heterogeneity and privacy concerns in smart meter networks while achieving superior forecasting performance and lower latency.
Abstract: Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) record household energy data. Traditional machine learning (ML) methods are often employed for load forecasting, but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for high-quality load forecasting in metering networks. A meta-learning-based strategy is developed to address data heterogeneity at local SMs in the collaborative training of local load forecasting models. Moreover, to minimize the load forecasting delays in our PFL model, we study a new latency optimization problem based on optimal resource allocation at SMs. A theoretical convergence analysis is also conducted to provide insights into FL design for federated load forecasting. Extensive simulations from real-world datasets show that our method outperforms existing approaches regarding better load forecasting and reduced operational latency costs.
[387] Remasking Discrete Diffusion Models with Inference-Time Scaling
Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, Volodymyr Kuleshov
Main category: cs.LG
TL;DR: ReMDM sampler enables iterative refinement in masked discrete diffusion models, allowing tokens to be updated during generation and providing inference-time compute scaling for improved quality.
Details
Motivation: Modern masked discrete diffusion models lack iterative refinement capability: once a token is generated, it cannot be updated even if it introduces errors, limiting their performance compared to autoregressive models.
Method: Introduces ReMDM sampler with a custom remasking backward process that can be applied to pretrained masked diffusion models, enabling tokens to be remasked and regenerated during sampling.
Result: ReMDM approaches autoregressive model quality with more sampling steps, maintains better quality with limited compute, improves sample quality for discretized images, and enhances controllability in scientific domains like molecule design.
Conclusion: ReMDM successfully addresses the iterative refinement limitation in masked discrete diffusion, providing flexible compute scaling and improved performance across natural language, images, and scientific applications.
Abstract: Part of the success of diffusion models stems from their ability to perform iterative refinement, i.e., repeatedly correcting outputs during generation. However, modern masked discrete diffusion lacks this capability: when a token is generated, it cannot be updated again, even when it introduces an error. Here, we address this limitation by introducing the remasking diffusion model (ReMDM) sampler, a method that can be applied to pretrained masked diffusion models in a principled way and that is derived from a discrete diffusion model with a custom remasking backward process. Most interestingly, ReMDM endows discrete diffusion with a form of inference-time compute scaling. By increasing the number of sampling steps, ReMDM generates natural language outputs that approach the quality of autoregressive models, whereas when the computation budget is limited, ReMDM better maintains quality. ReMDM also improves sample quality of masked diffusion models for discretized images, and in scientific domains such as molecule design, ReMDM facilitates diffusion guidance and pushes the Pareto frontier of controllability relative to classical masking and uniform noise diffusion. We provide the code along with a blog post on the project page: https://remdm.github.io
[388] Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization
Abdullah Akgül, Gulcin Baykal, Manuel Haußmann, Melih Kandemir
Main category: cs.LG
TL;DR: EPPO uses evidential critics to handle non-stationary environments by maintaining critic plasticity and enabling directed exploration through uncertainty quantification.
Details
Motivation: Continuous control in non-stationary environments is challenging for deep RL due to time-dependent state transitions and stability issues in actor-critic architectures.
Method: On-policy RL with evidential critic that approximates uncertainty around state values, detects distributional shifts from changing dynamics, and enables directed exploration.
Result: EPPO outperforms state-of-the-art on-policy RL variants in non-stationary continuous control tasks with changing dynamics.
Conclusion: Evidential uncertainty quantification is crucial for both policy evaluation and improvement in non-stationary environments.
Abstract: Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. The time-dependency of the state transition dynamics aggravates the notorious stability problems of model-free deep actor-critic architectures. We posit that two properties will play a key role in overcoming non-stationarity in transition dynamics: (i) preserving the plasticity of the critic network and (ii) directed exploration for rapid adaptation to changing dynamics. We show that performing on-policy reinforcement learning with an evidential critic provides both. The evidential design ensures a fast and accurate approximation of the uncertainty around the state value, which maintains the plasticity of the critic network by detecting the distributional shifts caused by changes in dynamics. The probabilistic critic also makes the actor training objective a random variable, enabling the use of directed exploration approaches as a by-product. We name the resulting algorithm Evidential Proximal Policy Optimization (EPPO) due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages. Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return.
[389] Closing the Intent-to-Behavior Gap via Fulfillment Priority Logic
Bassel El Mabsout, Abdelrahman Abdelgawad, Renato Mancuso
Main category: cs.LG
TL;DR: Fulfillment Priority Logic (FPL) enables practitioners to define logical formulas for multi-objective RL priorities, achieving 500% better sample efficiency than Soft Actor Critic.
Details
Motivation: Translating behavioral objectives into reward functions is challenging due to competing objectives that resist simple linear combinations, especially in robotics where performance conflicts with energy conservation.
Method: Proposed Fulfillment Priority Logic (FPL) for defining logical formulas of intentions and priorities, and developed Balanced Policy Gradient algorithm leveraging FPL specifications.
Result: Achieved up to 500% better sample efficiency compared to Soft Actor Critic, representing the first implementation of non-linear utility scalarization for continuous control problems.
Conclusion: FPL provides an effective framework for handling competing objectives in multi-objective reinforcement learning, significantly improving sample efficiency over existing methods.
Abstract: Practitioners designing reinforcement learning policies face a fundamental challenge: translating intended behavioral objectives into representative reward functions. This challenge stems from behavioral intent requiring simultaneous achievement of multiple competing objectives, typically addressed through labor-intensive linear reward composition that yields brittle results. Consider the ubiquitous robotics scenario where performance maximization directly conflicts with energy conservation. Such competitive dynamics are resistant to simple linear reward combinations. In this paper, we present the concept of objective fulfillment upon which we build Fulfillment Priority Logic (FPL). FPL allows practitioners to define logical formulas representing their intentions and priorities within multi-objective reinforcement learning. Our novel Balanced Policy Gradient algorithm leverages FPL specifications to achieve up to 500% better sample efficiency compared to Soft Actor Critic. Notably, this work constitutes the first implementation of non-linear utility scalarization design, specifically for continuous control problems.
[390] WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models
Aditya Shankar, Lydia Y. Chen, Arie van Deursen, Rihan Hai
Main category: cs.LG
TL;DR: WaveStitch is a diffusion-based method that generates temporal data by conditioning on both metadata and partially observed signals, using a hybrid training-inference approach with parallel generation and stitching for coherence.
Details
Motivation: Existing methods for temporal data generation have limitations: they rarely condition on both metadata and observed values together, use either training-time or inference-time approaches that don't generalize well, and suffer from trade-offs between generation speed and temporal coherence.
Method: WaveStitch uses dual-sourced conditioning on metadata and partially observed signals, a hybrid training-inference architecture with gradient-based guidance, and a pipeline-style paradigm that generates time windows in parallel while preserving coherence through conditional loss and stitching.
Result: WaveStitch achieves 1.81x lower mean-squared-error compared to state-of-the-art methods and generates data up to 166.48x faster than autoregressive methods while maintaining coherence across diverse datasets.
Conclusion: WaveStitch effectively overcomes key limitations in temporal data generation by enabling dual conditioning, hybrid training-inference, and fast coherent parallel generation through its novel pipeline paradigm.
Abstract: Generating temporal data under conditions is crucial for forecasting, imputation, and generative tasks. Such data often has metadata and partially observed signals that jointly influence the generated values. However, existing methods face three key limitations: (1) they condition on either the metadata or observed values, but rarely both together; (2) they adopt either training-time approaches that fail to generalize to unseen scenarios, or inference-time approaches that ignore metadata; and (3) they suffer from trade-offs between generation speed and temporal coherence across time windows–choosing either slow but coherent autoregressive methods or fast but incoherent parallel ones. We propose WaveStitch, a novel diffusion-based method to overcome these hurdles through: (1) dual-sourced conditioning on both metadata and partially observed signals; (2) a hybrid training-inference architecture, incorporating metadata during training and observations at inference via gradient-based guidance; and (3) a novel pipeline-style paradigm that generates time windows in parallel while preserving coherence through an inference-time conditional loss and a stitching mechanism. Across diverse datasets, WaveStitch demonstrates adaptability to arbitrary patterns of observed signals, achieving 1.81x lower mean-squared-error compared to the state-of-the-art, and generates data up to 166.48x faster than autoregressive methods while maintaining coherence. Our code is available at: https://github.com/adis98/WaveStitch
[391] Emergence and scaling laws in SGD learning of shallow neural networks
Yunwei Ren, Eshaan Nichani, Denny Wu, Jason D. Lee
Main category: cs.LG
TL;DR: Analysis of SGD dynamics for learning two-layer neural networks with extensive width (P ≫ 1) on Gaussian data, revealing sharp transition times for signal recovery and smooth scaling laws despite abrupt individual neuron learning.
Details
Motivation: To understand the complexity of online SGD in learning two-layer neural networks, particularly in the challenging extensive-width regime where the number of neurons P is large, and to characterize how diverging condition numbers affect learning dynamics.
Method: Analyze SGD dynamics for training a student two-layer network to minimize MSE objective on isotropic Gaussian data with orthonormal signal directions, focusing on power-law scaling of second-layer coefficients and activation functions with information exponent k*>2.
Result: Identified sharp transition times for recovering each signal direction, characterized scaling law exponents for MSE loss with respect to training samples, SGD steps, and network parameters. Showed that while individual neuron learning exhibits abrupt transitions, the combination of many emergent learning curves leads to smooth cumulative scaling laws.
Conclusion: The analysis provides precise understanding of SGD dynamics in extensive-width neural networks, revealing how the interplay between multiple learning timescales results in smooth overall learning behavior despite abrupt transitions at the individual neuron level.
Abstract: We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p\cdot \sigma(\langle\boldsymbol{x},\boldsymbol{v}_p^*\rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I}_d)$, where the activation $\sigma:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent $k_*>2$ (defined as the lowest degree in the Hermite expansion), $\{\boldsymbol{v}_p^*\}_{p\in[P]}\subset \mathbb{R}^d$ are orthonormal signal directions, and the non-negative second-layer coefficients satisfy $\sum_{p} a_p^2=1$. We focus on the challenging “extensive-width” regime $P\gg 1$ and permit diverging condition number in the second-layer, covering as a special case the power-law scaling $a_p\asymp p^{-\beta}$ where $\beta\in\mathbb{R}_{\ge 0}$. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective, and explicitly identify sharp transition times to recover each signal direction. In the power-law setting, we characterize scaling law exponents for the MSE loss with respect to the number of training samples and SGD steps, as well as the number of parameters in the student neural network. Our analysis entails that while the learning of individual teacher neurons exhibits abrupt transitions, the juxtaposition of $P\gg 1$ emergent learning curves at different timescales leads to a smooth scaling law in the cumulative objective.
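To make the teacher model in this abstract concrete, the sketch below builds one instance of it: orthonormal signal directions, non-negative power-law second-layer coefficients normalized so their squares sum to 1, and an even activation. Using the degree-4 Hermite polynomial as the activation (information exponent 4), QR factorization to obtain the directions, and the function names are assumptions made for this example, not choices taken from the paper.

```python
import numpy as np


def he4(z: np.ndarray) -> np.ndarray:
    """Probabilists' Hermite polynomial of degree 4: even, lowest Hermite degree 4."""
    return z**4 - 6.0 * z**2 + 3.0


def make_teacher(d: int, P: int, beta: float, rng: np.random.Generator):
    """Return (a, V): power-law coefficients and P orthonormal rows in R^d (P <= d)."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, P)))  # d x P, orthonormal columns
    V = Q.T                                           # row p is the direction v_p
    a = np.arange(1, P + 1, dtype=float) ** (-beta)   # a_p proportional to p^(-beta)
    a /= np.linalg.norm(a)                            # enforce sum_p a_p^2 = 1
    return a, V


def teacher(X: np.ndarray, a: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Evaluate f(x) = sum_p a_p * sigma(<x, v_p>) for each row x of X."""
    return he4(X @ V.T) @ a


rng = np.random.default_rng(0)
a, V = make_teacher(d=32, P=8, beta=1.0, rng=rng)
X = rng.standard_normal((5, 32))  # isotropic Gaussian inputs
print(teacher(X, a, V).shape)
```

Larger values of `beta` make the coefficients decay faster, which widens the gap between the largest and smallest `a_p` and corresponds to the diverging condition number the abstract mentions.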
[392] AI-driven software for automated quantification of skeletal metastases and treatment response evaluation using Whole-Body Diffusion-Weighted MRI (WB-DWI) in Advanced Prostate Cancer
Antonio Candito, Matthew D Blackledge, Richard Holbrey, Nuria Porta, Ana Ribeiro, Fabio Zugni, Luca D’Erme, Francesca Castagnoli, Alina Dragan, Ricardo Donners, Christina Messiou, Nina Tunariu, Dow-Mu Koh
Main category: cs.LG
TL;DR: Automated software for quantitative assessment of treatment response in advanced prostate cancer with bone metastases using whole-body diffusion-weighted MRI biomarkers.
Details
Motivation: Manual tracking of post-treatment changes in bone metastases from WB-DWI is cumbersome and increases inter-reader variability, creating an unmet clinical need for automated assessment.
Method: Developed software with three core technologies: weakly-supervised Residual U-Net for bone isolation, statistical framework for WB-DWI intensity normalization, and shallow CNN for lesion detection using normalized b900 images.
Result: Achieved 0.6 Dice score for lesion delineation, 8.8% relative difference for log-TDV, 5% for median gADC, 80.5% accuracy in treatment response assessment, with 90s computation time per scan.
Conclusion: The automated software provides reliable, repeatable quantitative assessment of treatment response in advanced prostate cancer with bone metastases, addressing clinical needs for standardized evaluation.
Abstract: Quantitative assessment of treatment response in Advanced Prostate Cancer (APC) with bone metastases remains an unmet clinical need. Whole-Body Diffusion-Weighted MRI (WB-DWI) provides two response biomarkers: Total Diffusion Volume (TDV) and global Apparent Diffusion Coefficient (gADC). However, tracking post-treatment changes of TDV and gADC from manually delineated lesions is cumbersome and increases inter-reader variability. We developed software to automate this process. Core technologies include: (i) a weakly-supervised Residual U-Net model generating a skeleton probability map to isolate bone; (ii) a statistical framework for WB-DWI intensity normalisation, obtaining a signal-normalised b=900s/mm^2 (b900) image; and (iii) a shallow convolutional neural network that processes outputs from (i) and (ii) to generate a mask of suspected bone lesions, characterised by higher b900 signal intensity due to restricted water diffusion. This mask is applied to the gADC map to extract TDV and gADC statistics. We tested the tool using expert-defined metastatic bone disease delineations on 66 datasets, assessed repeatability of imaging biomarkers (N=10), and compared software-based response assessment with a construct reference standard (N=118). Average Dice score between manual and automated delineations was 0.6 for lesions within pelvis and spine, with an average surface distance of 2mm. Relative differences for log-transformed TDV (log-TDV) and median gADC were 8.8% and 5%, respectively. Repeatability analysis showed coefficients of variation of 4.6% for log-TDV and 3.5% for median gADC, with intraclass correlation coefficients of 0.94 or higher. The software achieved 80.5% accuracy, 84.3% sensitivity, and 85.7% specificity in assessing response to treatment. Average computation time was 90s per scan.
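For readers unfamiliar with the overlap metric reported in this abstract, the sketch below shows the standard Dice score between two binary masks. The tiny masks and the convention of returning 1.0 when both masks are empty are assumptions made for the example; the paper's evaluation pipeline is not reproduced here.

```python
import numpy as np


def dice_score(mask_a, mask_b) -> float:
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(a, b).sum() / total)


# Hypothetical manual and automated lesion masks on a 2 x 3 grid.
manual = [[1, 1, 0], [0, 1, 0]]
automated = [[1, 0, 0], [0, 1, 1]]
print(dice_score(manual, automated))
```

Here the masks share 2 voxels and each contains 3, so the score is 2 * 2 / (3 + 3), about 0.67.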
[393] Why and When Deep is Better than Shallow: An Implementation-Agnostic State-Transition View of Depth Supremacy
Sho Sonoda, Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda
Main category: cs.LG
TL;DR: The paper provides a theoretical framework explaining why deep networks outperform shallow ones through bias-variance decomposition, identifying four canonical trade-off regimes where optimal depth typically exceeds 1.
Details
Motivation: To understand why and when deep networks are better than shallow networks in an implementation-agnostic framework, separating abstract state transitions from specific implementations like ReLU nets or transformers.
Method: Formulate deep models as abstract state-transition semigroups acting on metric spaces, prove bias-variance decomposition where variance depends only on abstract depth, and analyze depth dependence through metric entropy of state-transition semigroups.
Result: Identified four canonical bias-variance trade-off regimes (EL/EP/PL/PP) with explicit optimal depths, showing that k*>1 typically holds across regimes, providing rigorous depth supremacy. EL regime (exponential bias decay + logarithmic variance growth) achieves lowest generalization error.
Conclusion: Deep networks are typically better than shallow ones, especially for iterative or hierarchical concept classes like neural ODEs, diffusion models, and chain-of-thought reasoning, with the EL regime explaining optimal performance conditions.
Abstract: Why and when is deep better than shallow? We answer this question in a framework that is agnostic to network implementation. We formulate a deep model as an abstract state-transition semigroup acting on a general metric space, and separate the implementation (e.g., ReLU nets, transformers, and chain-of-thought) from the abstract state transition. We prove a bias-variance decomposition in which the variance depends only on the abstract depth-$k$ network and not on the implementation (Theorem 1). We further split the bounds into output and hidden parts to tie the depth dependence of the variance to the metric entropy of the state-transition semigroup (Theorem 2). We then investigate implementation-free conditions under which the variance grows polynomially or logarithmically with depth (Section 4). Combining these with exponential or polynomial bias decay identifies four canonical bias-variance trade-off regimes (EL/EP/PL/PP) and produces explicit optimal depths $k^\ast$. Across regimes, $k^\ast>1$ typically holds, giving a rigorous form of depth supremacy. The lowest generalization error bound is achieved under the EL regime (exp-decay bias + log-growth variance), explaining why and when deep is better, especially for iterative or hierarchical concept classes such as neural ODEs, diffusion/score-matching models, and chain-of-thought reasoning.
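The trade-off described in this abstract can be explored numerically with a toy bound. The sketch below assumes a bound of the form bias(k) + scale * variance(k), where the bias decays exponentially ("E") or polynomially ("P") and the variance grows logarithmically ("L") or polynomially ("P") in the depth k. The functional forms, the constants `rate` and `scale`, and the function names are all hypothetical; the paper's actual bounds and constants are not given in the abstract.

```python
import math


def bound(k: int, regime: str, rate: float = 0.5, scale: float = 0.01) -> float:
    """Hypothetical bias + variance bound at depth k for a two-letter regime code."""
    bias_kind, var_kind = regime  # e.g. "EL" = exponential bias, logarithmic variance
    bias = math.exp(-rate * k) if bias_kind == "E" else k ** (-rate)
    variance = math.log(k) if var_kind == "L" else float(k)
    return bias + scale * variance


def optimal_depth(regime: str, max_depth: int = 10_000) -> int:
    """Integer depth in [1, max_depth] that minimizes the toy bound."""
    return min(range(1, max_depth + 1), key=lambda k: bound(k, regime))


for regime in ("EL", "EP", "PL", "PP"):
    k_star = optimal_depth(regime)
    print(regime, k_star, bound(k_star, regime))
```

With these constants every regime is minimized at a depth greater than 1, but that outcome depends on the assumed values: a large enough `scale` pushes the minimizer back to k = 1.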
[394] Strategic Classification with Non-Linear Classifiers
Benyamin Trachtenberg, Nir Rosenfeld
Main category: cs.LG
TL;DR: Strategic classification extends supervised learning to account for users’ costly feature manipulations in response to classifiers. This work explores how strategic behavior manifests under non-linear classifiers, showing it can either increase or decrease effective class complexity, and that universal approximators like neural nets lose their universality in strategic environments.
Details
Motivation: While standard learning supports various model classes, strategic classification has focused mostly on linear classifiers. This work aims to expand understanding to non-linear classifiers and how strategic behavior affects learning in these settings.
Method: The authors take a bottom-up approach analyzing how non-linearity affects decision boundary points, classifier expressivity, and model class complexity in strategic environments.
Result: Results show strategic behavior can either increase or decrease effective class complexity, with complexity decrease potentially being arbitrarily large. Key finding is that universal approximators (e.g., neural nets) are no longer universal in strategic environments.
Conclusion: Strategic environments fundamentally change the capabilities of machine learning models, where even universal approximators lose their universality, creating performance gaps even with unrestricted model classes.
Abstract: In strategic classification, the standard supervised learning setting is extended to support the notion of strategic user behavior in the form of costly feature manipulations made in response to a classifier. While standard learning supports a broad range of model classes, the study of strategic classification has, so far, been dedicated mostly to linear classifiers. This work aims to expand the horizon by exploring how strategic behavior manifests under non-linear classifiers and what this implies for learning. We take a bottom-up approach showing how non-linearity affects decision boundary points, classifier expressivity, and model class complexity. Our results show how, unlike the linear case, strategic behavior may either increase or decrease effective class complexity, and that the complexity decrease may be arbitrarily large. Another key finding is that universal approximators (e.g., neural nets) are no longer universal once the environment is strategic. We demonstrate empirically how this can create performance gaps even on an unrestricted model class.
[395] Revisiting Multivariate Time Series Forecasting with Missing Values
Jie Yang, Yifan Hu, Kexin Zhang, Luyang Niu, Philip S. Yu, Kaize Ding
Main category: cs.LG
TL;DR: CRIB is a novel framework that directly predicts from partially observed time series without imputation, using Information Bottleneck principle and consistency regularization to handle missing values effectively.
Details
Motivation: Traditional imputation-then-prediction approaches for multivariate time series forecasting with missing values are problematic because there's no ground truth for missing values, making imputation error-prone and potentially degrading prediction accuracy.
Method: Proposes Consistency-Regularized Information Bottleneck (CRIB) framework that uses unified-variate attention mechanism and consistency regularization to learn robust representations that filter out noise from missing values while preserving predictive signals.
Result: Comprehensive experiments on four real-world datasets show CRIB predicts accurately even under high missing rates, outperforming traditional imputation-based approaches.
Conclusion: CRIB represents a paradigm shift from imputation-based methods to direct prediction from partially observed data, providing a more robust solution for time series forecasting with missing values.
Abstract: Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available at https://github.com/Muyiiiii/CRIB.
[396] MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver
Yuepeng Zheng, Fu Luo, Zhenkun Wang, Yaoxin Wu, Yu Zhou
Main category: cs.LG
TL;DR: A novel multi-task learning method using knowledge distillation (MTL-KD) to train heavy decoder models for solving multiple Vehicle Routing Problem variants with strong generalization to large-scale problems.
Details
Motivation: Existing RL-based multi-task methods can only train light decoder models on small-scale problems with limited generalization ability for large-scale VRPs.
Method: Knowledge distillation transfers policy knowledge from multiple single-task RL models to a single heavy decoder model, plus Random Reordering Re-Construction (R3C) inference strategy.
Result: Superior performance on 6 seen and 10 unseen VRP variants with up to 1000 nodes, demonstrating robust generalization on both uniform and real-world benchmarks.
Conclusion: MTL-KD enables efficient training of heavy decoder models with strong generalization across diverse VRP tasks, outperforming existing methods.
Abstract: Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach to train a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables the efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model’s generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities.
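One common way to transfer policy knowledge from a teacher to a student is to match their next-node distributions with a KL divergence. The sketch below shows that generic distillation loss; it is an assumption about the general technique, not the loss actually used by MTL-KD, and the function names, logit shapes, and the requirement that all logits be finite are hypothetical.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis (finite logits assumed)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits: np.ndarray, teacher_logits: np.ndarray) -> float:
    """Mean KL(teacher || student) over a batch of next-node distributions."""
    t_logp = log_softmax(np.asarray(teacher_logits, dtype=float))
    s_logp = log_softmax(np.asarray(student_logits, dtype=float))
    kl = np.sum(np.exp(t_logp) * (t_logp - s_logp), axis=-1)
    return float(np.mean(kl))


# Hypothetical logits over 4 candidate nodes for a batch of 2 decoding steps.
teacher_logits = np.array([[2.0, 0.5, 0.1, -1.0], [0.0, 1.5, 0.3, 0.2]])
student_logits = np.array([[1.0, 1.0, 0.0, -0.5], [0.2, 0.9, 0.4, 0.1]])
print(distillation_loss(student_logits, teacher_logits))
```

Because the target is the teacher's output distribution rather than an optimal tour, a loss of this kind needs no labeled solutions, which is consistent with the label-free training the abstract describes.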
[397] How Effective Are Time-Series Models for Precipitation Nowcasting? A Comprehensive Benchmark for GNSS-based Precipitation Nowcasting
Yifang Zhang, Shengwu Xiong, Henan Wang, Wenjie Yin, Jiawang Peng, Yuqiang Zhang, Chen Zhou, Hua Chen, Qile Zhao, Pengfei Duan
Main category: cs.LG
TL;DR: RainfallBench is a new benchmark for precipitation nowcasting (0-6 hour prediction) that addresses gaps in existing meteorological benchmarks by focusing on complex rainfall patterns with zero inflation, temporal decay, and non-stationarity.
Details
Motivation: Existing time series forecasting benchmarks focus on periodic variables like temperature and humidity, failing to capture the complexity of precipitation nowcasting which is critical for disaster mitigation and real-time response planning.
Method: Created RainfallBench using 5 years of hourly meteorological data from 140+ GNSS stations globally, incorporating precipitable water vapor (PWV) and designing specialized evaluation protocols for multi-scale prediction, multi-resolution forecasting, and extreme rainfall events. Also introduced Bi-Focus Precipitation Forecaster (BFPF) to address zero-inflation and temporal decay issues.
Result: Benchmarked 17 state-of-the-art models across six major architectures, with statistical analysis and ablation studies validating dataset comprehensiveness and methodology superiority.
Conclusion: RainfallBench provides a comprehensive benchmark for precipitation nowcasting that better reflects real-world meteorological challenges, and the proposed BFPF module effectively addresses domain-specific issues in rainfall forecasting.
Abstract: Precipitation Nowcasting, which aims to predict precipitation within the next 0 to 6 hours, is critical for disaster mitigation and real-time response planning. However, most time series forecasting benchmarks in meteorology are evaluated on variables with strong periodicity, such as temperature and humidity, which fail to reflect model capabilities in more complex and practically relevant meteorological scenarios like precipitation nowcasting. To address this gap, we propose RainfallBench, a benchmark designed for precipitation nowcasting, a highly challenging and practically relevant task characterized by zero inflation, temporal decay, and non-stationarity, focusing on predicting precipitation within the next 0 to 6 hours. The dataset is derived from five years of meteorological observations, recorded at hourly intervals across six essential variables, and collected from more than 140 Global Navigation Satellite System (GNSS) stations globally. In particular, it incorporates precipitable water vapor (PWV), a crucial indicator of rainfall that is absent in other datasets. We further design specialized evaluation protocols to assess model performance on key meteorological challenges, including multi-scale prediction, multi-resolution forecasting, and extreme rainfall events, benchmarking 17 state-of-the-art models across six major architectures on RainfallBench. Additionally, to address the zero-inflation and temporal decay issues overlooked by existing models, we introduce Bi-Focus Precipitation Forecaster (BFPF), a plug-and-play module that incorporates domain-specific priors to enhance rainfall time series forecasting. Statistical analysis and ablation studies validate the comprehensiveness of our dataset as well as the superiority of our methodology.
[398] From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba
Main category: cs.LG
TL;DR: The paper introduces MP-SAE, a sparse autoencoder that captures hierarchical and nonlinear features by unrolling the encoder into residual-guided steps, challenging the assumption that neural network features are only linearly accessible.
Details
Motivation: To address limitations of existing sparse autoencoders (SAEs) that assume features are linearly accessible and orthogonal, when recent evidence shows neural representations exhibit hierarchical, nonlinear, and multi-dimensional structure.
Method: Developed MP-SAE by re-contextualizing matching pursuits algorithm, unrolling the encoder into sequential residual-guided steps to capture hierarchical and nonlinear features.
Result: MP-SAE successfully captures hierarchical concepts with conditional orthogonality, recovers meaningful nonlinear features, and reveals shared structure across modalities in vision-language models, while enabling adaptive sparsity.
Conclusion: Interpretability methods should start from the phenomenology of representations rather than imposing assumptions, and features are not solely linearly accessible as previously assumed.
Abstract: Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in interpretability. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuits (MP) algorithm from sparse coding to design MP-SAE – an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue our results provide credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.
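The residual-guided encoding described in the abstract can be illustrated with classical matching pursuit. The following is a minimal sketch, not the authors' MP-SAE: it assumes a fixed random dictionary `D` with unit-norm columns (MP-SAE learns its dictionary), and the names `matching_pursuit`, `D`, and `n_steps` are hypothetical choices made for illustration.

```python
import numpy as np


def matching_pursuit(x, D, n_steps):
    """Greedy sparse coding of x over a dictionary D with unit-norm columns.

    Illustrative only: a plain matching pursuit loop, not the MP-SAE encoder.
    """
    residual = x.astype(float).copy()
    codes = np.zeros(D.shape[1])
    for _ in range(n_steps):
        scores = D.T @ residual              # correlation of each atom with the residual
        j = int(np.argmax(np.abs(scores)))   # atom that best matches the residual
        codes[j] += scores[j]                # accumulate that atom's coefficient
        residual -= scores[j] * D[:, j]      # subtract the part the atom explains
    return codes, residual


# Hypothetical toy data: a random dictionary and a signal built from two atoms.
rng = np.random.default_rng(0)
D = rng.normal(size=(16, 64))
D /= np.linalg.norm(D, axis=0)               # normalise columns to unit norm
x = 2.0 * D[:, 3] - 1.5 * D[:, 10]

codes, residual = matching_pursuit(x, D, n_steps=8)
```

Because each step conditions on what earlier steps left unexplained, stopping the loop earlier or later changes how many atoms are used, which is one way to read the adaptive sparsity the abstract mentions.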
[399] Efficient Latent Variable Causal Discovery: Combining Score Search and Targeted Testing
Joseph Ramsey, Bryan Andrews, Peter Spirtes
Main category: cs.LG
TL;DR: The paper presents a family of score-guided causal search algorithms that improve upon FCI for latent variable settings, including BOSS-FCI, GRaSP-FCI, FCIT, and LV-Dumb, with FCIT showing the best balance of correctness and efficiency.
Details
Motivation: FCI algorithm struggles with exhaustive conditional independence tests in latent variable settings, leading to spurious results and poor scalability.
Method: Developed four methods: BOSS-FCI and GRaSP-FCI (GFCI variants), FCIT (targeted testing guided by BOSS), and LV-Dumb (BOSS-POD heuristic that directly returns PAG from BOSS DAG).
Result: BOSS-FCI and GRaSP-FCI provide sound baselines, FCIT improves efficiency and reliability, and LV-Dumb offers practical heuristic with strong empirical performance despite not being strictly correct.
Conclusion: Score-guided and targeted strategies enable scalable latent-variable causal discovery, with FCIT representing the best trade-off between correctness and efficiency.
Abstract: Learning causal structure from observational data is especially challenging when latent variables or selection bias are present. The Fast Causal Inference (FCI) algorithm addresses this setting but often performs exhaustive conditional independence tests across many subsets, leading to spurious independence claims, extra or missing edges, and unreliable orientations. We present a family of score-guided mixed-strategy causal search algorithms that build on this tradition. First, we introduce BOSS-FCI and GRaSP-FCI, straightforward variants of GFCI that substitute BOSS or GRaSP for FGES, thereby retaining correctness while incurring different scalability tradeoffs. Second, we develop FCI Targeted-testing (FCIT), a novel mixed-strategy method that improves upon these variants by replacing exhaustive all-subsets testing with targeted tests guided by BOSS, yielding well-formed PAGs with higher precision and efficiency. Finally, we propose a simple heuristic, LV-Dumb (also known as BOSS-POD), which bypasses latent-variable-specific reasoning and directly returns the PAG of the BOSS DAG. Although not strictly correct in the FCI sense, it scales better and often achieves superior accuracy in practice. Simulations and real-data analyses demonstrate that BOSS-FCI and GRaSP-FCI provide sound baselines, FCIT improves both efficiency and reliability, and LV-Dumb offers a practical heuristic with strong empirical performance. Together, these methods highlight the value of score-guided and targeted strategies for scalable latent-variable causal discovery.
[400] Reliably Detecting Model Failures in Deployment Without Labels
Viet Nguyen, Changjian Shui, Vijay Giri, Siddharth Arya, Amol Verma, Fahad Razak, Rahul G. Krishnan
Main category: cs.LG
TL;DR: D3M monitors model performance degradation over time without labels by tracking predictive model disagreement, providing reliable alerts when retraining is needed.
Details
Motivation: Models in dynamic environments need retraining when data distribution shifts, but knowing when to retrain without labels is challenging since not all shifts degrade performance.
Method: Proposes D3M algorithm based on disagreement of predictive models to detect post-deployment deterioration.
Result: Achieves low false positive rates under non-deteriorating shifts and provides sample complexity bounds for high true positive rates under deteriorating shifts.
Conclusion: Empirical results demonstrate effectiveness on benchmarks and real-world medical data, showing viability as an alert mechanism for high-stakes ML pipelines.
Abstract: The distribution of data changes over time; models operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge since some, but not all, shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models, which achieves low false positive rates under non-deteriorating shifts and comes with sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both standard benchmarks and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.
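To make the idea of label-free monitoring via model disagreement concrete, here is a generic sketch. It is not the D3M algorithm, whose details the summary does not give: it simply assumes two classifiers whose predictions are already available, calibrates a threshold from bootstrap batches of in-distribution predictions, and raises an alert when deployment-time disagreement exceeds it. All function names, the bootstrap calibration, and the 0.95 quantile are hypothetical choices.

```python
import numpy as np


def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two models predict different labels."""
    return float(np.mean(np.asarray(preds_a) != np.asarray(preds_b)))


def calibrate_threshold(preds_a, preds_b, batch_size, n_batches, rng, quantile=0.95):
    """Estimate a disagreement threshold from in-distribution predictions.

    Hypothetical calibration: resample batches and take an upper quantile
    of their disagreement rates. No labels are needed at any point.
    """
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    rates = []
    for _ in range(n_batches):
        idx = rng.choice(len(preds_a), size=batch_size, replace=True)
        rates.append(disagreement_rate(preds_a[idx], preds_b[idx]))
    return float(np.quantile(rates, quantile))


# Toy reference data: the two models disagree on 5% of in-distribution inputs.
rng = np.random.default_rng(0)
ref_a = np.zeros(1000, dtype=int)
ref_b = np.zeros(1000, dtype=int)
ref_b[:50] = 1
threshold = calibrate_threshold(ref_a, ref_b, batch_size=200, n_batches=500, rng=rng)

# Toy deployment batch: disagreement has risen to 40%.
dep_a = np.zeros(200, dtype=int)
dep_b = np.zeros(200, dtype=int)
dep_b[:80] = 1
alert = disagreement_rate(dep_a, dep_b) > threshold
```

A rise in disagreement alone does not prove that accuracy has dropped, which is the distinction between deteriorating and non-deteriorating shifts that the paper formalizes.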
[401] Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit
Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba
Main category: cs.LG
TL;DR: Current sparse autoencoders (SAEs) struggle with correlated features due to implicit quasi-orthogonality assumptions. This paper introduces MP-SAE, an iterative SAE based on Matching Pursuit, which better handles correlated features in hierarchical settings like MNIST digit generation while ensuring monotonic reconstruction improvement.
Details
Motivation: Evaluate SAEs in controlled MNIST setting to understand limitations of current shallow architectures, particularly their inability to extract correlated features due to implicit quasi-orthogonality assumptions.
Method: Propose MP-SAE (Matching Pursuit-based Sparse Autoencoder) - an iterative SAE that unrolls Matching Pursuit algorithm, enabling residual-guided extraction of correlated features while guaranteeing monotonic reconstruction improvement.
Result: MP-SAE successfully extracts correlated features that arise in hierarchical settings like handwritten digit generation, overcoming limitations of traditional SAEs.
Conclusion: MP-SAE provides a more effective approach for extracting correlated features in hierarchical data structures while maintaining theoretical guarantees on reconstruction quality.
Abstract: Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting using MNIST, which reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption that limits the ability to extract correlated features. To move beyond this, we compare them with an iterative SAE that unrolls Matching Pursuit (MP-SAE), enabling the residual-guided extraction of correlated features that arise in hierarchical settings such as handwritten digit generation while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.
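The monotonic-improvement property of matching pursuit follows from a short identity. The sketch below assumes unit-norm dictionary atoms $d_j$; the symbols $r_t$ (residual after $t$ steps, with $r_0$ the input) and $j_t$ (selected atom index) are notation introduced here for illustration, not taken from the paper.

```latex
\begin{aligned}
j_t &= \arg\max_j \, \lvert \langle r_t, d_j \rangle \rvert, \\
r_{t+1} &= r_t - \langle r_t, d_{j_t} \rangle \, d_{j_t}, \\
\lVert r_{t+1} \rVert_2^2 &= \lVert r_t \rVert_2^2 - \langle r_t, d_{j_t} \rangle^2 \;\le\; \lVert r_t \rVert_2^2 .
\end{aligned}
```

Each selected atom therefore removes a non-negative amount of squared error, so the reconstruction error never increases as more atoms are added.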
[402] Neural Collapse in Cumulative Link Models for Ordinal Regression: An Analysis with Unconstrained Feature Model
Chuang Ma, Tomoyuki Obuchi, Toshiyuki Tanaka
Main category: cs.LG
TL;DR: The paper introduces Ordinal Neural Collapse (ONC), a phenomenon in deep Ordinal Regression tasks where features and classifiers exhibit a simple geometric structure similar to Neural Collapse in classification tasks.
Details
Motivation: To investigate whether Neural Collapse (NC) phenomena extend to Ordinal Regression tasks, building on the theoretical framework of the Unconstrained Feature Model (UFM) and aiming to deepen understanding of deep neural network behavior in ordinal settings.
Method: Combined the cumulative link model for Ordinal Regression with the Unconstrained Feature Model framework, analytically proving ONC properties and empirically validating them across various datasets.
Result: Identified three key properties of Ordinal Neural Collapse: (1) within-class feature collapse to means with regularization, (2) class means aligning with classifiers in 1D subspace, and (3) optimal latent variables aligning with class order and threshold values.
Conclusion: Ordinal Neural Collapse emerges in deep Ordinal Regression tasks with specific geometric properties, and these insights can be leveraged for practical applications, particularly through the use of fixed thresholds.
Abstract: A phenomenon known as “Neural Collapse (NC)” in deep classification tasks, in which the penultimate-layer features and the final classifiers exhibit an extremely simple geometric structure, has recently attracted considerable attention, with the expectation that it can deepen our understanding of how deep neural networks behave. The Unconstrained Feature Model (UFM) has been proposed to explain NC theoretically, and there emerges a growing body of work that extends NC to tasks other than classification and leverages it for practical applications. In this study, we investigate whether a similar phenomenon arises in deep Ordinal Regression (OR) tasks, via combining the cumulative link model for OR and UFM. We show that a phenomenon we call Ordinal Neural Collapse (ONC) indeed emerges and is characterized by the following three properties: (ONC1) all optimal features in the same class collapse to their within-class mean when regularization is applied; (ONC2) these class means align with the classifier, meaning that they collapse onto a one-dimensional subspace; (ONC3) the optimal latent variables (corresponding to logits or preactivations in classification tasks) are aligned according to the class order, and in particular, in the zero-regularization limit, a highly local and simple geometric relationship emerges between the latent variables and the threshold values. We prove these properties analytically within the UFM framework with fixed threshold values and corroborate them empirically across a variety of datasets. We also discuss how these insights can be leveraged in OR, highlighting the use of fixed thresholds.
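For readers unfamiliar with cumulative link models, the sketch below shows how a scalar latent variable and a set of ordered thresholds produce class probabilities. It assumes a logistic link and the sign convention P(y <= k | z) = sigmoid(b_k - z); the paper may use a different link or convention, and the function name `clm_class_probs` and the toy thresholds are hypothetical.

```python
import numpy as np


def clm_class_probs(z, thresholds):
    """Class probabilities of a logistic cumulative link model.

    z          : latent values, shape (n,)
    thresholds : increasing cut points b_1 < ... < b_{K-1}, shape (K-1,)
    Assumed convention: P(y <= k | z) = sigmoid(b_k - z).
    """
    z = np.asarray(z, dtype=float)[:, None]
    b = np.asarray(thresholds, dtype=float)[None, :]
    cdf = 1.0 / (1.0 + np.exp(-(b - z)))          # cumulative probabilities
    zeros = np.zeros((z.shape[0], 1))
    ones = np.ones((z.shape[0], 1))
    cdf = np.concatenate([zeros, cdf, ones], axis=1)
    return np.diff(cdf, axis=1)                    # P(y = k) as successive differences


# Three latent values placed below, between, and above two fixed thresholds.
probs = clm_class_probs([-3.0, 0.0, 3.0], thresholds=[-1.0, 1.0])
predicted = probs.argmax(axis=1)
```

With the thresholds held fixed, as in the paper's analysis, the only thing the network controls is where the latent variable falls relative to them, which is why the ordering of latent variables by class (ONC3) is the natural quantity to study.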
[403] Two-Player Zero-Sum Games with Bandit Feedback
Elif Yılmaz, Christos Dimitrakakis
Main category: cs.LG
TL;DR: Three ETC-based algorithms for zero-sum games with bandit feedback achieve instance-dependent regret bounds: O(Δ + √T) for basic ETC and O(log(TΔ²)/Δ) for elimination variants.
Details
Motivation: To demonstrate the applicability of Explore-Then-Commit framework in zero-sum game settings and derive instance-dependent regret bounds, which have received limited attention in the literature.
Method: Proposed three algorithms: basic ETC adapted to zero-sum games, adaptive elimination leveraging ε-Nash Equilibrium, and elimination with non-uniform exploration.
Result: Achieved instance-dependent regret bounds: O(Δ + √T) for ETC and O(log(TΔ²)/Δ) for elimination algorithms after T rounds, where Δ is the suboptimality gap.
Conclusion: ETC-based algorithms perform effectively in adversarial game settings, achieving regret bounds comparable to existing methods while providing valuable instance-dependent analysis.
Abstract: We study a two-player zero-sum game in which the row player aims to maximize their payoff against an adversarial column player, under an unknown payoff matrix estimated through bandit feedback. We propose three algorithms based on the Explore-Then-Commit framework. The first adapts it to zero-sum games, the second incorporates adaptive elimination that leverages the $\varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair, and the third extends the elimination algorithm by employing non-uniform exploration. Our objective is to demonstrate the applicability of ETC in a zero-sum game setting by focusing on learning pure strategy Nash Equilibria. A key contribution of our work is a derivation of instance-dependent upper bounds on the expected regret of our proposed algorithms, which has received limited attention in the literature on zero-sum games. Particularly, after $T$ rounds, we achieve instance-dependent regret upper bounds of $O(\Delta + \sqrt{T})$ for ETC in the zero-sum game setting and $O(\log (T \Delta^2) / \Delta)$ for the adaptive elimination algorithm and its variant with non-uniform exploration, where $\Delta$ denotes the suboptimality gap. Therefore, our results indicate that ETC-based algorithms perform effectively in adversarial game settings, achieving regret bounds comparable to existing methods while providing insight through instance-dependent analysis.
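A rough sketch of the Explore-Then-Commit idea in a matrix game follows. It is not the paper's algorithm: it assumes, for illustration, that every action pair can be sampled `m` times with Gaussian noise during exploration, and then commits to the maximin row of the estimated matrix. The function name `etc_zero_sum`, the noise model, and the toy payoff matrix are all hypothetical.

```python
import numpy as np


def etc_zero_sum(payoff_mean, m, rng, noise=0.1):
    """Explore-Then-Commit on a matrix game (illustrative assumptions only).

    Explore: observe each (row, column) pair m times with Gaussian noise.
    Commit : play the row maximising the worst-case estimated payoff.
    """
    n_rows, n_cols = payoff_mean.shape
    est = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            samples = payoff_mean[i, j] + noise * rng.standard_normal(m)
            est[i, j] = samples.mean()             # empirical mean payoff of pair (i, j)
    row = int(np.argmax(est.min(axis=1)))          # maximin row for the row player
    col = int(np.argmin(est[row]))                 # adversary's best response to that row
    return row, col, est


# Toy game with a pure-strategy Nash equilibrium at (row 0, column 0):
# 0.5 is the minimum of its row and the maximum of its column.
A = np.array([[0.5, 0.6],
              [0.2, 0.9]])
rng = np.random.default_rng(0)
row, col, est = etc_zero_sum(A, m=200, rng=rng)
```

The exploration length `m` governs the trade-off: too few samples risk committing to the wrong pair, while too many waste rounds on suboptimal pairs, and the gap $\Delta$ determines how many samples are needed to tell them apart.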
[404] Modeling Hierarchical Spaces: A Review and Unified Framework for Surrogate-Based Architecture Design
Paul Saves, Edward Hallé-Hannan, Jasper Bussemaker, Youssef Diouane, Nathalie Bartoli
Main category: cs.LG
TL;DR: A unified framework for handling hierarchical, conditional, and mixed-variable input spaces using design space graphs, hierarchical distances, and kernels for surrogate modeling and optimization.
Details
Motivation: Address challenges in simulation-based problems with hierarchical, conditional, heterogeneous, or tree-structured mixed-variable inputs that pose difficulties for data representation, modeling, and optimization.
Method: Propose design space graphs combining feature modeling and graph theory to capture hierarchical relationships, introduce hierarchical distances and kernels, and define meta variables and partially-decreed variables for conditional structures.
Result: Demonstrated effectiveness on complex system design problems including neural networks and green-aircraft case studies, with methods implemented in the open-source Surrogate Modeling Toolbox (SMT 2.0).
Conclusion: The framework provides a comprehensive approach for modeling and optimizing complex hierarchical domains with mixed-variable inputs, generalizing existing methods and enabling practical applications in system design.
Abstract: Simulation-based problems involving mixed-variable inputs frequently feature domains that are hierarchical, conditional, heterogeneous, or tree-structured. These characteristics pose challenges for data representation, modeling, and optimization. This paper reviews extensive literature on these structured input spaces and proposes a unified framework that generalizes existing approaches. In this framework, input variables may be continuous, integer, or categorical. A variable is described as meta if its value governs the presence of other decreed variables, enabling the modeling of conditional and hierarchical structures. We further introduce the concept of partially-decreed variables, whose activation depends on contextual conditions. To capture these inter-variable hierarchical relationships, we introduce design space graphs, combining principles from feature modeling and graph theory. This allows the definition of general hierarchical domains suitable for describing complex system architectures. Our framework defines hierarchical distances and kernels to enable surrogate modeling and optimization on hierarchical domains. We demonstrate its effectiveness on complex system design problems, including a neural network and a green-aircraft case study. Our methods are available in the open-source Surrogate Modeling Toolbox (SMT 2.0).
[405] Aggregation of Published Non-Uniform Axial Power Data for Phase II of the OECD/NEA AI/ML Critical Heat Flux Benchmark
Reece Bourisaw, Reid McCants, Jean-Marie Le Corre, Anna Iskhakova, Arsen S. Iskhakov
Main category: cs.LG
TL;DR: This paper compiles and digitizes CHF datasets for OECD/NEA AI/ML benchmark, showing classical correlations fail on non-uniform heating while neural networks trained only on uniform data don’t generalize, highlighting the need for models that incorporate axial power distributions.
Details
Motivation: To support Phase II of OECD/NEA AI/ML CHF benchmark which introduces spatially varying power profiles, and address the limitations of existing CHF prediction methods under non-uniform heating conditions.
Method: Compiled and digitized broad CHF dataset covering uniform and non-uniform axial heating; extracted heating profiles from technical reports, interpolated onto consistent axial mesh, validated via energy-balance checks, and encoded in machine-readable formats.
Result: Classical CHF correlations show substantial errors under uniform heating and degrade markedly with non-uniform profiles; neural network trained only on uniform data performs well in uniform regime but fails to generalize to spatially varying scenarios.
Conclusion: The curated datasets and baseline modeling results provide groundwork for advanced transfer-learning strategies, rigorous uncertainty quantification, and design-optimization efforts in next phase of CHF benchmark.
Abstract: Critical heat flux (CHF) marks the onset of boiling crisis in light-water reactors, defining safe thermal-hydraulic operating limits. To support Phase II of the OECD/NEA AI/ML CHF benchmark, which introduces spatially varying power profiles, this work compiles and digitizes a broad CHF dataset covering both uniform and non-uniform axial heating conditions. Heating profiles were extracted from technical reports, interpolated onto a consistent axial mesh, validated via energy-balance checks, and encoded in machine-readable formats for benchmark compatibility. Classical CHF correlations exhibit substantial errors under uniform heating and degrade markedly when applied to non-uniform profiles, while modern tabular methods offer improved but still imperfect predictions. A neural network trained solely on uniform data performs well in that regime but fails to generalize to spatially varying scenarios, underscoring the need for models that explicitly incorporate axial power distributions. By providing these curated datasets and baseline modeling results, this study lays the groundwork for advanced transfer-learning strategies, rigorous uncertainty quantification, and design-optimization efforts in the next phase of the CHF benchmark.
[406] Learning Low Rank Neural Representations of Hyperbolic Wave Dynamics from Data
Woojin Cho, Kookjin Lee, Noseong Park, Donsub Rim, Gerrit Welper
Main category: cs.LG
TL;DR: A data-driven dimensionality reduction method using low rank neural representation (LRNR) for hyperbolic wave propagation data, achieving efficient compression and interpretable physical feature decomposition.
Details
Motivation: To develop efficient representations for physics-based hyperbolic wave propagation data, motivated by theoretical proofs of efficient representation existence for this wave class.
Method: Uses specialized neural network architecture called low rank neural representation (LRNR) within a hypernetwork framework, combining deep learning techniques to learn low-dimensional representations directly from data.
Result: Learned representations naturally exhibit low rank tensor structure, revealing interpretable physical feature decomposition where each mode corresponds to specific wave characteristics. Enables efficient inference via compression.
Conclusion: LRNR architecture successfully provides efficient, interpretable low-dimensional representations for hyperbolic wave propagation with practical compression benefits for performance-demanding applications.
Abstract: We present a data-driven dimensionality reduction method that is well-suited for physics-based data representing hyperbolic wave propagation. The method utilizes a specialized neural network architecture called low rank neural representation (LRNR) inside a hypernetwork framework. The architecture is motivated by theoretical results that rigorously prove the existence of efficient representations for this wave class. We illustrate through archetypal examples that such an efficient low-dimensional representation of propagating waves can be learned directly from data through a combination of deep learning techniques. We observe that a low rank tensor representation arises naturally in the trained LRNRs, and that this reveals a new decomposition of wave propagation where each decomposed mode corresponds to interpretable physical features. Furthermore, we demonstrate that the LRNR architecture enables efficient inference via a compression scheme, which is a potentially important feature when deploying LRNRs in demanding performance regimes.
[407] Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control
Thomas T. Zhang, Daniel Pfrommer, Chaoyi Pan, Nikolai Matni, Max Simchowitz
Main category: cs.LG
TL;DR: Theoretical analysis shows action-chunking and exploratory augmentation in imitation learning avoid exponential compounding errors through control-theoretic stability mechanisms.
Details
Motivation: To understand why action-chunking and exploratory data augmentation work well in imitation learning despite known issues with compounding errors in continuous control settings.
Method: Theoretical analysis using control-theoretic stability principles and empirical validation on robot learning benchmarks.
Result: Action-chunking and exploratory augmentation circumvent exponential compounding errors in different regimes, with control-theoretic stability identified as the key mechanism.
Conclusion: Control-theoretic analysis provides finer insights into compounding error and tighter statistical guarantees than previous information-theoretic approaches alone.
Abstract: This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.
[408] Prior-Guided Flow Matching for Target-Aware Molecule Design with Learnable Atom Number
Jingyuan Zhou, Hao Qian, Shikui Tu, Lei Xu
Main category: cs.LG
TL;DR: PAFlow is a novel target-aware molecular generation model that improves structure-based drug design by incorporating prior interaction guidance and learnable atom number prediction to address issues with unstable probability dynamics and molecule-protein size mismatch.
Details
Motivation: Current generative models for structure-based drug design suffer from unstable probability dynamics and mismatch between generated molecule size and protein pocket geometry, leading to inconsistent quality and off-target effects.
Method: PAFlow uses flow matching framework with conditional flow matching for discrete atom types, incorporates a protein-ligand interaction predictor to guide generation toward higher-affinity regions, and includes an atom number predictor based on protein pocket information.
Result: On CrossDocked2020 benchmark, PAFlow achieves state-of-the-art binding affinity (up to -8.31 Avg. Vina Score) while maintaining favorable molecular properties.
Conclusion: PAFlow effectively addresses key challenges in structure-based drug design by providing stable generation dynamics and better alignment between generated molecules and target protein geometry.
Abstract: Structure-based drug design (SBDD), aiming to generate 3D molecules with high binding affinity toward target proteins, is a vital approach in novel drug discovery. Although recent generative models have shown great potential, they suffer from unstable probability dynamics and a mismatch between generated molecule size and protein pocket geometry, resulting in inconsistent quality and off-target effects. We propose PAFlow, a novel target-aware molecular generation model featuring prior interaction guidance and a learnable atom number predictor. PAFlow adopts the efficient flow matching framework to model the generation process and constructs a new form of conditional flow matching for discrete atom types. A protein-ligand interaction predictor is incorporated to guide the vector field toward higher-affinity regions during generation, while an atom number predictor based on protein pocket information is designed to better align generated molecule size with target geometry. Extensive experiments on the CrossDocked2020 benchmark show that PAFlow achieves a new state-of-the-art in binding affinity (up to -8.31 Avg. Vina Score) while simultaneously maintaining favorable molecular properties.
[409] A Compositional Kernel Model for Feature Learning
Feng Ruan, Keli Liu, Michael Jordan
Main category: cs.LG
TL;DR: Compositional kernel ridge regression with coordinate-wise reweighting serves as a testbed for feature learning, showing how relevant variables are recovered while noise is eliminated.
Details
Motivation: To study feature learning in compositional architectures and understand how different kernels affect variable selection in the presence of noise.
Method: Variational formulation of compositional kernel ridge regression with coordinate-wise reweighting, analyzing global minimizers and stationary points.
Result: Both global minimizers and stationary points eliminate noise variables when noise is Gaussian. ℓ₁-type kernels (e.g., Laplace) recover nonlinear features at stationary points, while Gaussian kernels only recover linear ones.
Conclusion: The choice of kernel significantly impacts feature recovery in compositional models, with ℓ₁-type kernels being superior for capturing nonlinear effects compared to Gaussian kernels.
Abstract: We study a compositional variant of kernel ridge regression in which the predictor is applied to a coordinate-wise reweighting of the inputs. Formulated as a variational problem, this model provides a simple testbed for feature learning in compositional architectures. From the perspective of variable selection, we show how relevant variables are recovered while noise variables are eliminated. We establish guarantees showing that both global minimizers and stationary points discard noise coordinates when the noise variables are Gaussian distributed. A central finding is that $\ell_1$-type kernels, such as the Laplace kernel, succeed in recovering features contributing to nonlinear effects at stationary points, whereas Gaussian kernels recover only linear ones.
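The model described above can be illustrated with a minimal sketch: kernel ridge regression in which a Laplace (ℓ₁-type) kernel is evaluated on coordinate-wise reweighted inputs. The non-negative weight vector `beta`, the ridge scaling `lam * n`, and all function names are assumptions made for illustration; the paper's exact variational formulation is not reproduced here. The point of the sketch is only that a coordinate with weight zero has no influence on the predictor.

```python
import math

def laplace_kernel(u, v, beta):
    """Laplace kernel on reweighted inputs: exp(-sum_j beta_j * |u_j - v_j|).

    beta is an assumed non-negative weight per coordinate (hypothetical name).
    """
    return math.exp(-sum(b * abs(a - c) for b, a, c in zip(beta, u, v)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_krr(X, y, beta, lam=0.1):
    """Return dual coefficients alpha solving (K + lam * n * I) alpha = y."""
    n = len(X)
    K = [[laplace_kernel(X[i], X[j], beta) + (lam * n if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)

def predict(X_train, alpha, beta, x):
    """Evaluate the fitted predictor at a new point x."""
    return sum(a * laplace_kernel(xi, x, beta) for a, xi in zip(alpha, X_train))

# Toy data: the response depends on the first coordinate only.
X = [[0.0, 3.0], [1.0, -2.0], [2.0, 7.0], [3.0, 0.5]]
y = [0.0, 1.0, 4.0, 9.0]
beta = [1.0, 0.0]  # weight zero on the second (noise) coordinate
alpha = fit_krr(X, y, beta)
same = predict(X, alpha, beta, [1.5, 0.0]) == predict(X, alpha, beta, [1.5, 100.0])
print("prediction ignores the zero-weight coordinate:", same)
```

In the paper's setting the weights are learned jointly with the predictor; here `beta` is fixed by hand purely to show what "discarding a noise coordinate" means for the fitted function.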
[410] From Uniform to Adaptive: General Skip-Block Mechanisms for Efficient PDE Neural Operators
Lei Liu, Zhongyi Yu, Hong Wang, Huanshuo Dong, Haiyang Xin, Hongwei Zhao, Bin Li
Main category: cs.LG
TL;DR: Skip-Block Routing (SBR) is a framework for Transformer-based neural operators that reduces computational cost by approximately 50% in FLOPs while maintaining accuracy, by adaptively allocating processing capacity based on token complexity.
Details
Motivation: Current neural operators for solving PDEs suffer from computational inefficiency due to uniform computational costs despite varying physical field complexities, especially problematic in large-scale engineering tasks.
Method: SBR uses a routing mechanism to learn token complexity rankings during training, then during inference selectively forwards tokens through layers based on their complexity, focusing more processing on complex regions.
Result: SBR reduces computational cost by approximately 50% in FLOPs, achieves up to 2x faster inference speed, and maintains accuracy when integrated into various neural operators.
Conclusion: SBR provides an effective framework for adaptive computation in neural operators, addressing the fundamental mismatch between uniform computational costs and varying physical field complexities.
Abstract: In recent years, Neural Operators (NOs) have gradually emerged as a popular approach for solving Partial Differential Equations (PDEs). However, their application to large-scale engineering tasks suffers from significant computational overhead. Current models impose a uniform computational cost while physical fields exhibit vastly different complexities, a fundamental mismatch that is the root of this inefficiency. For instance, in turbulent flows, intricate vortex regions require deeper network processing than stable flows. To address this, we introduce Skip-Block Routing (SBR), a general framework designed for Transformer-based neural operators, capable of being integrated into their multi-layer architectures. First, SBR uses a routing mechanism to learn the complexity and ranking of tokens, which is then applied during inference. Then, in later layers, it decides how many tokens are passed forward based on this ranking. This way, the model focuses more processing capacity on the tokens that are more complex. Experiments demonstrate that SBR is a general framework that seamlessly integrates into various neural operators. Our method reduces computational cost by approximately 50% in terms of Floating Point Operations (FLOPs), while still delivering up to 2x faster inference without sacrificing accuracy.
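The routing idea in this abstract can be sketched roughly as follows: tokens are ranked by a complexity score, and at a given layer only the top-ranked fraction is processed by the block while the rest bypass it. Everything concrete here is an assumption for illustration: the `scores` list stands in for the learned router output, `keep_ratio` for the per-layer budget, and letting skipped tokens pass through unchanged is one plausible reading of "skip", not a detail confirmed by the source.

```python
import math

def skip_block_layer(tokens, scores, keep_ratio, block):
    """Apply `block` only to the highest-scoring tokens; others pass through.

    tokens:     list of token vectors (lists of floats)
    scores:     hypothetical per-token complexity scores from a router
    keep_ratio: assumed fraction of tokens this layer processes
    block:      callable standing in for a Transformer block
    """
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [block(t) if i in keep else t for i, t in enumerate(tokens)]

# Toy usage: a "block" that doubles each feature, processing half the tokens.
tokens = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.1, 0.9, 0.3, 0.8]
out = skip_block_layer(tokens, scores, 0.5, lambda t: [2.0 * x for x in t])
print(out)
```

With a keep ratio of 0.5 across the later layers, roughly half of the block evaluations are avoided, which is the kind of saving the reported FLOPs reduction refers to.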
[411] GRAM-DTI: adaptive multimodal representation learning for drug target interaction prediction
Feng Jiang, Amina Mollaysa, Hehuan Ma, Tommaso Mansi, Junzhou Huang, Mangal Prakash, Rui Liao
Main category: cs.LG
TL;DR: GRAMDTI is a multimodal pretraining framework for drug-target interaction prediction that integrates molecular and protein data from four modalities using volume-based contrastive learning and adaptive modality dropout.
Details
Motivation: Existing DTI approaches primarily use SMILES protein pairs and fail to leverage rich multimodal information available for small molecules and proteins, limiting their predictive performance.
Method: GRAMDTI uses volume-based contrastive learning across four modalities, adaptive modality dropout to handle modality informativeness, and incorporates IC50 activity measurements as weak supervision for biologically meaningful representations.
Result: Experiments on four publicly available datasets show GRAMDTI consistently outperforms state-of-the-art baselines in DTI prediction.
Conclusion: The framework demonstrates benefits of higher-order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
Abstract: Drug target interaction (DTI) prediction is a cornerstone of computational drug discovery, enabling rational design, repurposing, and mechanistic insights. While deep learning has advanced DTI modeling, existing approaches primarily rely on SMILES protein pairs and fail to exploit the rich multimodal information available for small molecules and proteins. We introduce GRAMDTI, a pretraining framework that integrates multimodal molecular and protein inputs into unified representations. GRAMDTI extends volume-based contrastive learning to four modalities, capturing higher-order semantic alignment beyond conventional pairwise approaches. To handle modality informativeness, we propose adaptive modality dropout, dynamically regulating each modality’s contribution during pre-training. Additionally, IC50 activity measurements, when available, are incorporated as weak supervision to ground representations in biologically meaningful interaction strengths. Experiments on four publicly available datasets demonstrate that GRAMDTI consistently outperforms state-of-the-art baselines. Our results highlight the benefits of higher-order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
[412] Exploring Federated Learning for Thermal Urban Feature Segmentation – A Comparison of Centralized and Decentralized Approaches
Leonhard Duda, Khadijeh Alibabaei, Elena Vollmer, Leon Klug, Valentin Kozlov, Lisana Berberi, Mishal Benz, Rebekka Volk, Juan Pedro Gutiérrez Hermosillo Muriedas, Markus Götz, Judith Sáínz-Pardo Díaz, Álvaro López García, Frank Schultmann, Achim Streit
Main category: cs.LG
TL;DR: This paper evaluates Federated Learning (FL) for UAV-based thermal image segmentation in real-world scenarios, comparing FL approaches with centralized learning across accuracy, training time, communication overhead, and energy usage.
Details
Motivation: FL addresses privacy and technical restrictions by enabling distributed training without sharing raw data, which is particularly relevant for UAV thermal imaging data from different cities with non-identical distributions.
Method: The study implements and evaluates FL algorithms in real deployment scenarios using UAV thermal images from two German cities, comparing client-controlled and server-controlled workflows against centralized learning baselines.
Result: The research provides empirical evaluation of FL performance metrics including model accuracy, training efficiency, communication costs, and energy consumption in practical UAV imaging applications.
Conclusion: The findings serve as a valuable reference for understanding the practical applications and limitations of FL methods in segmentation tasks for UAV-based imaging systems.
Abstract: Federated Learning (FL) is an approach for training a shared Machine Learning (ML) model with distributed training data and multiple participants. FL allows bypassing limitations of the traditional Centralized Machine Learning (CL) if data cannot be shared or stored centrally due to privacy or technical restrictions – the participants train the model locally with their training data and do not need to share it with the other participants. This paper investigates the practical implementation and effectiveness of FL in a real-world scenario, specifically focusing on unmanned aerial vehicle (UAV)-based thermal images for common thermal feature detection in urban environments. The distributed nature of the data arises naturally and makes it suitable for FL applications, as images captured in two German cities are available. This application presents unique challenges due to non-identical distribution and feature characteristics of data captured at both locations. The study makes several key contributions by evaluating FL algorithms in real deployment scenarios rather than simulation. We compare several FL approaches with a centralized learning baseline across key performance metrics such as model accuracy, training time, communication overhead, and energy usage. This paper also explores various FL workflows, comparing client-controlled workflows and server-controlled workflows. The findings of this work serve as a valuable reference for understanding the practical application and limitations of the FL methods in segmentation tasks in UAV-based imaging.
[413] HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling
Max van Spengler, Artem Moskalev, Tommaso Mansi, Mangal Prakash, Rui Liao
Main category: cs.LG
TL;DR: HyperHELM introduces hyperbolic geometry for mRNA language modeling, outperforming Euclidean models on biological tasks and improving generalization.
Details
Motivation: Euclidean geometry mismatches hierarchical biological structures, while hyperbolic geometry better accommodates hierarchical data but hasn't been applied to mRNA language modeling.
Method: Hybrid framework with hyperbolic layers atop a Euclidean backbone for masked language model pre-training in hyperbolic space, aligning representations with mRNA-amino acid biological hierarchy.
Result: Outperforms Euclidean baselines on 9/10 property prediction tasks (10% average improvement), excels in out-of-distribution generalization, and surpasses hierarchy-aware Euclidean models by 3% in antibody region annotation accuracy.
Conclusion: Hyperbolic geometry serves as an effective inductive bias for hierarchical language modeling of mRNA sequences.
Abstract: Language models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry provides a better alternative for accommodating hierarchical data, it has yet to find a way into language modeling for mRNA sequences. In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop a Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 tasks involving property prediction, with 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.
[414] ReNF: Rethinking the Design Space of Neural Long-Term Time Series Forecasters
Yihang Lu, Xianwei Meng, Enhong Chen
Main category: cs.LG
TL;DR: A principled approach to Long-term Time Series Forecasting that combines Auto-Regressive and Direct Output methods, enabling simple MLPs to outperform complex models.
Details
Motivation: Current neural forecasting research overemphasizes architectural complexity while neglecting fundamental forecasting principles, hindering progress in the field.
Method: Proposes Boosted Direct Output (BDO) strategy combining AR and DO advantages, with parameter stabilization through smooth tracking, and introduces Multiple Neural Forecasting Theorem.
Result: A simple MLP with BDO achieves state-of-the-art performance, outperforming recent complex models in nearly all cases without domain-specific considerations.
Conclusion: The work establishes theoretical foundations, provides empirical verification of the theorem, and identifies promising future research directions for principled forecasting approaches.
Abstract: Neural Forecasters (NFs) are a cornerstone of Long-term Time Series Forecasting (LTSF). However, progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting principles. In this work, we return to first principles to redesign the LTSF paradigm. We begin by introducing a Multiple Neural Forecasting Theorem that provides a theoretical basis for our approach. We propose Boosted Direct Output (BDO), a novel forecasting strategy that synergistically combines the advantages of both Auto-Regressive (AR) and Direct Output (DO). In addition, we stabilize the learning process by smoothly tracking the model’s parameters. Extensive experiments show that these principled improvements enable a simple MLP to achieve state-of-the-art performance, outperforming recent, complex models in nearly all cases, without any specific considerations in the area. Finally, we empirically verify our theorem, establishing a dynamic performance bound and identifying promising directions for future research. The code for review is available at: .
[415] Differentiable Fast Top-K Selection for Large-Scale Recommendation
Yanjie Zhu, Zhen Zhang, Yunli Wang, Zhiqiang Wang, Yu Li, Rufan Zhou, Shiyang Wen, Peng Jiang, Chenhao Lin, Jian Yang
Main category: cs.LG
TL;DR: DFTopK is a novel differentiable Top-K operator with O(n) complexity that enables efficient end-to-end training in large-scale recommendation systems, achieving better performance and revenue lift compared to existing methods.
Details
Motivation: Existing differentiable Top-K methods have limitations: Learning-to-Rank suffers from objective misalignment, while differentiable sorting methods introduce gradient conflicts and have O(n log n) complexity, hindering efficient training in large-scale systems.
Method: DFTopK constructs a differentiable Top-K operator by relaxing normalization constraints, which admits a closed-form solution and avoids sorting, achieving optimal linear-time complexity without gradient conflicts.
Result: DFTopK significantly improves training efficiency and achieves superior performance, enabling scaling up training samples. In online A/B tests, it yielded +1.77% revenue lift with the same computational budget compared to baseline.
Conclusion: DFTopK is the first differentiable Top-K operator for recommendation systems with theoretically optimal linear-time complexity, offering efficient end-to-end training and improved performance in large-scale applications.
Abstract: Cascade ranking is a widely adopted paradigm in large-scale information retrieval systems for Top-K item selection. However, the Top-K operator is non-differentiable, hindering end-to-end training. Existing methods include Learning-to-Rank approaches (e.g., LambdaLoss), which optimize ranking metrics like NDCG and suffer from objective misalignment, and differentiable sorting-based methods (e.g., ARF, LCRON), which relax permutation matrices for direct Top-K optimization but introduce gradient conflicts through matrix aggregation. A promising alternative is to directly construct a differentiable approximation of the Top-K selection operator, bypassing the use of soft permutation matrices. However, even state-of-the-art differentiable Top-K operators (e.g., LapSum) require $O(n \log n)$ complexity due to their dependence on sorting for solving the threshold. Thus, we propose DFTopK, a novel differentiable Top-K operator achieving optimal $O(n)$ time complexity. By relaxing normalization constraints, DFTopK admits a closed-form solution and avoids sorting. DFTopK also avoids the gradient conflicts inherent in differentiable sorting-based methods. We evaluate DFTopK on both the public benchmark RecFLow and an industrial system. Experimental results show that DFTopK significantly improves training efficiency while achieving superior performance, which enables us to scale up training samples more efficiently. In the online A/B test, DFTopK yielded a +1.77% revenue lift with the same computational budget compared to the baseline. To the best of our knowledge, this work is the first to introduce differentiable Top-K operators into recommendation systems and the first to achieve theoretically optimal linear-time complexity for Top-K selection. We have open-sourced our implementation to facilitate future research in both academia and industry.
[416] Dynamic Routing Between Experts: A Data-Efficient Approach to Continual Learning in Vision-Language Models
Jay Mohta, Kenan Emir Ak, Dimitrios Dimitriadis, Yan Xu, Mingwei Shen
Main category: cs.LG
TL;DR: A routing-based approach for Vision-Language Models that prevents catastrophic forgetting when fine-tuning on new tasks, preserving foundational knowledge while integrating specialized capabilities without requiring simultaneous access to all datasets.
Details
Motivation: VLMs suffer from catastrophic forgetting when sequentially fine-tuned on new tasks, degrading performance on previously learned capabilities. Multi-task learning mitigates this but requires simultaneous access to all datasets and has computational overhead scaling linearly with tasks.
Method: Routing-based approach that enables integration of new tasks while preserving foundational knowledge from pretraining. Evaluated on InternVL-2 models (2B and 8B parameters) with extensive ablation studies on scalability and robustness.
Result: Routing preserves foundational capabilities on benchmarks (ChartQA, MMBench, DocVQA) while improving specialized task accuracy. Achieves this without concurrent access to all task data, avoiding computational overhead. Performs well with semantically related tasks and enables superior cross-modal transfer.
Conclusion: The routing mechanism effectively prevents catastrophic forgetting in VLMs, maintains foundational knowledge, enables scalable task integration, and facilitates cross-modal knowledge transfer not achieved by existing continual learning methods.
Abstract: Vision-Language Models (VLMs) suffer from catastrophic forgetting when sequentially fine-tuned on new tasks, degrading performance on previously learned foundational and task-specific capabilities. While multi-task learning can mitigate forgetting, it requires simultaneous access to all datasets and imposes computational overhead that scales linearly with the number of tasks. In this work, we introduce a routing-based approach that enables the integration of new tasks while preserving the foundational knowledge acquired during pretraining. We evaluate our method using InternVL-2 models (2B and 8B parameters) and demonstrate that routing preserves the model’s foundational capabilities by maintaining performance on general-purpose benchmarks such as ChartQA, MMBench, and DocVQA, while simultaneously improving accuracy on specialized tasks. Importantly, our approach achieves this without requiring concurrent access to data from all tasks, avoiding the significant computational and data overhead associated with traditional multi-task learning. We further conduct extensive ablation studies to evaluate the scalability and robustness of routing-based learning, showing that the approach is resilient to a growing number of tasks and performs particularly well when new tasks are semantically related. Finally, we show that the routing mechanism enables superior cross-modal transfer between language and vision capabilities, allowing knowledge learned in one modality to enhance performance in another, a capability not achieved by existing continual learning methods.
[417] Reflections from Research Roundtables at the Conference on Health, Inference, and Learning (CHIL) 2025
Emily Alsentzer, Marie-Laure Charpignon, Bill Chen, Niharika D’Souza, Jason Fries, Yixing Jiang, Aparajita Kashyap, Chanwoo Kim, Simon Lee, Aishwarya Mandyam, Ashery Mbilinyi, Nikita Mehandru, Nitish Nagesh, Brighton Nuwagira, Emma Pierson, Arvind Pillai, Akane Sano, Tanveer Syeda-Mahmood, Shashank Yadav, Elias Adhanom, Muhammad Umar Afza, Amelia Archer, Suhana Bedi, Vasiliki Bikia, Trenton Chang, George H. Chen, Winston Chen, Erica Chiang, Edward Choi, Octavia Ciora, Paz Dozie-Nnamah, Shaza Elsharief, Matthew Engelhard, Ali Eshragh, Jean Feng, Josh Fessel, Scott Fleming, Kei Sen Fong, Thomas Frost, Soham Gadgil, Judy Gichoya, Leeor Hershkovich, Sujeong Im, Bhavya Jain, Vincent Jeanselme, Furong Jia, Qixuan Jin, Yuxuan Jin, Daniel Kapash, Geetika Kapoor, Behdokht Kiafar, Matthias Kleiner, Stefan Kraft, Annika Kumar, Daeun Kyung, Zhongyuan Liang, Joanna Lin, Qianchu Liu, Chang Liu, Hongzhou Luan, Chris Lunt, Leopoldo Julían Lechuga López, Matthew B. A. McDermott, Shahriar Noroozizadeh, Connor O’Brien, YongKyung Oh, Mixail Ota, Stephen Pfohl, Meagan Pi, Tanmoy Sarkar Pias, Emma Rocheteau, Avishaan Sethi, Toru Shirakawa, Anita Silver, Neha Simha, Kamile Stankeviciute, Max Sunog, Peter Szolovits, Shengpu Tang, Jialu Tang, Aaron Tierney, John Valdovinos, Byron Wallace, Will Ke Wang, Peter Washington, Jeremy Weiss, Daniel Wolfe, Emily Wong, Hye Sun Yun, Xiaoman Zhang, Xiao Yu Cindy Zhang, Hayoung Jeong, Kaveri A. Thakoor
Main category: cs.LG
TL;DR: CHIL 2025 conference featured Research Roundtables with 8 topics on ML-healthcare intersections, fostering collaborative discussions among 19 chairs.
Details
Motivation: To catalyze collaborative dialogue and address critical challenges at the intersection of machine learning and healthcare through small-group discussions.
Method: Hosted Research Roundtables moderated by senior and junior chairs, focusing on rigorous discussion, exploration of opportunities, and collective ideation.
Result: Eight roundtables were successfully conducted covering key topics including explainability, fairness, causality, foundation models, and scalable healthcare solutions.
Conclusion: The roundtables successfully fostered open exchange and inclusive engagement, advancing actionable directions in health ML research.
Abstract: The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year’s program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, exploration of emerging opportunities, and collective ideation toward actionable directions in the field. In total, eight roundtables were held by 19 roundtable chairs on topics of “Explainability, Interpretability, and Transparency,” “Uncertainty, Bias, and Fairness,” “Causality,” “Domain Adaptation,” “Foundation Models,” “Learning from Small Medical Data,” “Multimodal Methods,” and “Scalable, Translational Healthcare Solutions.”
[418] QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation
Yang Zhang, Rui Zhang, Jiaming Guo, Lei Huang, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
Main category: cs.LG
TL;DR: QiMeng-SALV introduces signal-aware learning for Verilog code generation by extracting verified signal-level implementations from partially incorrect modules to provide meaningful functional rewards for RL optimization.
Details
Motivation: The lack of meaningful functional rewards hinders RL-based preference optimization for generating functionally correct Verilog code in automated circuit design.
Method: Extracts verified signal-aware implementations from partially incorrect modules by comparing signal functionality with reference modules, using AST to identify correct signal-level code segments, and applying signal-aware DPO optimization.
Result: Achieves state-of-the-art performance on VerilogEval and RTLLM benchmarks, with a 7B parameter model matching DeepSeek v3 671B model performance and significantly outperforming CodeV.
Conclusion: Proposes a paradigm shift from module-level to fine-grained signal-level optimization in Verilog code generation, effectively addressing insufficient functional rewards.
Abstract: The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation, which is important for automated circuit design. However, the lack of meaningful functional rewards hinders preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV), which leverages code segments of functionally correct output signals to optimize RL training. Because Verilog code specifies the structural interconnection of hardware gates and wires, so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations from partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in a generated module by comparing them with those of the reference module in the training data. Then an abstract syntax tree (AST) is employed to identify signal-aware code segments that can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO, which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at https://github.com/zy1xxx/SALV.
[419] Generating Auxiliary Tasks with Reinforcement Learning
Judah Goldfeder, Matthew So, Hod Lipson
Main category: cs.LG
TL;DR: RL-AUX uses reinforcement learning to automatically generate auxiliary tasks and optimize auxiliary loss weights, eliminating the need for human-designed auxiliary tasks and complex bi-level optimization.
Details
Motivation: Current auxiliary learning methods require costly human-labeled auxiliary tasks or computationally expensive bi-level optimization, limiting practical adoption.
Method: Proposes RL-AUX framework that uses reinforcement learning to dynamically assign auxiliary labels and learn per-example weights for auxiliary loss, rewarding improvements on the primary task.
Result: On CIFAR-100 grouped into 20 superclasses, RL-AUX outperforms human-labeled auxiliary tasks and matches bi-level optimization baseline performance, with similar strong results on other classification datasets.
Conclusion: Reinforcement learning is a viable approach for automatically generating effective auxiliary tasks, reducing dependency on human expertise and complex optimization methods.
Abstract: Auxiliary Learning (AL) is a form of multi-task learning in which a model trains on auxiliary tasks to boost performance on a primary objective. While AL has improved generalization across domains such as navigation, image classification, and NLP, it often depends on human-labeled auxiliary tasks that are costly to design and require domain expertise. Meta-learning approaches mitigate this by learning to generate auxiliary tasks, but typically rely on gradient-based bi-level optimization, adding substantial computational and implementation overhead. We propose RL-AUX, a reinforcement-learning (RL) framework that dynamically creates auxiliary tasks by assigning auxiliary labels to each training example, rewarding the agent whenever its selections improve the performance on the primary task. We also explore learning per-example weights for the auxiliary loss. On CIFAR-100 grouped into 20 superclasses, our RL method outperforms human-labeled auxiliary tasks and matches the performance of a prominent bi-level optimization baseline. We present similarly strong results on other classification datasets. These results suggest RL is a viable path to generating effective auxiliary tasks.
[420] Exploring Human-AI Conceptual Alignment through the Prism of Chess
Semyon Lomasov, Judah Goldfeder, Mehmet Hamza Erol, Matthew So, Yao Yan, Addison Howard, Nathan Kutz, Ravid Shwartz Ziv
Main category: cs.LG
TL;DR: AI chess models achieve grandmaster-level play but their understanding of human strategic concepts paradoxically decreases in deeper layers, revealing reliance on memorized patterns rather than abstract understanding.
Details
Motivation: To investigate whether AI systems truly understand human concepts or just mimic surface patterns, using chess as a domain where human creativity meets precise strategic concepts.
Method: Analyzed a 270M-parameter transformer chess model with layer-wise concept analysis, introduced the first Chess960 dataset with 240 expert-annotated positions across 6 strategic concepts to test conceptual robustness beyond memorization.
Result: Early layers encode human concepts like center control and knight outposts with 85% accuracy, but deeper layers (despite superior performance) drop to 50-65% accuracy. When opening theory is eliminated via Chess960, concept recognition drops 10-20%, showing reliance on memorized patterns.
Conclusion: Current AI architectures face a fundamental tension: representations that win games diverge from those aligning with human thinking, suggesting AI systems develop increasingly alien intelligence as they optimize for performance, posing challenges for genuine human-AI collaboration.
Abstract: Do AI systems truly understand human concepts or merely mimic surface patterns? We investigate this through chess, where human creativity meets precise strategic concepts. Analyzing a 270M-parameter transformer that achieves grandmaster-level play, we uncover a striking paradox: while early layers encode human concepts like center control and knight outposts with up to 85% accuracy, deeper layers, despite driving superior performance, drift toward alien representations, dropping to 50-65% accuracy. To test conceptual robustness beyond memorization, we introduce the first Chess960 dataset: 240 expert-annotated positions across 6 strategic concepts. When opening theory is eliminated through randomized starting positions, concept recognition drops 10-20% across all methods, revealing the model’s reliance on memorized patterns rather than abstract understanding. Our layer-wise analysis exposes a fundamental tension in current architectures: the representations that win games diverge from those that align with human thinking. These findings suggest that as AI systems optimize for performance, they develop increasingly alien intelligence, a critical challenge for creative AI applications requiring genuine human-AI collaboration. Dataset and code are available at: https://github.com/slomasov/ChessConceptsLLM.
[421] An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
Chuyan Chen, Chenyang Ma, Zhangxin Li, Yutong He, Yanjie Dong, Kun Yuan
Main category: cs.LG
TL;DR: ARC-Top-K is a new gradient compressor that combines Top-K’s performance with Rand-K’s efficiency by using sketches to align sparsity patterns across nodes, enabling index-free All-Reduce operations.
Details
Motivation: Existing gradient compressors have limitations: Rand-K discards structural information and performs poorly, while Top-K loses contraction property and requires costly All-Gather operations.
Method: ARC-Top-K uses lightweight sketches of gradients to align sparsity patterns across nodes, enabling index-free All-Reduce while preserving globally significant information. It’s combined with momentum error feedback (EF21M).
Result: ARC-Top-K is provably contractive and achieves linear speedup with sharper convergence rates. Empirically, it matches Top-K accuracy while reducing wall-clock training time by up to 60.7%.
Conclusion: ARC-Top-K offers an efficient and scalable solution that combines Rand-K’s robustness with Top-K’s strong performance, addressing communication bottlenecks in distributed machine learning.
Abstract: Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an All-Reduce-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
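The index-mismatch problem that motivates this work can be illustrated with a toy example. The sketch below is not the paper's algorithm: the helper names (`top_k`, `rand_k`, `reduce_on_shared_support`) and the two four-entry gradients are invented for illustration, and a shared random seed merely stands in for whatever mechanism aligns the sparsity pattern across nodes. It only shows why per-node Top-K yields different index sets, whereas an agreed-upon index set lets nodes sum values position by position without exchanging indices.

```python
import random

def top_k(grad, k):
    """Keep the k largest-magnitude entries of grad as {index: value}."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in order[:k]}

def rand_k(grad, k, seed):
    """Keep k entries at indices drawn from a seed shared by all nodes."""
    idx = random.Random(seed).sample(range(len(grad)), k)
    return {i: grad[i] for i in idx}

def reduce_on_shared_support(sparse_grads):
    """Sum sparse gradients that all use the same index set."""
    support = set(sparse_grads[0])
    assert all(set(g) == support for g in sparse_grads), "supports must match"
    return {i: sum(g[i] for g in sparse_grads) for i in sorted(support)}

# Hypothetical local gradients on two nodes.
node_a = [0.9, -0.1, 0.05, -0.8]
node_b = [0.02, 0.7, -0.6, 0.1]

# Local Top-K: each node keeps different positions, so the values cannot be
# summed elementwise without also communicating the indices.
topk_a, topk_b = top_k(node_a, 2), top_k(node_b, 2)
supports_differ = set(topk_a) != set(topk_b)

# Shared seed: both nodes keep the same positions, so a plain sum suffices.
randk_a, randk_b = rand_k(node_a, 2, seed=0), rand_k(node_b, 2, seed=0)
summed = reduce_on_shared_support([randk_a, randk_b])
```

In this toy setting the shared-seed selection ignores gradient magnitude entirely, which is the weakness of Rand-K that the abstract describes; the paper's contribution is a way to agree on an informative index set instead.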
[422] HADSF: Aspect Aware Semantic Control for Explainable Recommendation
Zheng Nie, Peijie Sun
Main category: cs.LG
TL;DR: HADSF is a two-stage framework that addresses limitations in LLM-based information extraction for recommender systems by creating compact aspect vocabularies and performing constrained extraction, reducing hallucination and improving rating prediction.
Details
Motivation: Current LLM methods for review-based recommendation systems suffer from uncontrolled scope, noisy representations, lack of hallucination metrics, and unexplored cost-quality trade-offs across model scales.
Method: Hyper-Adaptive Dual-Stage Semantic Framework (HADSF) with two stages: (1) adaptive selection for corpus-level aspect vocabulary, (2) vocabulary-guided constrained extraction of aspect-opinion triples. Introduces Aspect Drift Rate (ADR) and Opinion Fidelity Rate (OFR) metrics.
Result: Experiments on approximately 3 million reviews across LLMs (1.5B-70B parameters) show that HADSF consistently reduces prediction error and enables smaller models to achieve competitive performance. The study also uncovers a nonmonotonic relationship between hallucination severity and rating prediction error.
Conclusion: HADSF provides an effective approach for hallucination-aware LLM-enhanced explainable recommendation, with released code and metrics to support reproducible research.
Abstract: Recent advances in large language models (LLMs) promise more effective information extraction for review-based recommender systems, yet current methods still (i) mine free-form reviews without scope control, producing redundant and noisy representations, (ii) lack principled metrics that link LLM hallucination to downstream effectiveness, and (iii) leave the cost-quality trade-off across model scales largely unexplored. We address these gaps with the Hyper-Adaptive Dual-Stage Semantic Framework (HADSF), a two-stage approach that first induces a compact, corpus-level aspect vocabulary via adaptive selection and then performs vocabulary-guided, explicitly constrained extraction of structured aspect-opinion triples. To assess the fidelity of the resulting representations, we introduce Aspect Drift Rate (ADR) and Opinion Fidelity Rate (OFR) and empirically uncover a nonmonotonic relationship between hallucination severity and rating prediction error. Experiments on approximately 3 million reviews across LLMs spanning 1.5B-70B parameters show that, when integrated into standard rating predictors, HADSF yields consistent reductions in prediction error and enables smaller models to achieve competitive performance in representative deployment scenarios. We release code, data pipelines, and metric implementations to support reproducible research on hallucination-aware, LLM-enhanced explainable recommendation. Code is available at https://github.com/niez233/HADSF
[423] A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios
Bernd Bohnet, Rumen Dangovski, Kevin Swersky, Sherry Moore, Arslan Chaudhry, Kathleen Kenealy, Noah Fiedel
Main category: cs.LG
TL;DR: Comparative analysis of LLM adaptation methods (SFT, LoRA, ICL) in data-scarce scenarios shows LoRA provides the best balance between skill acquisition and preserving general knowledge, while SFT excels at skills but causes catastrophic forgetting, and ICL works for factual knowledge but struggles with complex skills.
Details
Motivation: LLMs need tailored adaptation for specific applications, but full fine-tuning is computationally expensive and causes catastrophic forgetting. Need to find optimal adaptation strategies that balance skill acquisition with preserving general reasoning abilities.
Method: Comparative analysis of three adaptation methods: Supervised Finetuning (SFT), Parameter-Efficient Fine-Tuning via LoRA, and In-Context Learning (ICL) in data-scarce scenarios.
Result: LoRA provides the most effective balance, successfully instilling new skills with minimal impact on base model’s general knowledge. SFT excels at skill acquisition but is highly susceptible to catastrophic forgetting. ICL is effective for factual knowledge but struggles with complex skills.
Conclusion: Provides a practical framework for selecting LLM adaptation strategies, highlighting the critical distinction between skill acquisition and knowledge integration, and clarifying trade-offs between task-specific performance and preservation of general capabilities.
Abstract: The remarkable capabilities of Large Language Models (LLMs) often need to be tailored for specific applications, requiring the integration of new knowledge or the acquisition of new skills. While full fine-tuning is a powerful adaptation method, it is computationally expensive and can lead to a degradation of general reasoning abilities, a phenomenon known as catastrophic forgetting. A range of alternative techniques exists, each with its own trade-offs. In-Context Learning (ICL) is fast but limited by context length, while Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a middle ground by minimizing parameter changes. However, the challenge of catastrophic forgetting persists, raising questions about the best adaptation strategy for a given task. This paper presents a comparative analysis of Supervised Finetuning (SFT), LoRA, and ICL in data-scarce scenarios. We find that LoRA provides the most effective balance, successfully instilling new skills with minimal impact on the base model’s general knowledge. In contrast, while SFT excels at skill acquisition, it is highly susceptible to catastrophic forgetting. ICL is effective for incorporating factual knowledge but struggles with complex skills. Our findings offer a practical framework for selecting an LLM adaptation strategy. We highlight the critical distinction between skill acquisition and knowledge integration, and clarify the trade-offs between task-specific performance and the preservation of general capabilities.
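For readers unfamiliar with why LoRA "minimizes parameter changes", a minimal sketch of the low-rank update may help. It assumes the standard LoRA formulation, in which a frozen weight W is augmented by a scaled product of two small trainable matrices, W + (alpha / r) * B A, with B initialised to zero. The matrix sizes, the value of alpha, and the helper names are illustrative assumptions and are not taken from this paper's experiments.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs. a rank-r adapter."""
    return d_out * d_in, r * (d_in + d_out)

# Hypothetical tiny layer: 4 outputs, 3 inputs, rank-1 adapter.
d_out, d_in = 4, 3
w = [[float(i + j) for j in range(d_in)] for i in range(d_out)]  # frozen base weight
a = [[0.1, -0.2, 0.3]]              # shape (r, d_in), trainable
b = [[0.0] for _ in range(d_out)]   # shape (d_out, r), zero-initialised

# With B at zero the adapted weight equals the base weight, so training
# starts from the unmodified model.
w_start = lora_weight(w, a, b, alpha=2.0)

# Parameter budget for a hypothetical 4096 x 4096 layer at rank 8.
full, adapter = param_counts(4096, 4096, 8)
```

Because only A and B are updated, the base weights stay intact, which is one plausible reading of why the abstract reports less forgetting for LoRA than for full SFT.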
[424] Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals
Sophie Li, Nicholas Huang, Nayan Saxena, Nina Luo, Vincent Lin, Kevin Zhu, Sunishchal Dev
Main category: cs.LG
TL;DR: KAPPA is an inference-time pruning method that combines KL divergence, confidence, and entropy to selectively eliminate low-scoring branches, reducing computational costs while maintaining accuracy compared to standard Best-of-N approaches.
Details
Motivation: Standard Best-of-N methods for LLM reasoning incur high computational costs by generating all candidate solutions, while existing pruning methods like ST-BoN rely on consistency heuristics that don't directly evaluate branch quality.
Method: KAPPA uses a principled scoring function combining Kullback-Leibler divergence, confidence, and entropy to guide progressive pruning during inference, promoting diversity during exploration while selectively eliminating low-scoring branches.
Result: Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct show KAPPA achieves up to ~60% reduction in peak memory and ~90% reduction in total token generation compared to BoN, with minimal accuracy impact.
Conclusion: KAPPA provides an effective inference-time pruning method that substantially reduces computational costs while maintaining reasoning accuracy, stabilizing performance in smaller models.
Abstract: Large language models (LLMs) improve reasoning accuracy when generating multiple candidate solutions at test time, but standard methods like Best-of-N (BoN) incur high computational cost by fully generating all branches. Self-Truncation Best-of-N (ST-BoN) mitigates this by truncating unpromising paths early, but its reliance on consistency-based heuristics is a limitation as it does not directly evaluate branch quality. We present KL-Adjusted Pruned Path Algorithm (KAPPA), an inference-time method that combines Kullback-Leibler divergence, confidence, and entropy into a principled scoring function to guide progressive pruning. By promoting diversity during exploration and selectively eliminating low-scoring branches, KAPPA maintains accuracy while substantially reducing memory and token usage. Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct demonstrate that KAPPA stabilizes performance in smaller models and achieves up to ~60% reduction in peak memory and ~90% reduction in total token generation relative to BoN, with minimal impact on accuracy.
[425] Investigating the Robustness of Knowledge Tracing Models in the Presence of Student Concept Drift
Morgan Lee, Artem Frenk, Eamon Worden, Karish Gupta, Thinh Pham, Ethan Croteau, Neil Heffernan
Main category: cs.LG
TL;DR: Knowledge Tracing models suffer performance degradation due to concept drift in online learning platforms, with BKT being the most stable while attention-based models degrade faster.
Details
Motivation: To investigate how concept drift and changing student populations impact KT model performance across multiple academic years in online learning platforms.
Method: Applied four KT models (including BKT and attention-based models) to five academic years of data to assess susceptibility to concept drift.
Result: All KT models exhibited degraded performance, with BKT remaining most stable on newer data while attention-based models lost predictive power significantly faster.
Conclusion: KT models are susceptible to concept drift, with simpler models like BKT showing better stability than complex attention-based models over time.
Abstract: Knowledge Tracing (KT) has been an established problem in the educational data mining field for decades, and it is commonly assumed that the underlying learning process being modeled remains static. Given the ever-changing landscape of online learning platforms (OLPs), we investigate how concept drift and changing student populations can impact student behavior within an OLP through testing model performance both within a single academic year and across multiple academic years. Four well-studied KT models were applied to five academic years of data to assess how susceptible KT models are to concept drift. Through our analysis, we find that all four families of KT models can exhibit degraded performance; Bayesian Knowledge Tracing (BKT) remains the most stable KT model when applied to newer data, while more complex, attention-based models lose predictive power significantly faster.
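As background on the simplest model in this comparison, the following is a minimal sketch of the classic Bayesian Knowledge Tracing update for a single skill. It uses the standard four-parameter formulation (prior mastery, guess, slip, learn); the parameter values are arbitrary illustrations and are not fitted values from the paper.

```python
# Illustrative, hypothetical parameter values.
GUESS = 0.2   # P(correct | skill not mastered)
SLIP = 0.1    # P(incorrect | skill mastered)
LEARN = 0.15  # P(mastery acquired after one practice opportunity)

def predict_correct(p_know, guess=GUESS, slip=SLIP):
    """Probability that the next response is correct."""
    return p_know * (1 - slip) + (1 - p_know) * guess

def bkt_update(p_know, correct, guess=GUESS, slip=SLIP, learn=LEARN):
    """Bayes update on the observed response, then apply the learning transition."""
    if correct:
        num = p_know * (1 - slip)
        den = num + (1 - p_know) * guess
    else:
        num = p_know * slip
        den = num + (1 - p_know) * (1 - guess)
    posterior = num / den
    return posterior + (1 - posterior) * learn

p0 = 0.3  # hypothetical prior mastery
after_correct = bkt_update(p0, correct=True)
after_wrong = bkt_update(p0, correct=False)
```

With only a handful of parameters per skill, a model like this has little capacity to fit year-specific patterns, which is one possible intuition for the stability the abstract reports.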
[426] Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
Jungyeon Koh, Hyun Jong Yang
Main category: cs.LG
TL;DR: Proposes a unified framework for joint optimization of user association and resource allocation to enable efficient parallel speculative decoding in mobile edge computing systems, reducing latency by 23.7% on average.
Details
Motivation: Address the communication overhead and asynchronous delays in speculative decoding for on-device LLM inference in resource-constrained mobile edge computing environments.
Method: Multi-agent deep reinforcement learning algorithm to solve the joint user association and resource allocation problem, evaluated using Sionna simulator under realistic conditions.
Result: Achieves up to 28.0% and average 23.7% reduction in end-to-end latency without compromising inference accuracy.
Conclusion: Enables scalable and low-latency LLM services in MEC systems through optimized parallel speculative decoding.
Abstract: The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
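The draft-and-verify split described in the abstract can be sketched as follows. This is a simplified greedy variant, not the paper's parallel scheme or its resource-allocation method: the two "models" are toy deterministic functions, and the names `speculative_step`, `draft_next`, and `target_next` are hypothetical. In a deployment along the lines the abstract describes, the drafting loop would run on the device and the verification loop on the edge server.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    # Drafting phase: the lightweight model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verification phase: the target model checks each proposed token in order.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: one extra target token
    return accepted

def target_next(ctx):
    """Toy target model: count upward modulo 10."""
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    """Toy draft model: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

tokens = speculative_step([1], draft_next, target_next, k=4)
```

In this greedy variant the accepted tokens always match what the target model would have produced on its own, so drafting changes latency rather than output; the communication cost of shipping proposals between device and server is the overhead the paper sets out to manage.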
cs.MA
[427] EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory
Wenzhe Fan, Ning Yan, Masood Mortazavi
Main category: cs.MA
TL;DR: EvoMem is a multi-agent framework with dual-evolving memory that improves planning tasks through coordinated constraint tracking and iterative refinement.
Details
Motivation: The role of human-like memory in LLM-based multi-agent planning frameworks remains largely unexplored, despite being critical for natural language planning that requires iterative reasoning, constraint tracking, and error correction.
Method: EvoMem uses three agents (Constraint Extractor, Verifier, and Actor) with two memory modules: Constraint Memory (evolves across queries) and Query-feedback Memory (evolves within queries). Both memories reset after each query session.
Result: Evaluations on trip planning, meeting planning, and calendar scheduling show consistent performance improvements, demonstrating the framework’s effectiveness.
Conclusion: The success of EvoMem underscores the importance of memory mechanisms in enhancing multi-agent planning capabilities.
Abstract: Planning has been a cornerstone of artificial intelligence for solving complex problems, and recent progress in LLM-based multi-agent frameworks has begun to extend this capability. However, the role of human-like memory within these frameworks remains largely unexplored. Understanding how agents coordinate through memory is critical for natural language planning, where iterative reasoning, constraint tracking, and error correction drive success. Inspired by the working memory model in cognitive psychology, we present EvoMem, a multi-agent framework built on a dual-evolving memory mechanism. The framework consists of three agents (Constraint Extractor, Verifier, and Actor) and two memory modules: Constraint Memory (CMem), which evolves across queries by storing task-specific rules and constraints while remaining fixed within a query, and Query-feedback Memory (QMem), which evolves within a query by accumulating feedback across iterations for solution refinement. Both memory modules are reset at the end of each query session. Evaluations on trip planning, meeting planning, and calendar scheduling show consistent performance improvements, highlighting the effectiveness of EvoMem. This success underscores the importance of memory in enhancing multi-agent planning.
[428] Optimizing Multi-Lane Intersection Performance in Mixed Autonomy Environments
Manonmani Sekar, Nasim Nezamoddini
Main category: cs.MA
TL;DR: A novel traffic signal control framework using Graph Attention Networks (GAT) and Soft Actor-Critic (SAC) reinforcement learning to coordinate human-driven and autonomous vehicles at intersections, achieving significant improvements in delay reduction, safety, and fairness.
Details
Motivation: To address the challenge of ensuring smooth coordination between human-driven vehicles (HDVs) and connected autonomous vehicles (CAVs) at multilane intersections in mixed-autonomy traffic systems.
Method: Combines Graph Attention Networks (GAT) to model dynamic graph-structured traffic flow and capture spatial-temporal dependencies, with Soft Actor-Critic (SAC) reinforcement learning for adaptive signal control through entropy-optimized decision making.
Result: Achieved 24.1% reduction in average delay, up to 29.2% fewer traffic violations, and improved fairness ratio between HDVs and CAVs to 1.59 in SUMO-based simulations with varying traffic densities and CAV penetration rates.
Conclusion: The GAT-SAC framework shows significant promise for real-world deployment in mixed-autonomy traffic systems due to its effectiveness in improving traffic efficiency, safety, and fairness.
Abstract: One of the main challenges in managing traffic at multilane intersections is ensuring smooth coordination between human-driven vehicles (HDVs) and connected autonomous vehicles (CAVs). This paper presents a novel traffic signal control framework that combines Graph Attention Networks (GAT) with Soft Actor-Critic (SAC) reinforcement learning to address this challenge. GATs are used to model the dynamic graph-structured nature of traffic flow to capture spatial and temporal dependencies between lanes and signal phases. The proposed SAC is a robust off-policy reinforcement learning algorithm that enables adaptive signal control through entropy-optimized decision making. This design allows the system to coordinate the signal timing and vehicle movement simultaneously with objectives focused on minimizing travel time, enhancing performance, ensuring safety, and improving fairness between HDVs and CAVs. The model is evaluated using a SUMO-based simulation of a four-way intersection, incorporating different traffic densities and CAV penetration rates. The experimental results demonstrate the effectiveness of the GAT-SAC approach by achieving a 24.1% reduction in average delay and up to 29.2% fewer traffic violations compared to traditional methods. Additionally, the fairness ratio between HDVs and CAVs improved to 1.59, indicating more equitable treatment across vehicle types. These findings suggest that the GAT-SAC framework holds significant promise for real-world deployment in mixed-autonomy traffic systems.
[429] Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning
Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia
Main category: cs.MA
TL;DR: ACC-MARL enables learning multi-task, multi-agent policies for cooperative temporal objectives using automata-based task decomposition under centralized training with decentralized execution.
Details
Motivation: Existing approaches for multi-agent reinforcement learning with temporal objectives are sample-inefficient and limited to single-task scenarios, lacking the ability to handle complex multi-task coordination.
Method: Proposes Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL) framework that uses automata to represent tasks, decomposes complex tasks into simpler sub-tasks, and learns task-conditioned decentralized policies with centralized training.
Result: The approach enables emergent task-aware, multi-step coordination among agents (e.g., pressing buttons, holding doors, short-circuiting tasks) and allows optimal task assignment at test time using learned value functions.
Conclusion: ACC-MARL provides a feasible and correct framework for learning multi-task, multi-agent policies with temporal objectives, addressing key challenges in cooperative multi-agent reinforcement learning.
Abstract: We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL’s feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.
[430] CPU-Based Layout Design for Picker-to-Parts Pallet Warehouses
Timo Looms, Lin Xie
Main category: cs.MA
TL;DR: A CPU-inspired warehouse layout design with P/E/S zones significantly improves throughput and reduces labor compared to traditional layouts.
Details
Motivation: Picker-to-parts pallet warehouses suffer from inefficiencies due to conventional layouts causing excessive travel distances and high labor requirements.
Method: Discrete-event simulation comparing the novel CPU-inspired layout (with Performance, Efficiency, and Shared zones) against traditional rectangular (random and ABC storage) and Flying-V layouts.
Result: Significant improvements in throughput time and reduced labor requirements were achieved with the CPU-based layout design.
Conclusion: CPU-based layouts show strong potential for optimizing warehouse operations by improving efficiency and reducing labor costs.
Abstract: Picker-to-parts pallet warehouses often face inefficiencies due to conventional layouts causing excessive travel distances and high labor requirements. This study introduces a novel layout design inspired by CPU architecture, partitioning warehouse space into specialized zones, namely Performance (P), Efficiency (E), and Shared (S). Discrete-event simulation is used to evaluate this design against traditional rectangular (random and ABC storage) and Flying-V layouts. Results demonstrate significant improvements in throughput time and reduced labor requirements, highlighting the potential for CPU-based layouts in optimizing warehouse operations.
[431] When Is Diversity Rewarded in Cooperative Multi-Agent Learning?
Michael Amir, Matteo Bettini, Amanda Prorok
Main category: cs.MA
TL;DR: The paper provides theoretical and computational analysis of when heterogeneous teams outperform homogeneous ones in multi-agent systems, focusing on reward design and using MARL to validate theoretical predictions.
Details
Motivation: To understand when behavioral diversity in teams provides measurable benefits, particularly in multi-agent task allocation problems where principled explanations for heterogeneous team superiority are lacking.
Method: Combines theoretical analysis of reward aggregation operators with multi-agent reinforcement learning (MARL), introducing the HetGPS algorithm to optimize environments for heterogeneity advantage.
Result: Proves that operator curvature determines heterogeneity advantage, and shows HetGPS successfully rediscovers reward regimes predicted by theory across different environments.
Conclusion: Provides both theoretical framework and computational validation for understanding when behavioral diversity delivers measurable benefits in multi-agent systems.
Abstract: The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, we study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the $N$ agents’ effort allocations on individual tasks to a task score, and an outer operator that merges the $M$ task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneity Gain Parameter Search (HetGPS), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Across different environments, we show that HetGPS rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HetGPS and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.
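A small numerical example can make the inner/outer operator setup concrete. The sketch below is an illustration under assumptions of our own, not the paper's construction or its proof: two agents each split one unit of effort between two tasks, the outer operator is a plain sum, and the inner operator is either `max` (convex) or the mean (linear). The grid search and all function names are hypothetical.

```python
from itertools import product

# Effort an agent puts on task 1; the remainder goes to task 2.
GRID = [i / 10 for i in range(11)]

def team_reward(efforts_on_task1, inner):
    """Inner operator scores each task; the outer operator here is a plain sum."""
    task1 = inner(efforts_on_task1)
    task2 = inner([1 - e for e in efforts_on_task1])
    return task1 + task2

def heterogeneity_gain(inner, n_agents=2):
    """Best reward when agents may differ minus best reward when all act alike."""
    best_hetero = max(team_reward(list(alloc), inner)
                      for alloc in product(GRID, repeat=n_agents))
    best_homo = max(team_reward([e] * n_agents, inner) for e in GRID)
    return best_hetero - best_homo

def mean(xs):
    return sum(xs) / len(xs)

gain_max = heterogeneity_gain(max)    # convex inner operator: specialists help
gain_mean = heterogeneity_gain(mean)  # linear inner operator: no advantage
```

With `max` as the inner operator, one specialist per task scores both tasks fully, so the gain is about 1; with the mean, every allocation yields the same total and the gain is about 0. This matches the direction of the convexity test mentioned in the abstract, though the operators here were chosen purely for illustration.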
[432] Strategic Communication and Language Bias in Multi-Agent LLM Coordination
Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani, Pietro Liò
Main category: cs.MA
TL;DR: Communication in multi-agent LLM systems affects cooperation differently across languages, personalities, and game structures, with both positive coordination and negative bias reinforcement effects.
Details
Motivation: To investigate how communication amplifies language-driven effects on cooperation in multi-agent LLM scenarios, given that linguistic framing affects strategic coordination.
Method: Used FAIRGAME to simulate one-shot and repeated games across different languages and models (GPT-4o and Llama 4 Maverick), testing scenarios both with and without communication.
Result: Communication significantly influences agent behavior, but its impact varies by language, personality, and game structure, showing both coordination benefits and bias reinforcement.
Conclusion: Communication plays a dual role in multi-agent LLM systems - fostering coordination while potentially reinforcing biases, highlighting the complex interplay between language, communication, and strategic behavior.
Abstract: Large Language Model (LLM)-based agents are increasingly deployed in multi-agent scenarios where coordination is crucial but not always assured. Research shows that the way strategic scenarios are framed linguistically can affect cooperation. This paper explores whether allowing agents to communicate amplifies these language-driven effects. Leveraging FAIRGAME, we simulate one-shot and repeated games across different languages and models, both with and without communication. Our experiments, conducted with two advanced LLMs (GPT-4o and Llama 4 Maverick), reveal that communication significantly influences agent behavior, though its impact varies by language, personality, and game structure. These findings underscore the dual role of communication in fostering coordination and reinforcing biases.
[433] Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models
Brennen A. Hill, Mant Koh En Wei, Thangavel Jishnuanandh
Main category: cs.MA
TL;DR: The paper compares two communication strategies in multi-agent reinforcement learning: learned end-to-end communication vs engineered world model-based communication, finding the latter performs better in complex environments.
Details
Motivation: To investigate whether communication protocols in multi-agent systems should be engineered or learned end-to-end, particularly for robust coordination under partial observability in cooperative task-allocation problems.
Method: Proposed and compared two approaches: Learned Direct Communication (LDC) that learns protocols end-to-end, and Intention Communication using an engineered world model (ITGM) that simulates future states and compresses plans into messages via MGN. Evaluated on grid world environments with varying complexity.
Result: Emergent communication works in simple settings, but the engineered world model-based approach shows superior performance, sample efficiency, and scalability as environmental complexity increases.
Conclusion: Integrating structured, predictive models into MARL agents enables more effective goal-driven coordination, advocating for engineered approaches over purely learned communication in complex scenarios.
Abstract: Robust coordination is critical for effective decision-making in multi-agent systems, especially under partial observability. A central question in Multi-Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end-to-end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task-allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end-to-end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent’s own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal-directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model-based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal-driven coordination.
[434] Osprey: A Scalable Framework for the Orchestration of Agentic Systems
Thorsten Hellert, João Montenegro, Antonin Sulc
Main category: cs.MA
TL;DR: The Osprey Framework is a domain-agnostic architecture for scalable agentic systems that addresses coordination challenges in safety-critical environments through dynamic tool selection, plan-first orchestration, context-aware task extraction, and production-ready execution.
Details
Motivation: To solve the challenge of coordinating workflows across complex systems in safety-critical environments like scientific facilities, where existing language-model-driven agent approaches lack scalability, reliability, and human oversight.
Method: The framework provides: (i) dynamic capability classification for relevant tool selection; (ii) plan-first orchestration with explicit dependencies and optional human approval; (iii) context-aware task extraction combining dialogue history with external memory and domain resources; (iv) production-ready execution with checkpointing, artifact management, and modular deployment.
Result: Demonstrated versatility through two case studies: deployment at the Advanced Light Source particle accelerator and a tutorial-style wind farm monitoring example, establishing Osprey as reliable and transparent for agentic systems across diverse high-stakes domains.
Conclusion: Osprey Framework provides a reliable and transparent framework for scalable agentic systems that can effectively coordinate workflows in safety-critical environments across diverse domains.
Abstract: Coordinating workflows across complex systems remains a central challenge in safety-critical environments such as scientific facilities. Language-model-driven agents offer a natural interface for these tasks, but existing approaches often lack scalability, reliability, and human oversight. We introduce the Osprey Framework, a domain-agnostic, production-ready architecture for scalable agentic systems that integrate conversational context with robust tool orchestration across safety-critical domains. Our framework provides: (i) dynamic capability classification to select only relevant tools; (ii) plan-first orchestration with explicit dependencies and optional human approval; (iii) context-aware task extraction that combines dialogue history with external memory and domain resources; and (iv) production-ready execution with checkpointing, artifact management, and modular deployment. We demonstrate its versatility through two case studies: a deployment at the Advanced Light Source particle accelerator and a tutorial-style wind farm monitoring example. These results establish Osprey as a reliable and transparent framework for agentic systems across diverse high-stakes domains.
[435] Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning
Nicolò Dal Fabbro, Milad Mesbahi, Renato Mendes, João Borges de Sousa, George J. Pappas
Main category: cs.MA
TL;DR: Multi-agent reinforcement learning approach for long-term river plume mapping using AUVs, combining spatiotemporal Gaussian process regression with multi-head Q-network controllers for energy-efficient coordination.
Details
Motivation: To enable efficient long-term (multiple days) monitoring of dynamic river plumes using multiple autonomous underwater vehicles, addressing challenges of energy consumption and communication limitations.
Method: Integration of spatiotemporal Gaussian process regression with multi-head Q-network controllers that regulate AUV direction and speed, using intermittent central coordination for measurement collection and command issuance.
Result: Outperforms single- and multi-agent benchmarks in simulations using Delft3D ocean model, with scaling the number of agents improving both mean squared error and operational endurance (doubling AUVs can more than double endurance while maintaining/improving accuracy).
Conclusion: The learned policies generalize across unseen seasonal regimes, demonstrating promise for future data-driven long-term monitoring of dynamic plume environments.
Abstract: We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river as a representative use case. We propose an energy- and communication-efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents improving both mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.
cs.MM
[436] An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM
Jiawei Liu, Enis Berk Çoban, Zarina Schevchenko, Hao Tang, Zhigang Zhu, Michael I Mandel, Johanna Devaney
Main category: cs.MM
TL;DR: Interleaved instruction tuning in audio MLLMs improves semantic reasoning performance but reduces audio labeling ability.
Details
Motivation: Standard MLLM training concatenates non-textual information with text prompts, which may not encourage deep modality integration and limits reasoning capabilities.
Method: Used interleaved instruction tuning where audio tokens are interleaved within prompts, tested on LTU model with SHARD benchmark for audio-based semantic reasoning.
Result: Zero-shot interleaved prompting improves reasoning performance, and fine-tuning further enhances results but reduces audio labeling ability.
Conclusion: Interleaved instruction tuning benefits semantic reasoning in audio MLLMs but comes with trade-offs in audio labeling performance.
Abstract: Standard training for Multi-modal Large Language Models (MLLMs) involves concatenating non-textual information, like vision or audio, with a text prompt. This approach may not encourage deep integration of modalities, limiting the model’s ability to leverage the core language model’s reasoning capabilities. This work examines the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct an experiment using the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created benchmark for audio-based semantic reasoning focusing on synonym and hypernym recognition. Our findings show that even zero-shot interleaved prompting improves performance on our reasoning tasks, and that a small amount of fine-tuning using interleaved training prompts improves the results further, albeit at the expense of the MLLM’s audio labeling ability.
[437] Wireless Video Semantic Communication with Decoupled Diffusion Multi-frame Compensation
Bingyan Xie, Yongpeng Wu, Yuxuan Shi, Biqian Feng, Wenjun Zhang, Jihong Park, Tony Quek
Main category: cs.MM
TL;DR: Proposes WVSC-D, a wireless video semantic communication framework using decoupled diffusion multi-frame compensation for efficient video transmission by encoding videos at semantic level rather than pixel level.
Details
Motivation: Existing wireless video transmission schemes directly conduct video coding in pixel level while neglecting the inner semantics contained in videos, leading to inefficiency.
Method: Encodes video frames as semantic frames, uses reference semantic frame instead of motion vectors, and employs DDMFC with two-stage conditional diffusion process for frame compensation at receiver.
Result: Experimental results show WVSC-D outperforms other DL-based methods like DVSC by about 1.8 dB in PSNR, improving bandwidth efficiency while maintaining video transmission performance.
Conclusion: The semantic communication approach enables efficient wireless video transmission by focusing on semantic-level coding rather than pixel-level coding, with DDMFC effectively compensating frames to reduce communication overhead.
Abstract: Existing wireless video transmission schemes directly conduct video coding at the pixel level, while neglecting the inner semantics contained in videos. In this paper, we propose a wireless video semantic communication framework with decoupled diffusion multi-frame compensation (DDMFC), abbreviated as WVSC-D, which integrates the idea of semantic communication into wireless video transmission scenarios. WVSC-D first encodes original video frames as semantic frames and then conducts video coding based on such compact representations, enabling video coding at the semantic level rather than the pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to substitute for the per-frame motion vectors used in common video coding methods. At the receiver, DDMFC is proposed to generate the compensated current semantic frame by a two-stage conditional diffusion process. With both the reference frame transmission and DDMFC frame compensation, the bandwidth efficiency improves while video transmission performance remains satisfactory. Experimental results verify the performance gain of WVSC-D over other DL-based methods such as DVSC, by about 1.8 dB in terms of PSNR.
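To put the reported gain in perspective: PSNR is defined as 10·log10(MAX²/MSE), so a fixed gain in dB corresponds to a fixed ratio of mean squared errors. The following is a minimal sketch in Python of that relationship only. The function names and the 8-bit peak value of 255 are illustrative assumptions and are not taken from the paper.

```python
import math


def psnr(mse: float, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error.

    max_val = 255 assumes 8-bit pixel values (an assumption for illustration).
    """
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# A 1.8 dB PSNR gain corresponds to multiplying the MSE by 10^(-0.18).
ratio = mse_ratio_for_gain(1.8)
```

Under this definition, a gain of about 1.8 dB means the reconstruction MSE is roughly two thirds of the baseline's, independent of the peak value chosen.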
[438] MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
Jianxuan Yang, Xiaoran Yang, Lipan Zhang, Xinyue Guo, Zhao Wang, Gongping Huang
Main category: cs.MM
TL;DR: MultiSoundGen is a novel video-to-audio framework that addresses limitations in complex multi-event scenarios through SlowFast Contrastive AVP for semantic-temporal alignment and AVP-Ranked Preference Optimization for quality enhancement.
Details
Motivation: Current V2A methods struggle with complex multi-event scenarios due to poor semantic-temporal alignment and lack of quantitative preference optimization for audio quality.
Method: Proposes MultiSoundGen with two innovations: SlowFast Contrastive AVP (SF-CAVP) for aligning semantic representations and dynamic features, and AVP-Ranked Preference Optimization (AVP-RPO) using SF-CAVP as reward model.
Result: Achieves state-of-the-art performance in multi-event scenarios with comprehensive gains in distribution matching, audio quality, semantic alignment, and temporal synchronization.
Conclusion: MultiSoundGen successfully addresses core limitations in V2A generation for complex multi-event scenarios through innovative alignment and optimization techniques.
Abstract: Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information together with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality. As a result, it fails to enhance integrated generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations: the first is SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity; second, we integrate the DPO method into V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO). It uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. Demos are available at https://v2aresearch.github.io/MultiSoundGen/.
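For readers unfamiliar with direct preference optimization, the sketch below shows the standard DPO loss for a single preferred/rejected pair. It is not the paper's AVP-RPO objective, whose details are not given here. The function name, argument names, and the default `beta` are illustrative assumptions; in a setup like the one described, the preferred and rejected samples would presumably be chosen by ranking candidates with a reward model.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: log-likelihood of the preferred / rejected sample
    under the policy being trained; ref_* are the same quantities under
    the frozen reference model. beta = 0.1 is an illustrative default.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably
    # so that large |margin| does not overflow exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy raises the likelihood of the preferred sample relative to the rejected one.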
eess.AS
[439] Toward Objective and Interpretable Prosody Evaluation in Text-to-Speech: A Linguistically Motivated Approach
Cedric Chan, Jianjing Kuang
Main category: eess.AS
TL;DR: A linguistically informed framework for evaluating TTS prosody that combines quantitative linguistic criteria with acoustic analysis to provide objective, interpretable metrics that correlate with human perception.
Details
Motivation: Current TTS systems struggle with human-like prosodic variation, and traditional evaluation methods like MOS are resource-intensive, inconsistent, and lack diagnostic insights into why systems sound unnatural.
Method: Two-tier architecture mirroring human prosodic organization, using quantitative linguistic criteria to evaluate synthesized speech against human corpora across multiple acoustic dimensions, integrating discrete and continuous prosodic measures.
Result: Strong correlations with perceptual MOS ratings while revealing model-specific weaknesses that traditional perceptual tests cannot capture.
Conclusion: Provides a principled path for diagnosing, benchmarking, and improving prosodic naturalness in next-generation TTS systems.
Abstract: Prosody is essential for speech technology, shaping comprehension, naturalness, and expressiveness. However, current text-to-speech (TTS) systems still struggle to accurately capture human-like prosodic variation, in part because existing evaluation methods for prosody remain limited. Traditional metrics like Mean Opinion Score (MOS) are resource-intensive, inconsistent, and offer little insight into why a system sounds unnatural. This study introduces a linguistically informed, semi-automatic framework for evaluating TTS prosody through a two-tier architecture that mirrors human prosodic organization. The method uses quantitative linguistic criteria to evaluate synthesized speech against human speech corpora across multiple acoustic dimensions. By integrating discrete and continuous prosodic measures, it provides objective and interpretable metrics of both event placement and cue realization, while accounting for the natural variability observed across speakers and prosodic cues. Results show strong correlations with perceptual MOS ratings while revealing model-specific weaknesses that traditional perceptual tests alone cannot capture. This approach provides a principled path toward diagnosing, benchmarking, and ultimately improving the prosodic naturalness of next-generation TTS systems.
[440] From the perspective of perceptual speech quality: The robustness of frequency bands to noise
Junyi Fan, Donald S. Williamson
Main category: eess.AS
TL;DR: A MUSHRA-inspired study analyzed noise robustness of 32 frequency bands for speech quality, finding mid-frequency regions are less robust to noise.
Details
Motivation: Speech quality has not been thoroughly analyzed at the band-level like speech intelligibility has, creating a research gap in understanding how different frequency bands contribute to perceptual speech quality under noise.
Method: Used a MUSHRA-inspired approach to filter speech into 32 frequency bands, applied real-world noise at different SNRs, and calculated noise robustness indices based on human-rated perceptual quality scores of reconstructed noisy speech.
Result: The mid-frequency region appeared less robust to noise in terms of perceptual speech quality, showing different patterns from previous intelligibility studies.
Conclusion: Future research on improving speech quality should focus more attention on the mid-frequency region of speech signals, as this area shows particular vulnerability to noise degradation.
Abstract: Speech quality is one of the main foci of speech-related research, where it is frequently studied alongside speech intelligibility, another essential measurement. However, whereas band-level perceptual speech intelligibility has been studied frequently, band-level speech quality has not been thoroughly analyzed. In this paper, a Multiple Stimuli With Hidden Reference and Anchor (MUSHRA) inspired approach is proposed to study the individual robustness of frequency bands to noise, with perceptual speech quality as the measure. Speech signals were filtered into thirty-two frequency bands, with corrupting real-world noise applied at different signal-to-noise ratios. Robustness-to-noise indices of individual frequency bands were calculated based on the human-rated perceptual quality scores assigned to the reconstructed noisy speech signals. Trends in the results suggest that the mid-frequency region is less robust to noise in terms of perceptual speech quality. These findings suggest that future research aiming to improve speech quality should pay more attention to the mid-frequency region of speech signals.
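Applying noise "at different signal-to-noise ratios" usually means scaling the noise so that the ratio of speech power to noise power hits a target value in dB. The sketch below shows only that generic mixing step in plain Python. The band filtering, the noise recordings, and the exact procedure used in the paper are not specified here, and the function names are illustrative assumptions.

```python
import math


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db.

    Returns (noisy_signal, scaled_noise). Both inputs must have the
    same length, and the noise must not be silent.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech, p_noise = power(speech), power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # Solve p_speech / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled)]
    return noisy, scaled
```

In a band-wise study, the same step could be applied to one band at a time before the bands are recombined, although whether the paper does exactly this is an assumption.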
[441] Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision
Kaimeng Jia, Minzhu Tu, Zengrui Jin, Siyin Wang, Chao Zhang
Main category: eess.AS
TL;DR: Using human perceptual annotations from speech synthesis assessment as out-of-domain knowledge improves dysarthric speech assessment in self-supervised learning models.
Details
Motivation: Automatic dysarthria assessment provides scalable, cost-effective support for diagnosing and treating neurological conditions like Parkinson's, Alzheimer's, and stroke, but needs reliable supervision.
Method: Leveraging human perceptual annotations from speech synthesis assessment as out-of-domain knowledge for dysarthric speech assessment in self-supervised learning pre-trained models.
Result: Experimental results show consistent and substantial performance improvements in dysarthric speech assessment when using speech synthesis perceptual ratings as supervision.
Conclusion: Perceptual ratings aligned with human judgments from speech synthesis evaluations represent valuable resources for dysarthric speech modeling, enabling effective cross-domain knowledge transfer.
Abstract: Dysarthria is a speech disorder characterized by impaired intelligibility and reduced communicative effectiveness. Automatic dysarthria assessment provides a scalable, cost-effective approach for supporting the diagnosis and treatment of neurological conditions such as Parkinson’s disease, Alzheimer’s disease, and stroke. This study investigates leveraging human perceptual annotations from speech synthesis assessment as reliable out-of-domain knowledge for dysarthric speech assessment. Experimental results suggest that such supervision can yield consistent and substantial performance improvements in self-supervised learning pre-trained models. These findings suggest that perceptual ratings aligned with human judgments from speech synthesis evaluations represent valuable resources for dysarthric speech modeling, enabling effective cross-domain knowledge transfer.
[442] Multiplexing Neural Audio Watermarks
Zheqi Yuan, Yucheng Huang, Guangzhi Sun, Zengrui Jin, Chao Zhang
Main category: eess.AS
TL;DR: Proposes multiplexing neural audio watermarking techniques to improve robustness against various attacks, with PA-TFM achieving best results.
Details
Motivation: Existing audio watermarking methods are vulnerable to advanced dilution attacks like lossy compression and neural reconstruction.
Method: Investigates five multiplexing designs: parallel, sequential, frequency-division, time-division and perceptual adaptive time-frequency multiplexing (PA-TFM).
Result: PA-TFM achieves better performance than single watermarking baselines by clear margins on LibriSpeech data with 11 different attack methods.
Conclusion: Multiplexing provides a more robust way of using watermarks for audio authentication.
Abstract: Audio watermarking is a promising tool to ensure authenticity of speech content. However, existing watermarking methods remain vulnerable to more advanced dilution attacks such as lossy compression and neural reconstruction. In this paper, we propose to multiplex neural audio watermarking techniques to leverage their complementarity under different types of attacks. Specifically, five different multiplexing designs are investigated, including parallel, sequential, frequency-division, time-division and perceptual adaptive time-frequency multiplexing (PA-TFM). We evaluate our multiplexing technique on LibriSpeech data with 11 different attack methods, including 2 new neural reconstruction attacks featuring recent advancements in speech processing. As a result, the proposed PA-TFM as a training-free multiplexing method achieves better performance than single watermarking baselines by clear margins, showcasing a more robust way of using watermarks for audio.
[443] DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation
Chunxi Wang, Maoshen Jia, Wenyu Jin
Main category: eess.AS
TL;DR: DARAS is a deep learning framework for blind Room Impulse Response (RIR) estimation from monaural reverberant speech, outperforming existing methods through a novel architecture combining audio encoding, Mamba-based parameter estimation, feature fusion, and dynamic acoustic tuning.
Details
Motivation: Existing blind RIR estimation methods struggle with practical accuracy, limiting their effectiveness in speech enhancement, recognition, and AR/VR applications where accurate acoustic characterization is crucial.
Method: DARAS uses a four-stage approach: 1) deep audio encoder for feature extraction, 2) Mamba-based self-supervised parameter estimation module, 3) hybrid-path cross-attention feature fusion, and 4) dynamic acoustic tuning decoder that segments early reflections and late reverberation.
Result: Experimental results including MUSHRA-based subjective listening studies show DARAS substantially outperforms existing baseline models, providing robust blind RIR estimation for real-world acoustic environments.
Conclusion: DARAS offers an effective solution for practical blind RIR estimation, advancing the state-of-the-art through its novel deep learning architecture and achieving superior performance in real-world acoustic applications.
Abstract: Room Impulse Responses (RIRs) accurately characterize acoustic properties of indoor environments and play a crucial role in applications such as speech enhancement, speech recognition, and audio rendering in augmented reality (AR) and virtual reality (VR). Existing blind estimation methods struggle to achieve practical accuracy. To overcome this challenge, we propose the dynamic audio-room acoustic synthesis (DARAS) model, a novel deep learning framework that is explicitly designed for blind RIR estimation from monaural reverberant speech signals. First, a dedicated deep audio encoder effectively extracts relevant nonlinear latent space features. Second, the Mamba-based self-supervised blind room parameter estimation (MASS-BRPE) module, utilizing the efficient Mamba state space model (SSM), accurately estimates key room acoustic parameters and features. Third, the system incorporates a hybrid-path cross-attention feature fusion module, enhancing deep integration between audio and room acoustic features. Finally, our proposed dynamic acoustic tuning (DAT) decoder adaptively segments early reflections and late reverberation to improve the realism of synthesized RIRs. Experimental results, including a MUSHRA-based subjective listening study, demonstrate that DARAS substantially outperforms existing baseline models, providing a robust and effective solution for practical blind RIR estimation in real-world acoustic environments.
[444] Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction
Weijie Wu, Wenhao Guan, Kaidi Wang, Peijie Chen, Zhuanling Zha, Junbo Li, Jun Fang, Lin Li, Qingyang Hong
Main category: eess.AS
TL;DR: Phoenix-VAD is an LLM-based streaming semantic endpoint detection model that enables plug-and-play full-duplex prediction for spoken dialogue systems.
Details
Motivation: Current spoken dialogue models lack plug-and-play full-duplex prediction modules for semantic endpoint detection, hindering seamless audio interactions.
Method: Leverages LLM’s semantic comprehension capability with a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference.
Result: Achieves excellent and competitive performance on both semantically complete and incomplete speech scenarios.
Conclusion: Enables independent optimization of full-duplex prediction modules, providing more reliable and flexible support for next-generation human-computer interaction.
Abstract: Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.
eess.IV
[445] Opto-Electronic Convolutional Neural Network Design Via Direct Kernel Optimization
Ali Almuallem, Harshana Weligampola, Abhiram Gnanasambandam, Wei Xu, Dilshan Godaliyadda, Hamid R. Sheikh, Stanley H. Chan, Qi Guo
Main category: eess.IV
TL;DR: Two-stage design strategy for opto-electronic CNNs that separates optical front-end optimization from electronic back-end training, achieving better accuracy and efficiency than end-to-end approaches.
Details
Motivation: Conventional end-to-end optimization of opto-electronic neural networks faces challenges with costly simulations and large parameter spaces, limiting practical implementation.
Method: First train a standard electronic CNN, then optimize the optical front-end (metasurface array) by directly optimizing the first convolutional layer kernels, avoiding end-to-end joint training.
Result: Reduces computational and memory demands by hundreds of times, improves training stability, and achieves twice the accuracy of end-to-end training on monocular depth estimation under same constraints.
Conclusion: The two-stage design strategy provides a more efficient and effective approach for developing opto-electronic neural networks compared to conventional end-to-end optimization methods.
Abstract: Opto-electronic neural networks integrate optical front-ends with electronic back-ends to enable fast and energy-efficient vision. However, conventional end-to-end optimization of both the optical and electronic modules is limited by costly simulations and large parameter spaces. We introduce a two-stage strategy for designing opto-electronic convolutional neural networks (CNNs): first, train a standard electronic CNN, then realize the optical front-end implemented as a metasurface array through direct kernel optimization of its first convolutional layer. This approach reduces computational and memory demands by hundreds of times and improves training stability compared to end-to-end optimization. On monocular depth estimation, the proposed two-stage design achieves twice the accuracy of end-to-end training under the same training time and resource constraints.
[446] MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed, Mohamed Mabrok
Main category: eess.IV
TL;DR: MammoClean is a framework for standardizing mammography datasets to address data heterogeneity and bias, improving AI model generalizability.
Details
Motivation: Clinical AI systems for mammography face challenges due to data heterogeneity and dataset-specific biases that compromise model generalizability.
Method: MammoClean standardizes case selection, image processing (laterality and intensity correction), and unifies metadata into consistent multi-view structures.
Result: Application to three datasets revealed substantial distributional shifts in breast density and abnormality prevalence, with AI models trained on corrupted data showing significant performance degradation.
Conclusion: MammoClean enables construction of unified training corpora for robust AI models with superior cross-domain generalization, advancing equitable performance across diverse populations.
Abstract: The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: https://github.com/Minds-R-Lab/MammoClean.
[447] Resource-efficient Automatic Refinement of Segmentations via Weak Supervision from Light Feedback
Alix de Langlais, Benjamin Billot, Théo Aguilar Vidal, Marc-Olivier Gauci, Hervé Delingette
Main category: eess.IV
TL;DR: SCORE is a weakly supervised framework that refines medical image segmentation predictions using only light feedback during training, eliminating the need for dense annotations.
Details
Motivation: Manual medical image segmentation is labor-intensive and variable, while automated methods often lack clinical accuracy. Current refinement approaches require heavy user interactions or fully supervised training.
Method: SCORE introduces a novel loss function that leverages region-wise quality scores and over/under-segmentation error labels instead of dense annotations, enabling weakly supervised refinement.
Result: On humerus CT scans, SCORE significantly improves initial predictions from TotalSegmentator and achieves performance comparable to existing refinement methods while greatly reducing supervision requirements and annotation time.
Conclusion: SCORE provides an effective weakly supervised approach for medical image segmentation refinement that maintains performance while dramatically reducing annotation burden.
Abstract: Delineating anatomical regions is a key task in medical image analysis. Manual segmentation achieves high accuracy but is labor-intensive and prone to variability, thus prompting the development of automated approaches. Recently, a breadth of foundation models has enabled automated segmentations across diverse anatomies and imaging modalities, but these may not always meet the clinical accuracy standards. While segmentation refinement strategies can improve performance, current methods depend on heavy user interactions or require fully supervised segmentations for training. Here, we present SCORE (Segmentation COrrection from Regional Evaluations), a weakly supervised framework that learns to refine mask predictions only using light feedback during training. Specifically, instead of relying on dense training image annotations, SCORE introduces a novel loss that leverages region-wise quality scores and over/under-segmentation error labels. We demonstrate SCORE on humerus CT scans, where it considerably improves initial predictions from TotalSegmentator, and achieves performance on par with existing refinement methods, while greatly reducing their supervision requirements and annotation time. Our code is available at: https://gitlab.inria.fr/adelangl/SCORE.
[448] Diffusion Models are Robust Pretrainers
Mika Yagoda, Shady Abu-Hussein, Raja Giryes
Main category: eess.IV
TL;DR: Diffusion models provide low-cost adversarial robustness for image classification and object detection without full adversarial training, achieving meaningful robustness with minimal compute.
Details
Motivation: Adversarial attacks challenge standard models by perturbing inputs, and existing approaches require expensive adversarial training. This work explores using diffusion models as a cost-effective alternative for robust representations.
Method: Build models on top of off-the-shelf diffusion models, training lightweight heads on frozen diffusion features without full adversarial training.
Result: Diffusion-based classifiers and detectors achieve meaningful adversarial robustness on ImageNet, CIFAR-10, and PASCAL VOC with minimal compute, though clean and adversarial accuracies remain below state-of-the-art adversarially trained models.
Conclusion: Diffusion pretraining offers a favorable tradeoff between efficiency and robustness, opening a promising avenue for resource-constrained robust deployments.
Abstract: Diffusion models have gained significant attention for high-fidelity image generation. Our work investigates the potential of exploiting diffusion models for adversarial robustness in image classification and object detection. Adversarial attacks challenge standard models in these tasks by perturbing inputs to force incorrect predictions. To address this issue, many approaches use training schemes for forcing the robustness of the models, which increase training costs. In this work, we study models built on top of off-the-shelf diffusion models and demonstrate their practical significance: they provide a low-cost path to robust representations, allowing lightweight heads to be trained on frozen features without full adversarial training. Our empirical evaluations on ImageNet, CIFAR-10, and PASCAL VOC show that diffusion-based classifiers and detectors achieve meaningful adversarial robustness with minimal compute. While clean and adversarial accuracies remain below state-of-the-art adversarially trained CNNs or ViTs, diffusion pretraining offers a favorable tradeoff between efficiency and robustness. This work opens a promising avenue for integrating diffusion models into resource-constrained robust deployments.
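The idea of training a lightweight head on frozen features can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `frozen_features` is a hypothetical stand-in for a pretrained backbone (not a diffusion model), and the head is plain logistic regression. The sketch shows only that the backbone is never updated; it does not demonstrate adversarial robustness.

```python
import math


def frozen_features(x):
    """Hypothetical stand-in for a frozen pretrained backbone; never updated."""
    return [x, abs(x)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(inputs, labels, lr=0.5, steps=200):
    """Fit a logistic-regression head on frozen features by gradient descent."""
    feats = [frozen_features(x) for x in inputs]  # computed once, backbone untouched
    w, b = [0.0, 0.0], 0.0
    n = len(feats)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for f, y in zip(feats, labels):
            err = sigmoid(w[0] * f[0] + w[1] * f[1] + b) - y
            grad_w[0] += err * f[0] / n
            grad_w[1] += err * f[1] / n
            grad_b += err / n
        # Only the head parameters (w, b) change.
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    f = frozen_features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0


# Toy, linearly separable data (made up for the example).
inputs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 1, 1]
w, b = train_head(inputs, labels)
accuracy = sum(predict(x, w, b) == y for x, y in zip(inputs, labels)) / len(inputs)
```

Because only the two weights and the bias are trained, the cost of fitting the head is small compared with updating a full backbone, which is the efficiency argument the summary makes.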
[449] SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion
Yufei Tong, Guanjie Cheng, Peihan Wu, Yicheng Zhu, Kexu Lu, Feiyi Chen, Meng Xi, Junqin Huang, Xueqiang Yan, Junfan Wang, Shuiguang Deng
Main category: eess.IV
TL;DR: SatFusion is a unified framework that enhances satellite IoT images by fusing multi-temporal and multi-source data through temporal alignment, texture injection, and adaptive composition modules.
Details
Motivation: Existing methods fail to fully exploit complementary information in temporal and source dimensions: MISR has limited texture details, while pansharpening is sensitive to noise and misregistration.
Method: Three-stage framework: 1) MTIF for deep feature alignment with panchromatic images, 2) MSIF for texture injection from panchromatic data, 3) Fusion Composition module for adaptive integration and spectral consistency refinement.
Result: Extensive experiments on WorldStrat, WV3, QB, and GF2 datasets show significant improvements in fusion quality, robustness under challenging conditions, and generalizability to real-world Sat-IoT scenarios.
Conclusion: SatFusion effectively addresses limitations of existing methods by unifying multi-temporal and multi-source fusion, demonstrating superior performance and practical applicability in satellite IoT applications.
Abstract: With the rapid advancement of the digital society, the proliferation of satellites in the Satellite Internet of Things (Sat-IoT) has led to the continuous accumulation of large-scale multi-temporal and multi-source images across diverse application scenarios. However, existing methods fail to fully exploit the complementary information embedded in both temporal and source dimensions. For example, Multi-Image Super-Resolution (MISR) enhances reconstruction quality by leveraging temporal complementarity across multiple observations, yet the limited fine-grained texture details in input images constrain its performance. Conversely, pansharpening integrates multi-source images by injecting high-frequency spatial information from panchromatic data, but typically relies on pre-interpolated low-resolution inputs and assumes noise-free alignment, making it highly sensitive to noise and misregistration. To address these issues, we propose SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion. Specifically, SatFusion first employs a Multi-Temporal Image Fusion (MTIF) module to achieve deep feature alignment with the panchromatic image. Then, a Multi-Source Image Fusion (MSIF) module injects fine-grained texture information from the panchromatic data. Finally, a Fusion Composition module adaptively integrates the complementary advantages of both modalities while dynamically refining spectral consistency, supervised by a weighted combination of multiple loss functions. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that SatFusion significantly improves fusion quality, robustness under challenging conditions, and generalizability to real-world Sat-IoT scenarios. The code is available at: https://github.com/dllgyufei/SatFusion.git.
[450] Cross-modal Diffusion Modelling for Super-resolved Spatial Transcriptomics
Xiaofei Wang, Xingxu Huang, Stephen J. Price, Chao Li
Main category: eess.IV
TL;DR: A cross-modal conditional diffusion model that enhances spatial transcriptomics resolution by integrating histology images with gene expression data using multi-modal disentangling networks and dynamic cross-attention.
Details
Motivation: Current spatial transcriptomics platforms have low resolution, limiting understanding of spatial gene expression. Existing super-resolution methods suffer from restoration uncertainty and mode collapse when integrating histology images with gene expressions.
Method: Proposes a cross-modal conditional diffusion model with: 1) multi-modal disentangling network with cross-modal adaptive modulation, 2) dynamic cross-attention for hierarchical cell-to-tissue information extraction, and 3) co-expression-based gene-correlation graph network.
Result: Outperforms state-of-the-art methods in spatial transcriptomics super-resolution on three public datasets.
Conclusion: The proposed method effectively integrates histology images and gene expression to achieve superior super-resolution of spatial transcriptomics maps.
Abstract: The recent advancement of spatial transcriptomics (ST) allows spatial gene expression within tissue to be characterized for discovery research. However, current ST platforms suffer from low resolution, hindering in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, it remains a challenge to integrate histology images and gene expression for super-resolved ST maps. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps with the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationship of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.
[451] Real World Federated Learning with a Knowledge Distilled Transformer for Cardiac CT Imaging
Malte Tölle, Philipp Garthe, Clemens Scherer, Jan Moritz Seliger, Andreas Leha, Nina Krüger, Stefan Simm, Simon Martin, Sebastian Eble, Halvar Kelm, Moritz Bednorz, Florian André, Peter Bannas, Gerhard Diller, Norbert Frey, Stefan Groß, Anja Hennemuth, Lars Kaderali, Alexander Meyer, Eike Nagel, Stefan Orwat, Moritz Seiffert, Tim Friede, Tim Seidler, Sandy Engelhardt
Main category: eess.IV
TL;DR: A two-step semi-supervised federated learning approach that uses CNNs to predict on unlabeled data and then trains transformers with label-specific heads, achieving improved accuracy and generalizability in cardiac CT analysis across 8 hospitals.
Details
Motivation: To address challenges of partially labeled datasets in federated learning where only few locations have expert annotations, enabling utilization of large unlabeled data portions to enhance transformer performance with small diverse annotation sets.
Method: Two-step semi-supervised strategy: 1) CNNs predict on unlabeled data per label type, 2) transformer learns from these predictions using label-specific heads, enabling simultaneous learning of all partial labels across the federation.
Result: Conducted largest federated cardiac CT analysis (n=8,104) across 8 hospitals, achieving improved predictive accuracy and outperforming UNet-based models in generalizability on downstream tasks.
Conclusion: The approach successfully leverages unlabeled data in federated settings, enhances transformer architectures’ performance with limited annotations, and provides openly available code and model weights for future cardiac CT analysis.
Abstract: Federated learning is a renowned technique for utilizing decentralized data while preserving privacy. However, real-world applications often face challenges like partially labeled datasets, where only a few locations have certain expert annotations, leaving large portions of unlabeled data unused. Leveraging these could enhance transformer architectures' ability in regimes with small and diversely annotated sets. We conduct the largest federated cardiac CT analysis to date (n=8,104) in a real-world setting across eight hospitals. Our two-step semi-supervised strategy distills knowledge from task-specific CNNs into a transformer. First, CNNs predict on unlabeled data per label type and then the transformer learns from these predictions with label-specific heads. This improves predictive accuracy and enables simultaneous learning of all partial labels across the federation, and outperforms UNet-based models in generalizability on downstream tasks. Code and model weights are made openly available for leveraging future cardiac CT analysis.
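The two-step strategy described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the teacher callables, the label-type names `structure_a` and `structure_b`, the rule of preferring an expert annotation when one exists, and the squared-error loss are all assumptions made for the example.

```python
def build_targets(annotations, teachers, image):
    """One target per label type: expert label if present, else teacher prediction."""
    targets = {}
    for label_type, teacher in teachers.items():
        if label_type in annotations:
            targets[label_type] = annotations[label_type]
        else:
            targets[label_type] = teacher(image)  # pseudo-label from the task-specific teacher
    return targets


def multi_head_loss(student_outputs, targets):
    """Mean squared error per label-specific head, summed over heads."""
    total = 0.0
    for label_type, target in targets.items():
        pred = student_outputs[label_type]
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return total


# Hypothetical teachers: each maps a flat list of intensities to a binary mask.
teachers = {
    "structure_a": lambda img: [1.0 if v > 130 else 0.0 for v in img],
    "structure_b": lambda img: [1.0 if v < 60 else 0.0 for v in img],
}
image = [40, 150, 90, 200]
# Toy site that only has expert annotations for one label type.
annotations = {"structure_b": [1.0, 0.0, 0.0, 0.0]}

targets = build_targets(annotations, teachers, image)
# Made-up student outputs, one entry per label-specific head.
student_outputs = {
    "structure_a": [0.0, 1.0, 0.0, 0.5],
    "structure_b": [1.0, 0.0, 0.0, 0.0],
}
loss = multi_head_loss(student_outputs, targets)
```

Since every label type receives a target, either annotated or predicted, the student can be trained on all heads at once even when each site annotates only a subset.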
[452] Joint Lossless Compression and Steganography for Medical Images via Large Language Models
Pengcheng Zheng, Xiaorong Pu, Kecheng Chen, Jiaxin Huang, Meng Yang, Bai Feng, Yazhou Ren, Jianan Jiang, Chaoning Zhang, Yang Yang, Heng Tao Shen
Main category: eess.IV
TL;DR: A novel joint lossless compression and steganography framework for medical images that securely embeds privacy messages while maintaining high compression performance and efficiency.
Details
Motivation: Existing LLM-based compressors for medical images have unsatisfactory trade-offs between compression performance and efficiency, and overlook security aspects which are critical in medical scenarios.
Method: Uses bit plane slicing-inspired approach with adaptive modalities decomposition to partition images into global/local segments, implements dual-path lossless compression with segmented message steganography in local modality path, and employs anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning.
Result: Extensive experiments demonstrate superiority in compression ratios, efficiency, and security compared to existing methods.
Conclusion: The proposed framework effectively addresses the compression-security-efficiency trade-off in medical image processing and will be made publicly available.
Abstract: Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
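For readers unfamiliar with bit plane slicing, the sketch below shows the generic technique on a toy list of 8-bit pixel values, together with textbook least-significant-bit embedding. It is an illustration under stated assumptions only: the function names are hypothetical, and the paper's adaptive decomposition and segmented steganography algorithm are not reproduced here.

```python
def bit_planes(pixels, bits=8):
    """Split unsigned integers into `bits` binary planes, least significant first."""
    return [[(p >> k) & 1 for p in pixels] for k in range(bits)]


def merge_planes(planes):
    """Inverse of bit_planes: rebuild the integers from their binary planes."""
    n = len(planes[0])
    return [sum(planes[k][i] << k for k in range(len(planes))) for i in range(n)]


def embed_lsb(pixels, message_bits):
    """Overwrite the least significant bit of the first pixels with message bits."""
    if len(message_bits) > len(pixels):
        raise ValueError("message longer than cover")
    stego = list(pixels)
    for i, bit in enumerate(message_bits):
        stego[i] = (stego[i] & ~1) | bit
    return stego


def extract_lsb(pixels, n_bits):
    """Read the message back from the least significant bits."""
    return [p & 1 for p in pixels[:n_bits]]


pixels = [200, 13, 255, 0, 97, 128]          # toy 8-bit intensities
restored = merge_planes(bit_planes(pixels))  # slicing and merging is lossless

message = [1, 0, 1, 1]
stego = embed_lsb(pixels, message)
recovered = extract_lsb(stego, len(message))
max_change = max(abs(a - b) for a, b in zip(pixels, stego))
```

Changing only the lowest plane alters each pixel by at most one intensity level, which is why low bit planes are a common place to hide data invisibly.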
[453] Invited Paper: BitMedViT: Ternary-Quantized Vision Transformer for Medical AI Assistants on the Edge
Mikolaj Walczak, Uttej Kallakuri, Edward Humes, Xiaomin Lin, Tinoosh Mohsenin
Main category: eess.IV
TL;DR: BiTMedViT is an edge-optimized Vision Transformer for medical imaging that uses ternary quantization to achieve 43x model compression and 41x energy efficiency while maintaining 86% diagnostic accuracy on MedMNIST.
Details
Motivation: Vision Transformers show promise for medical imaging but are too computationally intensive for real-time deployment on resource-constrained mobile and wearable clinical devices.
Method: Uses ternary-quantized linear layers with multi-query attention, task-aware distillation from high-capacity teacher models, and custom CUDA kernel mapping for efficient deployment on Jetson Orin Nano.
Result: Achieves 86% diagnostic accuracy (vs 89% SOTA) on MedMNIST across 12 datasets, with 43x model size reduction, 39x memory traffic reduction, 16.8 ms inference time, and 41x energy efficiency at 183.62 GOPs/J.
Conclusion: BiTMedViT provides a practical route for deploying extreme-precision medical imaging ViTs on edge devices, bridging the gap between algorithmic advances and clinical deployment.
Abstract: Vision Transformers (ViTs) have demonstrated strong capabilities in interpreting complex medical imaging data. However, their significant computational and memory demands pose challenges for deployment in real-time, resource-constrained mobile and wearable devices used in clinical environments. We introduce BiTMedViT, a new class of Edge ViTs serving as medical AI assistants that perform structured analysis of medical images directly on the edge. BiTMedViT utilizes ternary-quantized linear layers tailored for medical imaging and combines a training procedure with multi-query attention, preserving stability under ternary weights with low-precision activations. Furthermore, BiTMedViT employs task-aware distillation from a high-capacity teacher to recover accuracy lost due to extreme quantization. Lastly, we also present a pipeline that maps the ternarized ViTs to a custom CUDA kernel for efficient memory bandwidth utilization and latency reduction on the Jetson Orin Nano. Finally, BiTMedViT achieves 86% diagnostic accuracy (89% SOTA) on MedMNIST across 12 datasets, while reducing model size by 43x, memory traffic by 39x, and enabling 16.8 ms inference at an energy efficiency up to 41x that of SOTA models at 183.62 GOPs/J on the Orin Nano. Our results demonstrate a practical and scientifically grounded route for extreme-precision medical imaging ViTs deployable on the edge, narrowing the gap between algorithmic advances and deployable clinical tools.
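Ternary quantization maps every weight to one of three values, -1, 0, or +1, plus a shared scale. The sketch below uses an absolute-mean scale with round-and-clip, which is one common scheme and is an assumption here; the summary does not state which quantizer BiTMedViT uses, and the function names are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, +1} with a shared absolute-mean scale (assumed scheme)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate the original weights from ternary codes and the scale."""
    return [v * scale for v in quantized]


# Toy weights, made up for the example.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.45]
quantized, scale = ternarize(weights)
approx = dequantize(quantized, scale)
```

A ternary value carries about log2(3) ≈ 1.58 bits of information, compared with 32 bits for a float32 weight, which is where the bulk of the storage saving in ternary networks comes from.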
[454] Deep Generative Models for Enhanced Vitreous OCT Imaging
Simone Sarrocco, Philippe C. Cattin, Peter M. Maloca, Paul Friedrich, Philippe Valmaggia
Main category: eess.IV
TL;DR: Deep learning models were evaluated for enhancing vitreous OCT image quality and reducing acquisition time, with cDDPM showing the best clinical performance despite U-Net achieving superior quantitative metrics.
Details
Motivation: To improve vitreous optical coherence tomography (OCT) image quality while reducing acquisition time, addressing the trade-off between image quality and scanning duration in clinical practice.
Method: Multiple DL models (cDDPM, BBDM, U-Net, Pix2Pix, VQ-GAN) were trained to generate high-quality SD vitreous OCT images from lower-quality inputs, with performance evaluated using image quality metrics and Visual Turing Tests by ophthalmologists.
Result: U-Net achieved best quantitative metrics (PSNR: 30.230, SSIM: 0.820), but cDDPM performed best in clinical evaluation with highest Visual Turing Test ranking (3.07), 32.9% fool rate, and 85.7% anatomical preservation. cDDPM also generated vitreous regions more similar to reference than true ART1/ART10 scans.
Conclusion: There are discrepancies between quantitative metrics and clinical evaluation, emphasizing the need for combined assessment. cDDPM demonstrates strong potential for generating clinically meaningful vitreous OCT images while reducing acquisition time fourfold.
Abstract: Purpose: To evaluate deep learning (DL) models for enhancing vitreous optical coherence tomography (OCT) image quality and reducing acquisition time. Methods: Conditional Denoising Diffusion Probabilistic Models (cDDPMs), Brownian Bridge Diffusion Models (BBDMs), U-Net, Pix2Pix, and Vector-Quantised Generative Adversarial Network (VQ-GAN) were used to generate high-quality spectral-domain (SD) vitreous OCT images. Inputs were SD ART10 images, and outputs were compared to pseudoART100 images obtained by averaging ten ART10 images per eye location. Model performance was assessed using image quality metrics and Visual Turing Tests, where ophthalmologists ranked generated images and evaluated anatomical fidelity. The best model’s performance was further tested within the manually segmented vitreous on newly acquired data. Results: U-Net achieved the highest Peak Signal-to-Noise Ratio (PSNR: 30.230) and Structural Similarity Index Measure (SSIM: 0.820), followed by cDDPM. For Learned Perceptual Image Patch Similarity (LPIPS), Pix2Pix (0.697) and cDDPM (0.753) performed best. In the first Visual Turing Test, cDDPM ranked highest (3.07); in the second (best model only), cDDPM achieved a 32.9% fool rate and 85.7% anatomical preservation. On newly acquired data, cDDPM generated vitreous regions more similar in PSNR to the ART100 reference than true ART1 or ART10 B-scans and achieved higher PSNR on whole images when conditioned on ART1 than ART10. Conclusions: Results reveal discrepancies between quantitative metrics and clinical evaluation, highlighting the need for combined assessment. cDDPM showed strong potential for generating clinically meaningful vitreous OCT images while reducing acquisition time fourfold. Translational Relevance: cDDPMs show promise for clinical integration, supporting faster, higher-quality vitreous imaging. Dataset and code will be made publicly available.
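Two operations mentioned in the abstract, averaging repeated scans into a pseudo-reference and scoring an image with PSNR, can be sketched as below. The pixel values are made-up toy data, images are flattened to lists, and `max_value=255.0` assumes 8-bit intensities; none of these details come from the paper.

```python
import math


def average_frames(frames):
    """Pixel-wise mean of equally sized frames (each a flat list of intensities)."""
    n = len(frames)
    return [sum(px) / n for px in zip(*frames)]


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: two noisy acquisitions of the same four-pixel signal.
clean = [100, 150, 200, 50]
frames = [
    [104, 146, 203, 47],
    [96, 154, 197, 53],
]
averaged = average_frames(frames)

single_psnr = psnr(clean, frames[0])   # one noisy frame against the clean signal
averaged_psnr = psnr(clean, averaged)  # the noise cancels exactly in this toy case
```

In this contrived case the two noise patterns are mirror images, so averaging recovers the clean signal exactly; with real scans averaging only reduces noise, and PSNR rewards pixel-wise agreement rather than perceived realism, which is consistent with the gap between metrics and clinician rankings reported above.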